Commit 1042c552 authored by John Pohl, committed by GitHub

fix: Support FastVideo example on current public version (#7431)

parent 708858a1
......@@ -15,12 +15,12 @@ This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo)
- **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
- **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
- **Optimized inference:** FP4 quantization and `torch.compile` are enabled by default for maximum throughput.
- **Optimized inference:** FP4 quantization and `torch.compile` are available via `--enable-optimizations`; attention backend selection is controlled separately via `--attention-backend`.
- **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming).
- **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.
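
A minimal Python sketch of consuming the response format described in the list above. It assumes the response body has already been fetched as text and has the `data[0].b64_json` shape; `save_video` is a hypothetical helper name used only for illustration, not part of FastVideo or Dynamo.

```python
import base64
import json


def save_video(response_text: str, output_path: str) -> int:
    """Decode data[0].b64_json from a response body and write the bytes to disk.

    Returns the number of bytes written. `save_video` is a hypothetical helper.
    """
    payload = json.loads(response_text)
    video_bytes = base64.b64decode(payload["data"][0]["b64_json"])
    with open(output_path, "wb") as f:
        f.write(video_bytes)
    return len(video_bytes)


# Stand-in response body for illustration; a real response carries a full MP4.
fake_response = json.dumps(
    {"data": [{"b64_json": base64.b64encode(b"not a real mp4").decode("ascii")}]}
)
print(save_video(fake_response, "output.mp4"))  # 14
```

Because the payload is non-streaming, the whole video is held in memory while decoding. For long clips a client may want to budget for that.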
> [!IMPORTANT]
> This example is optimized for **NVIDIA B200/B300** GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) by passing `--disable-optimizations` to `worker.py`, which disables FP4 quantization, `torch.compile`, and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility.
> `worker.py` defaults to `--attention-backend TORCH_SDPA` for broader compatibility across GPUs, including systems such as H100. For the B200/B300-oriented path, enable FP4/compile with `--enable-optimizations` and, if desired, opt into flash-attention explicitly with `--attention-backend FLASH_ATTN`.
## Docker Image Build
......@@ -31,12 +31,35 @@ The local Docker workflow builds a runtime image from the [`Dockerfile`](https:/
- Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support)
- Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source
The Dockerfile exposes `TORCH_CUDA_ARCH_LIST` as a build argument (default: `10.0 10.0a` for Blackwell). Pass `--build-arg` to target a different architecture:
```bash
# Blackwell (default)
docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="10.0 10.0a"
# Hopper
docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a"
```
`MAX_JOBS` (default: `4`) controls parallel compilation jobs for flash-attention. Lower it if the build runs out of memory:
```bash
docker build examples/diffusers/ --build-arg MAX_JOBS=2
```
When using Docker Compose, set these as environment variables before running `docker compose up --build`:
```bash
# Hopper on a memory-constrained builder
TORCH_CUDA_ARCH_LIST="9.0 9.0a" MAX_JOBS=2 COMPOSE_PROFILES=4 docker compose up --build
```
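
The Dockerfile in this change derives flash-attention's `FLASH_ATTN_CUDA_ARCHS` from `TORCH_CUDA_ARCH_LIST` with a `sed` expression (spaces become semicolons, dots are removed). The sketch below mirrors that translation in Python so you can check what a given build argument will turn into; `to_flash_attn_archs` is a hypothetical name used only for illustration.

```python
def to_flash_attn_archs(torch_cuda_arch_list: str) -> str:
    """Mirror the Dockerfile's sed 's/ /;/g; s/\\.//g' translation.

    PyTorch format:          "10.0 10.0a" (space-separated, with dots)
    flash-attention format:  "100;100a"   (semicolon-separated, no dots)
    """
    return torch_cuda_arch_list.replace(" ", ";").replace(".", "")


print(to_flash_attn_archs("10.0 10.0a"))  # 100;100a
print(to_flash_attn_archs("9.0 9.0a"))    # 90;90a
```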
> [!WARNING]
> The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if the Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM, so low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` (for example, `--build-arg MAX_JOBS=2`) to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
## Warmup Time
On first start, workers download model weights and run compile/warmup steps. Expect roughly **10–20 minutes** before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
On first start, workers download model weights. When `--enable-optimizations` is enabled, compile/warmup steps can push the first ready time to roughly **10–20 minutes** (hardware-dependent). After the first successful optimized response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
> [!TIP]
> When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.
......@@ -82,7 +105,7 @@ Environment variables:
| `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `NUM_GPUS` | `1` | Number of GPUs |
| `HTTP_PORT` | `8000` | Frontend HTTP port |
| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (e.g., `--disable-optimizations`) |
| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (for example, `--enable-optimizations --attention-backend FLASH_ATTN`) |
| `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` |
Example:
......@@ -91,12 +114,12 @@ Example:
MODEL=FastVideo/LTX2-Distilled-Diffusers \
NUM_GPUS=1 \
HTTP_PORT=8000 \
WORKER_EXTRA_ARGS="--disable-optimizations" \
WORKER_EXTRA_ARGS="--enable-optimizations --attention-backend FLASH_ATTN" \
./run_local.sh
```
> [!NOTE]
> `--disable-optimizations` is a `worker.py` flag (not a `dynamo.frontend` flag), so pass it through `WORKER_EXTRA_ARGS`.
> `--enable-optimizations` and `--attention-backend` are `worker.py` flags, not `dynamo.frontend` flags, so pass them through `WORKER_EXTRA_ARGS` when you want a non-default worker configuration.
The script writes logs to:
......@@ -214,7 +237,8 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
|---|---|---|
| `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `--num-gpus` | `1` | Number of GPUs for distributed inference |
| `--disable-optimizations` | off | Disables FP4 quantization, `torch.compile`, and switches attention from FLASH_ATTN to TORCH_SDPA |
| `--enable-optimizations` | off | Enables FP4 quantization and `torch.compile` |
| `--attention-backend` | `TORCH_SDPA` | Sets `FASTVIDEO_ATTENTION_BACKEND`; choices: `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, `SAGE_SLA_ATTN` |
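
A minimal sketch of how the two flags in the table above behave, using only `argparse`. This is not the worker's actual parser: in `worker.py` the choices are derived from FastVideo's `AttentionBackendEnum` (excluding `NO_ATTENTION`), whereas here they are hard-coded from the table as an assumption so the example runs without FastVideo installed.

```python
import argparse

# Assumption: hard-coded from the table above; worker.py derives these from
# FastVideo's AttentionBackendEnum and drops NO_ATTENTION.
ATTENTION_BACKEND_CHOICES = (
    "FLASH_ATTN",
    "TORCH_SDPA",
    "SAGE_ATTN",
    "SAGE_ATTN_THREE",
    "VIDEO_SPARSE_ATTN",
    "VMOBA_ATTN",
    "SLA_ATTN",
    "SAGE_SLA_ATTN",
)


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser with the two optimization-related flags."""
    parser = argparse.ArgumentParser(prog="worker.py")
    # Off unless the flag is passed explicitly.
    parser.add_argument("--enable-optimizations", action="store_true")
    # Values outside the choices tuple are rejected by argparse.
    parser.add_argument(
        "--attention-backend",
        choices=ATTENTION_BACKEND_CHOICES,
        default="TORCH_SDPA",
    )
    return parser


defaults = build_parser().parse_args([])
print(defaults.enable_optimizations, defaults.attention_backend)  # False TORCH_SDPA

tuned = build_parser().parse_args(
    ["--enable-optimizations", "--attention-backend", "FLASH_ATTN"]
)
print(tuned.enable_optimizations, tuned.attention_backend)  # True FLASH_ATTN
```

Keeping the attention backend separate from `--enable-optimizations` means FP4/compile and flash-attention can be toggled independently, which matches the defaults listed in the table.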
### Request Parameters (`nvext`)
......@@ -233,7 +257,7 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
|---|---|---|
| `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding |
| `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset |
| `FASTVIDEO_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend (`FLASH_ATTN` or `TORCH_SDPA`) |
| `FASTVIDEO_ATTENTION_BACKEND` | `TORCH_SDPA` | Attention backend; `worker.py` sets this from `--attention-backend` and validates `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, and `SAGE_SLA_ATTN` |
| `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs |
| `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging |
......@@ -241,10 +265,12 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
| Symptom | Cause | Fix |
|---|---|---|
| OOM during Docker build | `flash-attention` compilation uses too much RAM | Lower `MAX_JOBS` in the Dockerfile |
| 10–20 min wait on first start | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
| OOM during Docker build | `flash-attention` compilation uses too much RAM | Pass `--build-arg MAX_JOBS=2` (or lower) at build time |
| `no kernel image available for this GPU` or CUDA arch error at runtime | Image was built for a different GPU architecture | Rebuild with the correct `TORCH_CUDA_ARCH_LIST` (e.g. `9.0 9.0a` for Hopper) |
| 10–20 min wait on first start with optimizations enabled | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
| ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward |
| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass `--disable-optimizations` to `worker.py` |
| Lower throughput than expected on B200/B300 | FP4/compile and flash-attention are configured separately | Pass `--enable-optimizations` and, if desired, `--attention-backend FLASH_ATTN` |
| Startup or import failure after enabling optimizations or changing the attention backend | FP4 and some attention backends depend on specific hardware/software support | Re-run `worker.py` without `--enable-optimizations`, or use `--attention-backend TORCH_SDPA` |
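
The FP4-related rows above follow from how the worker gates FP4 on GPU compute capability. The sketch below is one reading of that logic, not the worker's code: it assumes that `--enable-optimizations` always turns on `torch.compile`, and that FP4 is applied only when the major compute capability is 10 or higher. `plan_optimizations` is a hypothetical name, and the capability tuple is passed in rather than queried via `torch.cuda.get_device_capability()` so the example runs without a GPU.

```python
def plan_optimizations(
    enable_optimizations: bool, capability: tuple[int, int]
) -> dict[str, bool]:
    """Return which optimizations would be applied (illustrative only).

    capability is a (major, minor) compute-capability pair, for example
    (10, 0) for B200-class GPUs or (9, 0) for Hopper.
    """
    if not enable_optimizations:
        # Default path: neither FP4 nor torch.compile.
        return {"fp4": False, "torch_compile": False}
    major, _minor = capability
    # Assumption: below compute capability 10.0 the worker logs a warning and
    # continues without FP4, while torch.compile stays enabled.
    return {"fp4": major >= 10, "torch_compile": True}


print(plan_optimizations(True, (9, 0)))   # {'fp4': False, 'torch_compile': True}
print(plan_optimizations(True, (10, 0)))  # {'fp4': True, 'torch_compile': True}
```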
## Source Code
......
......@@ -8,7 +8,7 @@ RUN apt-get update \
&& apt-get install -yq libucx0 python3-dev python3-pip python3-venv git protobuf-compiler curl ffmpeg libclang-dev \
&& apt-get clean
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
COPY --from=ghcr.io/astral-sh/uv:0.10.11 /uv /uvx /bin/
ENV UV_LINK_MODE=copy
RUN uv venv /opt/dynamo/venv --python 3.12 \
......@@ -20,25 +20,36 @@ RUN uv venv /opt/dynamo/venv --python 3.12 \
ENV VIRTUAL_ENV=/opt/dynamo/venv
ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"
# flash-attn compilation is memory-intensive. If the build OOMs, lower MAX_JOBS.
# The flash-attn install notes call this out for machines with <96GB RAM and many CPU cores.
RUN git clone https://github.com/RandNMR73/flash-attention \
# Override at build time to target a different GPU architecture, e.g.:
# docker build --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a" ...
ARG TORCH_CUDA_ARCH_LIST="10.0 10.0a"
# Lower MAX_JOBS if the build OOMs (machines with <96GB RAM and many CPU cores).
# docker build --build-arg MAX_JOBS=2 ...
ARG MAX_JOBS=4
# flash-attention ignores TORCH_CUDA_ARCH_LIST and uses its own FLASH_ATTN_CUDA_ARCHS variable.
# Translate from PyTorch format ("10.0 10.0a", space-separated with dots) to flash-attention
# format ("100;100a", semicolon-separated without dots).
RUN export FLASH_ATTN_CUDA_ARCHS=$(echo "${TORCH_CUDA_ARCH_LIST}" | sed 's/ /;/g; s/\.//g') \
&& echo "Building flash-attention for TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST} FLASH_ATTN_CUDA_ARCHS=${FLASH_ATTN_CUDA_ARCHS} MAX_JOBS=${MAX_JOBS}" \
&& git clone https://github.com/RandNMR73/flash-attention \
&& cd flash-attention \
&& git switch fa4-compile \
&& TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install . --no-build-isolation \
&& TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install ./flash_attn/cute \
&& uv pip install . --no-build-isolation \
&& uv pip install ./flash_attn/cute \
&& rm -rf ../flash-attention
# Install Dynamo with /v1/videos support.
RUN uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0#subdirectory=lib/bindings/python' \
&& uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0'
RUN uv pip install ai-dynamo==1.0.0
# Install FastVideo directly from the public upstream repository.
# Checkout with --recurse-submodules to get the required submodules as well.
RUN . /opt/dynamo/venv/bin/activate \
RUN echo "Building FastVideo for TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}" \
&& . /opt/dynamo/venv/bin/activate \
&& uv pip install setuptools_scm scikit-build-core cmake ninja \
&& git clone --recurse-submodules https://github.com/hao-ai-lab/FastVideo.git /tmp/FastVideo \
&& TORCH_CUDA_ARCH_LIST="10.0 10.0a" uv pip install --no-build-isolation /tmp/FastVideo
&& uv pip install --no-build-isolation /tmp/FastVideo
ENV FASTVIDEO_VIDEO_CODEC=libx264
ENV FASTVIDEO_X264_PRESET=ultrafast
......
......@@ -8,6 +8,9 @@ x-backend-base: &backend-base
build:
context: ..
dockerfile: Dockerfile
args:
TORCH_CUDA_ARCH_LIST: ${TORCH_CUDA_ARCH_LIST:-10.0 10.0a}
MAX_JOBS: ${MAX_JOBS:-4}
image: dynamo-fastvideo-diffusers:latest
restart: on-failure
command: python worker.py
......@@ -33,6 +36,9 @@ services:
build:
context: ..
dockerfile: Dockerfile
args:
TORCH_CUDA_ARCH_LIST: ${TORCH_CUDA_ARCH_LIST:-10.0 10.0a}
MAX_JOBS: ${MAX_JOBS:-4}
image: dynamo-fastvideo-diffusers:latest
restart: on-failure
command: >
......
......@@ -16,12 +16,17 @@ with different resolutions and quality settings without restarting.
One request at a time (asyncio.Lock — VideoGenerator is not re-entrant).
Usage:
python worker.py [--model MODEL] [--num-gpus N] [--disable-optimizations]
python worker.py [--model MODEL] [--num-gpus N] [--enable-optimizations]
[--attention-backend ATTENTION_BACKEND]
Options:
--model HuggingFace model path
(default: FastVideo/LTX2-Distilled-Diffusers)
--num-gpus Number of GPUs (default: 1)
--enable-optimizations
Enable FP4 quantization (if available) and torch.compile
--attention-backend
Attention backend (default: TORCH_SDPA)
Request format (sent to /v1/videos):
prompt: text description of the desired video
......@@ -46,10 +51,11 @@ import tempfile
import time
import uuid
import torch
import uvloop
from fastvideo import VideoGenerator
from fastvideo.configs.pipelines.base import PipelineConfig
from fastvideo.layers.quantization.fp4_config import FP4Config
from fastvideo.platforms.interface import AttentionBackendEnum
from pydantic import BaseModel, Field
from dynamo.llm import ModelInput, ModelType, register_llm # type: ignore[attr-defined]
......@@ -58,6 +64,14 @@ from dynamo.runtime import DistributedRuntime, dynamo_endpoint
logger = logging.getLogger(__name__)
DEFAULT_MODEL = "FastVideo/LTX2-Distilled-Diffusers"
DEFAULT_ATTENTION_BACKEND = "TORCH_SDPA"
# FastVideo exposes NO_ATTENTION in the enum, but it is not a selectable
# inference backend for this worker's FASTVIDEO_ATTENTION_BACKEND override.
ATTENTION_BACKEND_CHOICES = tuple(
backend_name
for backend_name in AttentionBackendEnum.__members__
if backend_name != "NO_ATTENTION"
)
# ── Request / Response models ─────────────────────────────────────────────────
......@@ -133,14 +147,14 @@ class FastVideoBackend:
def __init__(self, args: argparse.Namespace) -> None:
self.model_name: str = args.model
self.num_gpus: int = args.num_gpus
self.disable_optimizations: bool = args.disable_optimizations
self.enable_optimizations: bool = args.enable_optimizations
self.attention_backend: str = args.attention_backend
# One request at a time — VideoGenerator is not re-entrant
self._generate_lock = asyncio.Lock()
self.generator: VideoGenerator | None = None
attn_backend = "TORCH_SDPA" if self.disable_optimizations else "FLASH_ATTN"
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = attn_backend
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = self.attention_backend
os.environ["FASTVIDEO_STAGE_LOGGING"] = "1"
os.environ["FASTVIDEO_ENABLE_RMSNORM_FP4_PREQUANT"] = "0"
......@@ -150,33 +164,56 @@ class FastVideoBackend:
def _load():
pipeline_config = PipelineConfig.from_pretrained(self.model_name)
if not self.disable_optimizations:
optimization_kwargs = {}
if self.enable_optimizations:
major, minor = torch.cuda.get_device_capability()
if major < 10:
logger.warning(
"FP4 quantization is only supported on NVIDIA Blackwell GPUs (compute capability 10.0+). Detected compute capability: %d.%d. Continuing without FP4 optimizations.",
major,
minor,
)
else:
logger.info(
"Using FP4 quantization for VideoGenerator model=%s",
self.model_name,
)
try:
from fastvideo.layers.quantization.fp4_config import FP4Config
except ImportError as exc:
raise RuntimeError(
"FastVideo optimizations require "
"fastvideo.layers.quantization.fp4_config, but this "
"FastVideo build does not provide it. Re-run "
"worker.py without --enable-optimizations or install a "
"FastVideo version that includes fp4_config."
) from exc
pipeline_config.dit_config.quant_config = FP4Config()
return VideoGenerator.from_pretrained(
self.model_name,
num_gpus=self.num_gpus,
ltx2_refine_enabled=True,
ltx2_refine_lora_path="", # disable refine lora for distilled model
ltx2_refine_num_inference_steps=2,
ltx2_refine_guidance_scale=1.0,
ltx2_refine_add_noise=True,
pipeline_config=pipeline_config,
enable_torch_compile=not self.disable_optimizations,
enable_torch_compile_text_encoder=not self.disable_optimizations,
torch_compile_kwargs={
optimization_kwargs = {
"ltx2_refine_enabled": True,
"ltx2_refine_lora_path": "", # disable refine lora for distilled model
"ltx2_refine_num_inference_steps": 2,
"ltx2_refine_guidance_scale": 1.0,
"ltx2_refine_add_noise": True,
"enable_torch_compile": True,
"enable_torch_compile_text_encoder": True,
"torch_compile_kwargs": {
"backend": "inductor",
"fullgraph": True,
"mode": "max-autotune-no-cudagraphs",
},
dit_cpu_offload=False,
vae_cpu_offload=False,
text_encoder_cpu_offload=False,
ltx2_vae_tiling=False,
"dit_cpu_offload": False,
"vae_cpu_offload": False,
"text_encoder_cpu_offload": False,
"ltx2_vae_tiling": False,
}
return VideoGenerator.from_pretrained(
self.model_name,
num_gpus=self.num_gpus,
pipeline_config=pipeline_config,
**optimization_kwargs,
)
self.generator = await loop.run_in_executor(None, _load)
......@@ -402,10 +439,21 @@ def _parse_args() -> argparse.Namespace:
help="Number of GPUs (default: 1)",
)
parser.add_argument(
"--disable-optimizations",
"--enable-optimizations",
action="store_true",
dest="disable_optimizations",
help="Disable FP4 quantization, torch.compile, and use TORCH_SDPA attention",
dest="enable_optimizations",
help="Enable FP4 quantization (if available) and torch.compile",
)
parser.add_argument(
"--attention-backend",
choices=ATTENTION_BACKEND_CHOICES,
default=DEFAULT_ATTENTION_BACKEND,
dest="attention_backend",
help=(
"Attention backend to set via FASTVIDEO_ATTENTION_BACKEND "
f"(choices: {', '.join(ATTENTION_BACKEND_CHOICES)}; "
f"default: {DEFAULT_ATTENTION_BACKEND})"
),
)
return parser.parse_args()
......