Unverified commit 1042c552, authored by John Pohl and committed by GitHub

fix: Support FastVideo example on current public version (#7431)

parent 708858a1
...@@ -15,12 +15,12 @@ This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo) ...@@ -15,12 +15,12 @@ This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo)
- **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5. - **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
- **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture. - **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
- **Optimized inference:** FP4 quantization and `torch.compile` are enabled by default for maximum throughput. - **Optimized inference:** FP4 quantization and `torch.compile` are available via `--enable-optimizations`; attention backend selection is controlled separately via `--attention-backend`.
- **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming). - **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming).
- **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers. - **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.
> [!IMPORTANT] > [!IMPORTANT]
> This example is optimized for **NVIDIA B200/B300** GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) by passing `--disable-optimizations` to `worker.py`, which disables FP4 quantization, `torch.compile`, and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility. > `worker.py` defaults to `--attention-backend TORCH_SDPA` for broader compatibility across GPUs, including systems such as H100. For the B200/B300-oriented path, enable FP4/compile with `--enable-optimizations` and, if desired, opt into flash-attention explicitly with `--attention-backend FLASH_ATTN`.
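
The response format described above (one complete MP4 per request, base64-encoded in `data[0].b64_json`) can also be decoded without `jq`. The following is a minimal sketch, assuming the response body was saved to a JSON file; the function names `extract_mp4_bytes` and `save_video` are illustrative and not part of the example's code.

```python
import base64
import json
from pathlib import Path


def extract_mp4_bytes(response: dict) -> bytes:
    """Return the decoded MP4 bytes from a /v1/videos response body."""
    # The README states the payload lives at data[0].b64_json.
    return base64.b64decode(response["data"][0]["b64_json"])


def save_video(response_path: str, output_path: str) -> int:
    """Decode a saved JSON response into an MP4 file; return bytes written."""
    response = json.loads(Path(response_path).read_text())
    return Path(output_path).write_bytes(extract_mp4_bytes(response))
```

Calling `save_video("response.json", "output.mp4")` would mirror the `jq ... | base64 -D` pipeline shown later in the README, without depending on the platform-specific `base64` flag.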
## Docker Image Build ## Docker Image Build
...@@ -31,12 +31,35 @@ The local Docker workflow builds a runtime image from the [`Dockerfile`](https:/ ...@@ -31,12 +31,35 @@ The local Docker workflow builds a runtime image from the [`Dockerfile`](https:/
- Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support) - Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support)
- Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source - Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source
The Dockerfile exposes `TORCH_CUDA_ARCH_LIST` as a build argument (default: `10.0 10.0a` for Blackwell). Pass `--build-arg` to target a different architecture:
```bash
# Blackwell (default)
docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="10.0 10.0a"
# Hopper
docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a"
```
`MAX_JOBS` (default: `4`) controls parallel compilation jobs for flash-attention. Lower it if the build runs out of memory:
```bash
docker build examples/diffusers/ --build-arg MAX_JOBS=2
```
When using Docker Compose, set these as environment variables before running `docker compose up --build`:
```bash
# Hopper on a memory-constrained builder
TORCH_CUDA_ARCH_LIST="9.0 9.0a" MAX_JOBS=2 COMPOSE_PROFILES=4 docker compose up --build
```
> [!WARNING] > [!WARNING]
> The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` in the Dockerfile to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores. > The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` in the Dockerfile to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
## Warmup Time ## Warmup Time
On first start, workers download model weights and run compile/warmup steps. Expect roughly **10–20 minutes** before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward. On first start, workers download model weights. When `--enable-optimizations` is enabled, compile/warmup steps can push the first ready time to roughly **10–20 minutes** (hardware-dependent). After the first successful optimized response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
> [!TIP] > [!TIP]
> When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts. > When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.
...@@ -82,7 +105,7 @@ Environment variables: ...@@ -82,7 +105,7 @@ Environment variables:
| `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path | | `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `NUM_GPUS` | `1` | Number of GPUs | | `NUM_GPUS` | `1` | Number of GPUs |
| `HTTP_PORT` | `8000` | Frontend HTTP port | | `HTTP_PORT` | `8000` | Frontend HTTP port |
| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (e.g., `--disable-optimizations`) | | `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (for example, `--enable-optimizations --attention-backend FLASH_ATTN`) |
| `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` | | `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` |
Example: Example:
...@@ -91,12 +114,12 @@ Example: ...@@ -91,12 +114,12 @@ Example:
MODEL=FastVideo/LTX2-Distilled-Diffusers \ MODEL=FastVideo/LTX2-Distilled-Diffusers \
NUM_GPUS=1 \ NUM_GPUS=1 \
HTTP_PORT=8000 \ HTTP_PORT=8000 \
WORKER_EXTRA_ARGS="--disable-optimizations" \ WORKER_EXTRA_ARGS="--enable-optimizations --attention-backend FLASH_ATTN" \
./run_local.sh ./run_local.sh
``` ```
> [!NOTE] > [!NOTE]
> `--disable-optimizations` is a `worker.py` flag (not a `dynamo.frontend` flag), so pass it through `WORKER_EXTRA_ARGS`. > `--enable-optimizations` and `--attention-backend` are `worker.py` flags, not `dynamo.frontend` flags, so pass them through `WORKER_EXTRA_ARGS` when you want a non-default worker configuration.
The script writes logs to: The script writes logs to:
...@@ -214,7 +237,8 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4 ...@@ -214,7 +237,8 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
|---|---|---| |---|---|---|
| `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path | | `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `--num-gpus` | `1` | Number of GPUs for distributed inference | | `--num-gpus` | `1` | Number of GPUs for distributed inference |
| `--disable-optimizations` | off | Disables FP4 quantization, `torch.compile`, and switches attention from FLASH_ATTN to TORCH_SDPA | | `--enable-optimizations` | off | Enables FP4 quantization and `torch.compile` |
| `--attention-backend` | `TORCH_SDPA` | Sets `FASTVIDEO_ATTENTION_BACKEND`; choices: `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, `SAGE_SLA_ATTN` |
### Request Parameters (`nvext`) ### Request Parameters (`nvext`)
...@@ -233,7 +257,7 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4 ...@@ -233,7 +257,7 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
|---|---|---| |---|---|---|
| `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding | | `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding |
| `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset | | `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset |
| `FASTVIDEO_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend (`FLASH_ATTN` or `TORCH_SDPA`) | | `FASTVIDEO_ATTENTION_BACKEND` | `TORCH_SDPA` | Attention backend; `worker.py` sets this from `--attention-backend` and validates `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, and `SAGE_SLA_ATTN` |
| `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs | | `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs |
| `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging | | `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging |
...@@ -241,10 +265,12 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4 ...@@ -241,10 +265,12 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
| Symptom | Cause | Fix | | Symptom | Cause | Fix |
|---|---|---| |---|---|---|
| OOM during Docker build | `flash-attention` compilation uses too much RAM | Lower `MAX_JOBS` in the Dockerfile | | OOM during Docker build | `flash-attention` compilation uses too much RAM | Pass `--build-arg MAX_JOBS=2` (or lower) at build time |
| 10–20 min wait on first start | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached | | `no kernel image available for this GPU` or CUDA arch error at runtime | Image was built for a different GPU architecture | Rebuild with the correct `TORCH_CUDA_ARCH_LIST` (e.g. `9.0 9.0a` for Hopper) |
| 10–20 min wait on first start with optimizations enabled | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
| ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward | | ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward |
| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass `--disable-optimizations` to `worker.py` | | Lower throughput than expected on B200/B300 | FP4/compile and flash-attention are configured separately | Pass `--enable-optimizations` and, if desired, `--attention-backend FLASH_ATTN` |
| Startup or import failure after enabling optimizations or changing the attention backend | FP4 and some attention backends depend on specific hardware/software support | Re-run `worker.py` without `--enable-optimizations`, or use `--attention-backend TORCH_SDPA` |
## Source Code ## Source Code
......
...@@ -8,7 +8,7 @@ RUN apt-get update \ ...@@ -8,7 +8,7 @@ RUN apt-get update \
&& apt-get install -yq libucx0 python3-dev python3-pip python3-venv git protobuf-compiler curl ffmpeg libclang-dev \ && apt-get install -yq libucx0 python3-dev python3-pip python3-venv git protobuf-compiler curl ffmpeg libclang-dev \
&& apt-get clean && apt-get clean
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ COPY --from=ghcr.io/astral-sh/uv:0.10.11 /uv /uvx /bin/
ENV UV_LINK_MODE=copy ENV UV_LINK_MODE=copy
RUN uv venv /opt/dynamo/venv --python 3.12 \ RUN uv venv /opt/dynamo/venv --python 3.12 \
...@@ -20,25 +20,36 @@ RUN uv venv /opt/dynamo/venv --python 3.12 \ ...@@ -20,25 +20,36 @@ RUN uv venv /opt/dynamo/venv --python 3.12 \
ENV VIRTUAL_ENV=/opt/dynamo/venv ENV VIRTUAL_ENV=/opt/dynamo/venv
ENV PATH="${VIRTUAL_ENV}/bin:${PATH}" ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"
# flash-attn compilation is memory-intensive. If the build OOMs, lower MAX_JOBS. # Override at build time to target a different GPU architecture, e.g.:
# The flash-attn install notes call this out for machines with <96GB RAM and many CPU cores. # docker build --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a" ...
RUN git clone https://github.com/RandNMR73/flash-attention \ ARG TORCH_CUDA_ARCH_LIST="10.0 10.0a"
# Lower MAX_JOBS if the build OOMs (machines with <96GB RAM and many CPU cores).
# docker build --build-arg MAX_JOBS=2 ...
ARG MAX_JOBS=4
# flash-attention ignores TORCH_CUDA_ARCH_LIST and uses its own FLASH_ATTN_CUDA_ARCHS variable.
# Translate from PyTorch format ("10.0 10.0a", space-separated with dots) to flash-attention
# format ("100;100a", semicolon-separated without dots).
RUN export FLASH_ATTN_CUDA_ARCHS=$(echo "${TORCH_CUDA_ARCH_LIST}" | sed 's/ /;/g; s/\.//g') \
&& echo "Building flash-attention for TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST} FLASH_ATTN_CUDA_ARCHS=${FLASH_ATTN_CUDA_ARCHS} MAX_JOBS=${MAX_JOBS}" \
&& git clone https://github.com/RandNMR73/flash-attention \
&& cd flash-attention \ && cd flash-attention \
&& git switch fa4-compile \ && git switch fa4-compile \
&& TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install . --no-build-isolation \ && uv pip install . --no-build-isolation \
&& TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install ./flash_attn/cute \ && uv pip install ./flash_attn/cute \
&& rm -rf ../flash-attention && rm -rf ../flash-attention
# Install Dynamo with /v1/videos support. # Install Dynamo with /v1/videos support.
RUN uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0#subdirectory=lib/bindings/python' \ RUN uv pip install ai-dynamo==1.0.0
&& uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0'
# Install FastVideo directly from the public upstream repository. # Install FastVideo directly from the public upstream repository.
# Checkout with --recurse-submodules to get the required submodules as well. # Checkout with --recurse-submodules to get the required submodules as well.
RUN . /opt/dynamo/venv/bin/activate \ RUN echo "Building FastVideo for TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}" \
&& . /opt/dynamo/venv/bin/activate \
&& uv pip install setuptools_scm scikit-build-core cmake ninja \ && uv pip install setuptools_scm scikit-build-core cmake ninja \
&& git clone --recurse-submodules https://github.com/hao-ai-lab/FastVideo.git /tmp/FastVideo \ && git clone --recurse-submodules https://github.com/hao-ai-lab/FastVideo.git /tmp/FastVideo \
&& TORCH_CUDA_ARCH_LIST="10.0 10.0a" uv pip install --no-build-isolation /tmp/FastVideo && uv pip install --no-build-isolation /tmp/FastVideo
ENV FASTVIDEO_VIDEO_CODEC=libx264 ENV FASTVIDEO_VIDEO_CODEC=libx264
ENV FASTVIDEO_X264_PRESET=ultrafast ENV FASTVIDEO_X264_PRESET=ultrafast
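
The `sed` expression in the Dockerfile hunk above converts the PyTorch-style architecture list into the flash-attention format. As a rough illustration of the same two substitutions, here is a Python sketch; `to_flash_attn_archs` is a hypothetical helper that does not exist in the repository.

```python
def to_flash_attn_archs(torch_cuda_arch_list: str) -> str:
    """Convert a PyTorch-style arch list to the flash-attention style.

    Mirrors the Dockerfile's sed expression 's/ /;/g; s/\\.//g':
    spaces become semicolons and dots are removed.
    """
    return torch_cuda_arch_list.replace(" ", ";").replace(".", "")


if __name__ == "__main__":
    # Default Blackwell value from the Dockerfile's ARG.
    print(to_flash_attn_archs("10.0 10.0a"))
```

Like the `sed` version, this sketch assumes single spaces between entries and does no validation of the architecture strings.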
......
...@@ -8,6 +8,9 @@ x-backend-base: &backend-base ...@@ -8,6 +8,9 @@ x-backend-base: &backend-base
build: build:
context: .. context: ..
dockerfile: Dockerfile dockerfile: Dockerfile
args:
TORCH_CUDA_ARCH_LIST: ${TORCH_CUDA_ARCH_LIST:-10.0 10.0a}
MAX_JOBS: ${MAX_JOBS:-4}
image: dynamo-fastvideo-diffusers:latest image: dynamo-fastvideo-diffusers:latest
restart: on-failure restart: on-failure
command: python worker.py command: python worker.py
...@@ -33,6 +36,9 @@ services: ...@@ -33,6 +36,9 @@ services:
build: build:
context: .. context: ..
dockerfile: Dockerfile dockerfile: Dockerfile
args:
TORCH_CUDA_ARCH_LIST: ${TORCH_CUDA_ARCH_LIST:-10.0 10.0a}
MAX_JOBS: ${MAX_JOBS:-4}
image: dynamo-fastvideo-diffusers:latest image: dynamo-fastvideo-diffusers:latest
restart: on-failure restart: on-failure
command: > command: >
......
...@@ -16,12 +16,17 @@ with different resolutions and quality settings without restarting. ...@@ -16,12 +16,17 @@ with different resolutions and quality settings without restarting.
One request at a time (asyncio.Lock — VideoGenerator is not re-entrant). One request at a time (asyncio.Lock — VideoGenerator is not re-entrant).
Usage: Usage:
python worker.py [--model MODEL] [--num-gpus N] [--disable-optimizations] python worker.py [--model MODEL] [--num-gpus N] [--enable-optimizations]
[--attention-backend ATTENTION_BACKEND]
Options: Options:
--model HuggingFace model path --model HuggingFace model path
(default: FastVideo/LTX2-Distilled-Diffusers) (default: FastVideo/LTX2-Distilled-Diffusers)
--num-gpus Number of GPUs (default: 1) --num-gpus Number of GPUs (default: 1)
--enable-optimizations
Enable FP4 quantization (if available) and torch.compile
--attention-backend
Attention backend (default: TORCH_SDPA)
Request format (sent to /v1/videos): Request format (sent to /v1/videos):
prompt: text description of the desired video prompt: text description of the desired video
...@@ -46,10 +51,11 @@ import tempfile ...@@ -46,10 +51,11 @@ import tempfile
import time import time
import uuid import uuid
import torch
import uvloop import uvloop
from fastvideo import VideoGenerator from fastvideo import VideoGenerator
from fastvideo.configs.pipelines.base import PipelineConfig from fastvideo.configs.pipelines.base import PipelineConfig
from fastvideo.layers.quantization.fp4_config import FP4Config from fastvideo.platforms.interface import AttentionBackendEnum
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from dynamo.llm import ModelInput, ModelType, register_llm # type: ignore[attr-defined] from dynamo.llm import ModelInput, ModelType, register_llm # type: ignore[attr-defined]
...@@ -58,6 +64,14 @@ from dynamo.runtime import DistributedRuntime, dynamo_endpoint ...@@ -58,6 +64,14 @@ from dynamo.runtime import DistributedRuntime, dynamo_endpoint
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
DEFAULT_MODEL = "FastVideo/LTX2-Distilled-Diffusers" DEFAULT_MODEL = "FastVideo/LTX2-Distilled-Diffusers"
DEFAULT_ATTENTION_BACKEND = "TORCH_SDPA"
# FastVideo exposes NO_ATTENTION in the enum, but it is not a selectable
# inference backend for this worker's FASTVIDEO_ATTENTION_BACKEND override.
ATTENTION_BACKEND_CHOICES = tuple(
backend_name
for backend_name in AttentionBackendEnum.__members__
if backend_name != "NO_ATTENTION"
)
# ── Request / Response models ───────────────────────────────────────────────── # ── Request / Response models ─────────────────────────────────────────────────
...@@ -133,14 +147,14 @@ class FastVideoBackend: ...@@ -133,14 +147,14 @@ class FastVideoBackend:
def __init__(self, args: argparse.Namespace) -> None: def __init__(self, args: argparse.Namespace) -> None:
self.model_name: str = args.model self.model_name: str = args.model
self.num_gpus: int = args.num_gpus self.num_gpus: int = args.num_gpus
self.disable_optimizations: bool = args.disable_optimizations self.enable_optimizations: bool = args.enable_optimizations
self.attention_backend: str = args.attention_backend
# One request at a time — VideoGenerator is not re-entrant # One request at a time — VideoGenerator is not re-entrant
self._generate_lock = asyncio.Lock() self._generate_lock = asyncio.Lock()
self.generator: VideoGenerator | None = None self.generator: VideoGenerator | None = None
attn_backend = "TORCH_SDPA" if self.disable_optimizations else "FLASH_ATTN" os.environ["FASTVIDEO_ATTENTION_BACKEND"] = self.attention_backend
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = attn_backend
os.environ["FASTVIDEO_STAGE_LOGGING"] = "1" os.environ["FASTVIDEO_STAGE_LOGGING"] = "1"
os.environ["FASTVIDEO_ENABLE_RMSNORM_FP4_PREQUANT"] = "0" os.environ["FASTVIDEO_ENABLE_RMSNORM_FP4_PREQUANT"] = "0"
...@@ -150,33 +164,56 @@ class FastVideoBackend: ...@@ -150,33 +164,56 @@ class FastVideoBackend:
def _load(): def _load():
pipeline_config = PipelineConfig.from_pretrained(self.model_name) pipeline_config = PipelineConfig.from_pretrained(self.model_name)
if not self.disable_optimizations: optimization_kwargs = {}
logger.info( if self.enable_optimizations:
"Using FP4 quantization for VideoGenerator model=%s", major, minor = torch.cuda.get_device_capability()
self.model_name, if major < 10:
) logger.warning(
pipeline_config.dit_config.quant_config = FP4Config() "FP4 quantization is only supported on NVIDIA Blackwell GPUs (compute capability 10.0+). Detected compute capability: %d.%d. Continuing without FP4 optimizations.",
major,
minor,
)
else:
logger.info(
"Using FP4 quantization for VideoGenerator model=%s",
self.model_name,
)
try:
from fastvideo.layers.quantization.fp4_config import FP4Config
except ImportError as exc:
raise RuntimeError(
"FastVideo optimizations require "
"fastvideo.layers.quantization.fp4_config, but this "
"FastVideo build does not provide it. Re-run "
"worker.py without --enable-optimizations or install a "
"FastVideo version that includes fp4_config."
) from exc
pipeline_config.dit_config.quant_config = FP4Config()
optimization_kwargs = {
"ltx2_refine_enabled": True,
"ltx2_refine_lora_path": "", # disable refine lora for distilled model
"ltx2_refine_num_inference_steps": 2,
"ltx2_refine_guidance_scale": 1.0,
"ltx2_refine_add_noise": True,
"enable_torch_compile": True,
"enable_torch_compile_text_encoder": True,
"torch_compile_kwargs": {
"backend": "inductor",
"fullgraph": True,
"mode": "max-autotune-no-cudagraphs",
},
"dit_cpu_offload": False,
"vae_cpu_offload": False,
"text_encoder_cpu_offload": False,
"ltx2_vae_tiling": False,
}
return VideoGenerator.from_pretrained( return VideoGenerator.from_pretrained(
self.model_name, self.model_name,
num_gpus=self.num_gpus, num_gpus=self.num_gpus,
ltx2_refine_enabled=True,
ltx2_refine_lora_path="", # disable refine lora for distilled model
ltx2_refine_num_inference_steps=2,
ltx2_refine_guidance_scale=1.0,
ltx2_refine_add_noise=True,
pipeline_config=pipeline_config, pipeline_config=pipeline_config,
enable_torch_compile=not self.disable_optimizations, **optimization_kwargs,
enable_torch_compile_text_encoder=not self.disable_optimizations,
torch_compile_kwargs={
"backend": "inductor",
"fullgraph": True,
"mode": "max-autotune-no-cudagraphs",
},
dit_cpu_offload=False,
vae_cpu_offload=False,
text_encoder_cpu_offload=False,
ltx2_vae_tiling=False,
) )
self.generator = await loop.run_in_executor(None, _load) self.generator = await loop.run_in_executor(None, _load)
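
The hunk above only applies FP4 when `--enable-optimizations` is set and the GPU reports compute capability 10.0 or higher. That decision can be isolated as a pure function for testing. This is a sketch under that reading of the diff; `should_enable_fp4` is a hypothetical name, and it covers only the FP4 decision, not the `torch.compile` settings.

```python
def should_enable_fp4(enable_optimizations: bool, capability: tuple[int, int]) -> bool:
    """Decide whether FP4 quantization should be applied.

    capability is a (major, minor) pair, the shape returned by
    torch.cuda.get_device_capability(). FP4 is treated as Blackwell-only
    (major >= 10), matching the warning message in the worker.
    """
    major, _minor = capability
    return enable_optimizations and major >= 10
```

In the worker, the tuple would come from `torch.cuda.get_device_capability()`. Keeping the check free of CUDA calls means it can be unit-tested on a machine without a GPU.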
...@@ -402,10 +439,21 @@ def _parse_args() -> argparse.Namespace: ...@@ -402,10 +439,21 @@ def _parse_args() -> argparse.Namespace:
help="Number of GPUs (default: 1)", help="Number of GPUs (default: 1)",
) )
parser.add_argument( parser.add_argument(
"--disable-optimizations", "--enable-optimizations",
action="store_true", action="store_true",
dest="disable_optimizations", dest="enable_optimizations",
help="Disable FP4 quantization, torch.compile, and use TORCH_SDPA attention", help="Enable FP4 quantization (if available) and torch.compile",
)
parser.add_argument(
"--attention-backend",
choices=ATTENTION_BACKEND_CHOICES,
default=DEFAULT_ATTENTION_BACKEND,
dest="attention_backend",
help=(
"Attention backend to set via FASTVIDEO_ATTENTION_BACKEND "
f"(choices: {', '.join(ATTENTION_BACKEND_CHOICES)}; "
f"default: {DEFAULT_ATTENTION_BACKEND})"
),
) )
return parser.parse_args() return parser.parse_args()
......
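
The new argument handling derives the `--attention-backend` choices from an enum and filters out `NO_ATTENTION`. The self-contained sketch below reproduces that pattern. It assumes a stand-in enum, `AttentionBackend`, with an abbreviated member list, because importing FastVideo's `AttentionBackendEnum` would require the package to be installed.

```python
import argparse
from enum import Enum, auto


class AttentionBackend(Enum):
    # Hypothetical stand-in for fastvideo's AttentionBackendEnum (members abbreviated).
    FLASH_ATTN = auto()
    TORCH_SDPA = auto()
    SAGE_ATTN = auto()
    NO_ATTENTION = auto()


DEFAULT_BACKEND = "TORCH_SDPA"
# Same filter as the worker: every enum member name except NO_ATTENTION.
CHOICES = tuple(name for name in AttentionBackend.__members__ if name != "NO_ATTENTION")


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Optimizations are opt-in, so the flag defaults to False.
    parser.add_argument("--enable-optimizations", action="store_true")
    parser.add_argument("--attention-backend", choices=CHOICES, default=DEFAULT_BACKEND)
    return parser.parse_args(argv)
```

With this shape, `parse_args([])` yields the compatibility-oriented defaults, and an unsupported backend name is rejected by `argparse` before any model loading starts.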