"vscode:/vscode.git/clone" did not exist on "4c26cb9cc83b0ad0d750f7b4ac337e949cefedd7"
text-img2vid.md 18.6 KB
Newer Older
Aryan's avatar
Aryan committed
1
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Video generation

Video generation models extend image generation (which can be considered a one-frame video) to also process data across space and time. Keeping all of this data - text, space, and time - consistent and aligned from frame to frame is a major challenge in generating long, high-resolution videos.

Modern video models tackle this challenge with the diffusion transformer (DiT) architecture, which reduces computational costs and scales more efficiently to larger and higher-quality image and video data.

Check out what some of these video models are capable of below.

<hfoptions id="popular models">
<hfoption id="Wan2.1">

```py
# pip install ftfy
import torch
import numpy as np
from diffusers import AutoModel, WanPipeline
from diffusers.hooks.group_offloading import apply_group_offloading
from diffusers.utils import export_to_video, load_image
from transformers import UMT5EncoderModel

text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16)
vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32)
transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)

# group-offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
apply_group_offloading(text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=4
)
transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True
)

pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    vae=vae,
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

prompt = """
The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic 
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, 
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, 
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```

</hfoption>
<hfoption id="HunyuanVideo">

```py
import torch
from diffusers import AutoModel, HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
  quant_backend="bitsandbytes_4bit",
  quant_kwargs={
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": torch.bfloat16
    },
  components_to_quantize=["transformer"]
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
<hfoption id="LTX-Video">

```py
import torch
from diffusers import LTXPipeline, AutoModel
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video

# fp8 layerwise weight-casting
transformer = AutoModel.from_pretrained(
    "Lightricks/LTX-Video",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

pipeline = LTXPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, torch_dtype=torch.bfloat16)

# group-offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)
apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")

prompt = """
A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage
"""
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```

</hfoption>
</hfoptions>

This guide covers video generation basics such as which parameters to configure and how to reduce memory usage.

> [!TIP]
> If you're interested in learning more about how to use a specific model, refer to its pipeline API reference.

## Pipeline parameters

There are several parameters to configure in the pipeline that affect video generation quality or speed. Experimenting with different parameter values is important for discovering the right tradeoff between quality and speed.

### num_frames

A frame is a still image that is played in a sequence of other frames to create motion or a video. Control the total number of generated frames with `num_frames`. Increasing `num_frames` improves perceived motion smoothness and visual coherence, which is especially important for videos with dynamic content. A higher `num_frames` value also increases video duration.

Some video models require specific `num_frames` values for inference. For example, [`HunyuanVideoPipeline`] recommends setting `num_frames` to `(4 * k) + 1`, such as 61 or 129. Always check a pipeline's API reference to see if there is a recommended value.
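
It can be worth checking this rule before launching a long generation run. The helper below is a minimal sketch (the function name is illustrative and not part of Diffusers) that validates a candidate `num_frames` against the `(4 * k) + 1` constraint and estimates the clip length for a given `fps`.

```py
# Minimal sketch: check a candidate num_frames against the (4 * k) + 1 rule
# used by some pipelines (e.g. HunyuanVideo) and estimate the clip duration.
# The helper name is illustrative and not part of the Diffusers API.
def check_num_frames(num_frames: int, fps: int) -> None:
    if (num_frames - 1) % 4 != 0:
        # suggest the nearest value that satisfies (4 * k) + 1
        suggestion = 4 * round((num_frames - 1) / 4) + 1
        print(f"{num_frames} does not satisfy (4 * k) + 1, try {suggestion} instead")
    else:
        print(f"{num_frames} frames at {fps} fps is ~{num_frames / fps:.1f}s of video")

check_num_frames(61, fps=15)  # valid: 61 = (4 * 15) + 1
check_num_frames(60, fps=15)  # invalid, suggests 61
```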

```py
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

prompt = """
A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman 
with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The 
camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and 
natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be 
real-life footage
"""

negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```

### guidance_scale

Guidance scale or "cfg" controls how closely the generated frames adhere to the input conditioning (text, image, or both). Increasing `guidance_scale` generates frames that follow the input conditions more closely and include finer details, but it risks introducing artifacts and reducing output diversity. Lower `guidance_scale` values encourage looser prompt adherence and more output variety, but details may be less refined. If the value is too low, the model may ignore your prompt entirely and generate random noise.

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained(
  "THUDM/CogVideoX-2b",
  torch_dtype=torch.float16
).to("cuda")

prompt = """
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over
a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, 
with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an 
oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at 
a playful environment. The scene captures the innocence and imagination of childhood, 
with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""

video = pipeline(
  prompt=prompt,
  guidance_scale=6,
  num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

### negative_prompt

A negative prompt is useful for excluding things you don't want to see in the generated video. It is commonly used to refine the quality and alignment of the generated video by pushing the model away from undesirable elements like "blurry, distorted, ugly". This can create cleaner and more focused videos.

```py
# pip install ftfy
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
pipeline = WanPipeline.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", vae=vae, torch_dtype=torch.bfloat16
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(
  pipeline.scheduler.config, flow_shift=5.0
)
pipeline.to("cuda")

pipeline.load_lora_weights("benjamin-paine/steamboat-willie-14b", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie")

pipeline.enable_model_cpu_offload()

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts 
dynamic shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
  prompt=prompt,
  negative_prompt=negative_prompt,
  num_frames=81,
  guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```

## Reduce memory usage

Recent video models like [`HunyuanVideoPipeline`] and [`WanPipeline`] have 10B+ parameters and require a lot of memory, often more than is available on consumer hardware. Diffusers offers several techniques for reducing the memory requirements of these large models.

> [!TIP]
> Refer to the [Reduce memory usage](../optimization/memory) guide for more details about other memory saving techniques.

One of these techniques is [group-offloading](../optimization/memory#group-offloading), which offloads groups of internal model layers (such as `torch.nn.Sequential`) to the CPU when they aren't being used. Layers are only loaded onto the GPU when they're needed for computation, which avoids storing **all** the model components on the GPU at once. For a 14B parameter model like [`WanPipeline`], group-offloading can lower the required memory to ~13GB of VRAM.

```py
# pip install ftfy
import torch
import numpy as np
from diffusers import AutoModel, WanPipeline
from diffusers.hooks.group_offloading import apply_group_offloading
from diffusers.utils import export_to_video, load_image
from transformers import UMT5EncoderModel

text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16)
vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32)
transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)

# group-offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
apply_group_offloading(text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=4
)
transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True
)

pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    vae=vae,
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

prompt = """
The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic 
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, 
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, 
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```

Another option for reducing memory is to quantize a model, which stores the model weights in a lower precision data type. However, quantization may impact video quality depending on the specific video model. Refer to the quantization [Overview](../quantization/overview) to learn more about the different supported quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize a model.

```py
# pip install ftfy

import torch
from diffusers import AutoModel, WanPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from transformers import UMT5EncoderModel
from diffusers.utils import export_to_video

# quantize transformer and text encoder weights with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
  quant_backend="bitsandbytes_4bit",
  quant_kwargs={"load_in_4bit": True},
  components_to_quantize=["transformer", "text_encoder"]
)

vae = AutoModel.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
pipeline = WanPipeline.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", vae=vae, quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(
  pipeline.scheduler.config, flow_shift=5.0
)
pipeline.to("cuda")

pipeline.load_lora_weights("benjamin-paine/steamboat-willie-14b", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie")

pipeline.enable_model_cpu_offload()

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts 
dynamic shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

output = pipeline(
  prompt=prompt,
  num_frames=81,
  guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```

## Inference speed

[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial_.html) can speed up inference by using optimized kernels. Compilation takes longer the first time, but once compiled, it is much faster. It is best to compile the pipeline once and then use it multiple times without changing anything. A change, such as in the image size, triggers recompilation.

The example below compiles the transformer in the pipeline and uses the `"max-autotune"` mode to maximize performance.

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained(
  "THUDM/CogVideoX-2b",
  torch_dtype=torch.float16
).to("cuda")

# torch.compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = """
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. 
The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. 
Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, 
with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""

video = pipeline(
  prompt=prompt,
  guidance_scale=6,
  num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```