<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Stable Video Diffusion

[[open-in-colab]]

[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high resolution (576x1024) videos conditioned on an input image.

This guide will show you how to use SVD to generate short videos from images.

Before you begin, make sure you have the following libraries installed:

```py
# Uncomment to install the necessary libraries in Colab
!pip install -q -U diffusers transformers accelerate
```

There are two variants of this model, [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further finetuned to generate 25 frames.
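
Both variants are loaded with the same `StableVideoDiffusionPipeline` API. As a quick sketch, the 14-frame SVD checkpoint can be loaded like this (drop `variant="fp16"` if no fp16 weights are available for the checkpoint you choose):

```python
import torch
from diffusers import StableVideoDiffusionPipeline

# Sketch: load the 14-frame SVD checkpoint instead of SVD-XT
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16, variant="fp16"
)
```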

You'll use the SVD-XT checkpoint for this guide.

```python
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"source image of a rocket"</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"generated video from source image"</figcaption>
  </div>
</div>

## torch.compile

You can gain a 20-25% speedup at the expense of slightly increased memory by [compiling](../optimization/torch2.0#torchcompile) the UNet.

```diff
- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
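
For reference, a minimal end-to-end sketch with the compiled UNet might look like this, assuming a CUDA GPU with enough memory to keep the whole fp16 pipeline resident. The first call includes the compilation overhead, so the speedup only shows up on subsequent calls:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

# The first call triggers compilation and is slow; later calls benefit from the speedup
frames = pipe(image, decode_chunk_size=8, generator=torch.manual_seed(42)).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```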

## Reduce memory usage

Video generation is very memory intensive because you're essentially generating `num_frames` all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade off inference speed for lower memory usage:

- enable model offloading: each component of the pipeline is offloaded to the CPU once it's not needed anymore.
- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size.
- reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory (we recommend adjusting this value based on your GPU memory) but the video might have some flickering.

```diff
- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
```

Using all these tricks together should lower the memory requirement to less than 8GB VRAM.
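
Putting these options together, a minimal low-memory sketch might look like the following (the exact VRAM usage depends on your GPU and setup):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()      # offload components to the CPU when they're idle
pipe.unet.enable_forward_chunking()  # run the feed-forward layers in a loop

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

# decode_chunk_size=1 decodes one frame at a time for the lowest memory use
frames = pipe(image, decode_chunk_size=1, generator=torch.manual_seed(42), num_frames=25).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```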

## Micro-conditioning

In addition to the conditioning image, Stable Video Diffusion also accepts micro-conditioning, which allows more control over the generated video:

- `fps`: the frames per second of the generated video.
- `motion_bucket_id`: the motion bucket id to use for the generated video. This controls the amount of motion in the generated video; a higher motion bucket id increases the motion.
- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.

For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:

```python
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
  "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket_with_conditions.gif)
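
The `fps` micro-conditioning can be passed in the same way. A minimal sketch, reusing the pipeline and image from the example above; keeping the `fps` conditioning and the export frame rate the same avoids a mismatch between the generated motion and the playback speed:

```python
# Sketch: condition the generation on an fps value and export at the same rate
frames = pipe(
    image,
    decode_chunk_size=8,
    generator=torch.manual_seed(42),
    fps=7,
    motion_bucket_id=180,
    noise_aug_strength=0.1,
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```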