<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

<Tip warning={true}>

🧪 This pipeline is for research purposes only.

</Tip>

# Text-to-video

[ModelScope Text-to-Video Technical Report](https://arxiv.org/abs/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang.

The abstract from the paper is:

*This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.*

You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense).

## Usage example

### `text-to-video-ms-1.7b`

Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to("cuda")

prompt = "Spiderman is surfing"
video_frames = pipe(prompt).frames[0]
video_path = export_to_video(video_frames)
video_path
```
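
The `export_to_video` utility writes the frames to a temporary `.mp4` file and returns its path. If you want to control where the video is written and at which frame rate, you can pass these explicitly. A minimal sketch, assuming your installed diffusers version exposes the `output_video_path` and `fps` arguments of `export_to_video` (the output filename is arbitrary):

```python
# Hedged sketch: "spiderman_surfing.mp4" is an arbitrary output path, and the fps
# argument is assumed to be available in your diffusers version (8 fps matches the
# default frame rate mentioned above).
video_path = export_to_video(video_frames, output_video_path="spiderman_surfing.mp4", fps=8)
video_path
```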

Diffusers supports different optimization techniques to improve the latency and memory footprint of a pipeline. Since videos are often more memory-heavy than images, we can enable CPU offloading and VAE slicing to keep memory usage in check.

Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.enable_model_cpu_offload()

# memory optimization
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=64).frames[0]
video_path = export_to_video(video_frames)
video_path
```

It takes just **7 GB of GPU memory** to generate the 64 video frames using PyTorch 2.0, `fp16` precision, and the techniques mentioned above.
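
If you want to verify the memory footprint on your own hardware, you can query PyTorch's CUDA allocator around the pipeline call. A minimal sketch using standard PyTorch APIs, reusing the `pipe` and `prompt` from the example above:

```python
import torch

# Reset the peak-memory counter, run the generation, then read the peak back.
torch.cuda.reset_peak_memory_stats()
video_frames = pipe(prompt, num_frames=64).frames[0]
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```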

We can also use a different scheduler easily, using the same method we'd use for Stable Diffusion:

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames[0]
video_path = export_to_video(video_frames)
video_path
```

Here are some sample outputs:

<table>
    <tr>
        <td><center>
        An astronaut riding a horse.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astr.gif"
            alt="An astronaut riding a horse."
            style="width: 300px;" />
        </center></td>
        <td ><center>
        Darth vader surfing in waves.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vader.gif"
            alt="Darth vader surfing in waves."
            style="width: 300px;" />
        </center></td>
    </tr>
</table>

### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`

Zeroscope models are watermark-free and have been trained on specific resolutions such as `576x320` and `1024x576`.
One should first generate a video with the lower-resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) using [`TextToVideoSDPipeline`],
which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).


```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
from PIL import Image

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames[0]
video_path = export_to_video(video_frames)
video_path
```

Now the video can be upscaled:

```py
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

# strength controls how much the upscaler may deviate from the input frames (lower = closer to the input video)
video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
video_path = export_to_video(video_frames)
video_path
```

Here are some sample outputs:

<table>
    <tr>
        <td ><center>
        Darth vader surfing in waves.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/darthvader_cerpense.gif"
            alt="Darth vader surfing in waves."
            style="width: 576px;" />
        </center></td>
    </tr>
</table>

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines; a short sketch of that pattern follows below.

</Tip>
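
For example, if you wanted to run both text-to-video and video-to-video generation with the *same* checkpoint, you could reuse the already loaded components instead of loading the weights twice. A minimal sketch of this pattern (illustrative only; the Zeroscope workflow above deliberately uses two different checkpoints and therefore loads each one separately):

```py
import torch
from diffusers import TextToVideoSDPipeline, VideoToVideoSDPipeline

# Load the text-to-video pipeline once...
text_pipe = TextToVideoSDPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)

# ...and build a video-to-video pipeline from the same components without
# allocating a second copy of the weights.
video_pipe = VideoToVideoSDPipeline(**text_pipe.components)
```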

## TextToVideoSDPipeline
[[autodoc]] TextToVideoSDPipeline
	- all
	- __call__

## VideoToVideoSDPipeline
[[autodoc]] VideoToVideoSDPipeline
	- all
	- __call__

## TextToVideoSDPipelineOutput
[[autodoc]] pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput