<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

<div style="float: right;">
  <div class="flex flex-wrap space-x-1">
    <a href="https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference" target="_blank" rel="noopener">
      <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
    </a>
  </div>
</div>

# HunyuanVideo

[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. The model uses a "dual-stream to single-stream" architecture: video and text tokens are first processed independently in separate transformer blocks, and are then concatenated and passed through shared blocks to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the text encoder because it provides better image-text alignment, richer image detail description and reasoning, and can act as a zero-shot learner when system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to process video data more efficiently at the original resolution and frame rate.
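
In Diffusers, the transformer and the 3D causal VAE are available as standalone models. The snippet below is a minimal sketch of loading them on their own with `AutoModel`; the subfolder names follow the standard Diffusers pipeline layout of the [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) checkpoints used throughout this page.

```py
import torch
from diffusers import AutoModel

# dual-stream to single-stream diffusion transformer
transformer = AutoModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)

# 3D causal VAE
vae = AutoModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="vae", torch_dtype=torch.float16
)

# the loaded components can be passed to the pipeline, for example
# HunyuanVideoPipeline.from_pretrained(..., transformer=transformer, vae=vae)
print(transformer.__class__.__name__, vae.__class__.__name__)
```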

You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
>
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.

The example below demonstrates how to generate a video optimized for memory or inference speed.

<hfoptions id="usage">
<hfoption id="memory">

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

The quantized HunyuanVideo model below requires ~14GB of VRAM.

```py
import torch
from diffusers import AutoModel, HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to 4-bit (nf4) with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
      "load_in_4bit": True,
      "bnb_4bit_quant_type": "nf4",
      "bnb_4bit_compute_dtype": torch.bfloat16
      },
    components_to_quantize="transformer"
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
<hfoption id="inference speed">

[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster.

```py
import torch
from diffusers import AutoModel, HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to 4-bit (nf4) with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
      "load_in_4bit": True,
      "bnb_4bit_quant_type": "nf4",
      "bnb_4bit_compute_dtype": torch.bfloat16
      },
    components_to_quantize="transformer"
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# torch.compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
</hfoptions>

## Notes

- HunyuanVideo supports LoRAs with [`~loaders.HunyuanVideoLoraLoaderMixin.load_lora_weights`].

  <details>
  <summary>Show example code</summary>

  ```py
  import torch
  from diffusers import AutoModel, HunyuanVideoPipeline
  from diffusers.quantizers import PipelineQuantizationConfig
  from diffusers.utils import export_to_video

  # quantize weights to 4-bit (nf4) with bitsandbytes
  pipeline_quant_config = PipelineQuantizationConfig(
      quant_backend="bitsandbytes_4bit",
      quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
        },
      components_to_quantize="transformer"
  )

  pipeline = HunyuanVideoPipeline.from_pretrained(
      "hunyuanvideo-community/HunyuanVideo",
      quantization_config=pipeline_quant_config,
      torch_dtype=torch.bfloat16,
  )

  # load LoRA weights
  pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie")
  pipeline.set_adapters("steamboat-willie", 0.9)

  # model-offloading and tiling
  pipeline.enable_model_cpu_offload()
  pipeline.vae.enable_tiling()

  # use "In the style of SWR" to trigger the LoRA
  prompt = """
  In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.
  """
  video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
  export_to_video(video, "output.mp4", fps=15)
  ```

  </details>

- Refer to the table below for recommended inference values. These values are applied in the sketch after this list.

  | parameter | recommended value |
  |---|---|
  | text encoder dtype | `torch.float16` |
  | transformer dtype | `torch.bfloat16` |
  | vae dtype | `torch.float16` |
  | `num_frames` | 4 * `k` + 1, where `k` is an integer (for example, `k=15` gives 61 frames) |

- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution videos.
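
The sketch below is one way to apply these recommendations together. It is a minimal example rather than a reference implementation: it assumes the `hunyuanvideo-community/HunyuanVideo` checkpoint layout and that the pipeline's flow-match scheduler accepts a `shift` argument, as [`FlowMatchEulerDiscreteScheduler`] does.

```py
import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16
)

# recommended dtypes from the table above: bfloat16 transformer, float16 text encoder and VAE
pipeline.text_encoder.to(torch.float16)
pipeline.vae.to(torch.float16)

# raise `shift` for a higher resolution video (assumes the scheduler exposes a `shift` option)
pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipeline.scheduler.config, shift=7.0
)

pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
# num_frames follows the 4 * k + 1 rule (k=15 -> 61 frames)
video = pipeline(
    prompt=prompt, height=720, width=1280, num_frames=61, num_inference_steps=30
).frames[0]
export_to_video(video, "output.mp4", fps=15)
```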

## HunyuanVideoPipeline

[[autodoc]] HunyuanVideoPipeline
  - all
  - __call__

## HunyuanVideoPipelineOutput

[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput