Following the successful open-sourcing of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we proudly present [HunyuanVideo-I2V](https://github.com/Tencent/HunyuanVideo-I2V), a new image-to-video generation framework to accelerate open-source community exploration!
This repo contains the official PyTorch model definitions, pre-trained weights, and inference/sampling code. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com). We have also released LoRA training code for customizable special effects, which can be used to create more interesting video effects.
> [**HunyuanVideo: A Systematic Framework For Large Video Generative Models**](https://arxiv.org/abs/2412.03603) <br>
## 🔥🔥🔥 News!!
* Mar 13, 2025: 🚀 We release the parallel inference code for HunyuanVideo-I2V powered by [xDiT](https://github.com/xdit-project/xDiT).
* Mar 07, 2025: 🔥 We have fixed the bug in our open-source version that caused ID changes. Please try the new model weights of [HunyuanVideo-I2V](https://huggingface.co/tencent/HunyuanVideo-I2V) to ensure full visual consistency in the first frame and produce higher quality videos.
* Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).
## 📑 Open-source Plan
- HunyuanVideo-I2V (Image-to-Video Model)
- [x] Inference
- [x] Checkpoints
- [x] ComfyUI
- [x] LoRA training scripts
- [x] Multi-GPU sequence parallel inference (faster inference speed on more GPUs)
- [Training data construction](#training-data-construction)
- [Training](#training)
- [Inference](#inference)
- [🚀 Parallel Inference on Multiple GPUs by xDiT](#-parallel-inference-on-multiple-gpus-by-xdit)
- [Using Command Line](#using-command-line-1)
- [🔗 BibTeX](#-bibtex)
- [Acknowledgements](#acknowledgements)
---
## **HunyuanVideo-I2V Overall Architecture**
Leveraging the advanced video generation capabilities of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we have extended its application to image-to-video generation tasks. To achieve this, we employ a token replace technique to effectively reconstruct and incorporate reference image information into the video generation process.
Since we utilize a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only architecture as the text encoder, we can significantly enhance the model's ability to comprehend the semantic content of the input image and to seamlessly integrate information from both the image and its associated caption. Specifically, the input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data.
The overall architecture of our system is designed to maximize the synergy between image and text modalities, ensuring a robust and coherent generation of video content from static images. This integration not only improves the fidelity of the generated videos but also enhances the model's ability to interpret and utilize complex multimodal inputs. The overall architecture is as follows.
### Installation Guide for Linux

We recommend CUDA 12.4 or 11.8 for the manual installation.

Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).

```shell
# 1. Create conda environment
conda create -n HunyuanVideo-I2V python==3.11.9

# 2. Activate the environment
conda activate HunyuanVideo-I2V

# 3. Install PyTorch and other dependencies using conda (example command; adjust pytorch-cuda to your CUDA version and pin versions per the repo's requirements)
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
```

To download the HunyuanVideo-I2V model, first install the huggingface-cli (detailed instructions are available [here](https://huggingface.co/docs/huggingface_hub/guides/cli)):

```shell
python -m pip install "huggingface_hub[cli]"
```

The details of downloading the pretrained models are shown [here](ckpts/README.md).
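For reference, a typical download command looks like the following sketch; follow ckpts/README.md for the exact target layout:

```shell
# Download the HunyuanVideo-I2V weights into the local ckpts directory
huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./ckpts
```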
## 🔑 Single-GPU Inference
Similar to [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), HunyuanVideo-I2V supports high-resolution video generation, with resolution up to 720P and video length up to 129 frames (5 seconds).
### Tips for Using Image-to-Video Models
- **Use Concise Prompts**: To effectively guide the model's generation, keep your prompts short and to the point.
- **Include Key Elements**: A well-structured prompt should cover:
  - **Main Subject**: Specify the primary focus of the video.
  - **Action**: Describe the main movement or activity taking place.
  - **Background (Optional)**: Set the scene for the video.
  - **Camera Angle (Optional)**: Indicate the perspective or viewpoint.
- **Avoid Overly Detailed Prompts**: Lengthy or highly detailed prompts can lead to unnecessary transitions in the video output.
<!-- **For image-to-video models, we recommend using concise prompts to guide the model's generation process. A good prompt should include elements such as background, main subject, action, and camera angle. Overly long or excessively detailed prompts may introduce unnecessary transitions.** -->
We list some useful configurations for easy usage:

| Argument | Default | Description |
|:--------:|:-------:|:------------|
| `--prompt` | None | The text prompt for video generation. |
| `--model` | HYVideo-T/2-cfgdistill | Use HYVideo-T/2 for I2V; HYVideo-T/2-cfgdistill is used for T2V mode. |
| `--i2v-mode` | False | Whether to enable image-to-video (i2v) mode. |
| `--i2v-image-path` | ./assets/demo/i2v/imgs/0.jpg | The reference image for video generation. |
| `--i2v-resolution` | 720p | The resolution of the generated video. |
| `--i2v-stability` | False | Whether to use stable mode for i2v inference. |
| `--video-length` | 129 | The length (in frames) of the generated video. |
| `--infer-steps` | 50 | The number of sampling steps. |
| `--flow-shift` | 7.0 | Shift factor for flow matching schedulers. We recommend 7 with `--i2v-stability` enabled for more stable video, and 17 with `--i2v-stability` disabled for more dynamic video. |
| `--flow-reverse` | False | If set, learning/sampling proceeds from t=1 to t=0. |
| `--seed` | None | The random seed for generating the video; if None, a random seed is initialized. |
| `--use-cpu-offload` | False | Use CPU offloading for model loading to save memory; necessary for high-resolution video generation. |
| `--save-path` | ./results | Path to save the generated video. |
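As an illustration, a single-GPU launch combining these arguments might look like the sketch below (the prompt and image path are placeholders, and the boolean options are assumed to be simple switches; check the sampling script for the authoritative argument list):

```bash
cd HunyuanVideo-I2V

# Illustrative single-GPU run; replace the prompt and reference image with your own.
python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "A subject performs a simple action, with the camera slowly panning." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.jpg \
    --i2v-resolution 720p \
    --i2v-stability \
    --video-length 129 \
    --infer-steps 50 \
    --flow-reverse \
    --flow-shift 7.0 \
    --save-path ./results
```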
---
## Download Text Encoder

HunyuanVideo-I2V uses an MLLM model and a CLIP model as text encoders.

1. MLLM model (text_encoder_i2v folder)

   HunyuanVideo-I2V supports different MLLMs (including HunyuanMLLM and open-source MLLM models). At this stage, we have not yet released HunyuanMLLM. We recommend that users in the community use [llava-llama-3-8b](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) provided by [Xtuner](https://huggingface.co/xtuner), which can be downloaded from Hugging Face.

   Note that unlike [HunyuanVideo](https://github.com/Tencent/HunyuanVideo/tree/main), which only uses the language-model part of `llava-llama-3-8b-v1_1-transformers`, HunyuanVideo-I2V needs the full model to encode both prompts and images. Therefore, you only need to download the model; no preprocessing is required.

2. CLIP model

   We use [CLIP](https://huggingface.co/openai/clip-vit-large-patch14) provided by [OpenAI](https://openai.com) as the second text encoder.
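For reference, the two encoders can typically be fetched with `huggingface-cli` as sketched below (the CLIP target folder name is an assumption here; follow ckpts/README.md for the exact directory layout):

```shell
# MLLM text encoder: download the full llava-llama-3-8b model (no preprocessing needed)
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./ckpts/text_encoder_i2v

# CLIP text encoder (target folder name assumed; check ckpts/README.md)
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/text_encoder_2
```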
## 🎉 Customizable I2V LoRA effects training

### Requirements

The requirements for training a HunyuanVideo-I2V LoRA model (batch size = 1) to generate videos are as follows:

* An NVIDIA GPU with CUDA support is required.
* The model has been tested on a single 80GB GPU.
* **Minimum**: The minimum GPU memory required is 79GB for 360p training.
* **Recommended**: We recommend using a GPU with 80GB of memory for better generation quality.
* Tested operating system: Linux
* Note: You can train with 360p data and directly infer 720p videos.
Some of the training-script settings are listed below:

| Variable | Default | Description |
|:--------:|:-------:|:------------|
| `DATA_JSONS_DIR` | ./assets/demo/i2v_lora/train_dataset/processed_data/json_path | Directory of data JSONs generated by `hyvideo/hyvae_extract/start.sh`. |
| `CHIEF_IP` | 127.0.0.1 | Master-node IP of the machine. |

After training, you can find `pytorch_lora_kohaya_weights.safetensors` under `{SAVE_BASE}/log_EXP/*_{EXP_NAME}/checkpoints/global_step{*}/` and pass it to `--lora-path` to perform inference.
### Inference
```bash
cd HunyuanVideo-I2V
# The image path below is a placeholder; point --lora-path at the trained LoRA weights from the step above (see the argument table for further options).
python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "Two people hugged tightly, In the video, two people are standing apart from each other. They then move closer to each other and begin to hug tightly. The hug is very affectionate, with the two people holding each other tightly and looking into each other's eyes. The interaction is very emotional and heartwarming, with the two people expressing their love and affection for each other." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.jpg \
    --lora-path ./pytorch_lora_kohaya_weights.safetensors
```
## 🚀 Parallel Inference on Multiple GPUs by xDiT

[xDiT](https://github.com/xdit-project/xDiT) is a scalable inference engine for Diffusion Transformers (DiTs) on multi-GPU clusters.
It has successfully provided low-latency parallel inference solutions for a variety of DiT models, including mochi-1, CogVideoX, Flux.1, SD3, etc. This repo adopts the [Unified Sequence Parallelism (USP)](https://arxiv.org/abs/2405.07719) APIs for parallel inference of the HunyuanVideo-I2V model.
### Using Command Line
For example, to generate a video with 8 GPUs, you can use the following command:
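A sketch of such a launch is shown below. The `--ulysses-degree` and `--ring-degree` arguments follow xDiT's USP convention and are assumptions here; check the repo's parallel sampling script for the exact flag names.

```bash
cd HunyuanVideo-I2V

# Illustrative 8-GPU run with 8-way Ulysses sequence parallelism
# (parallelism flag names assumed from xDiT's USP API).
torchrun --nproc_per_node=8 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "A subject performs a simple action, with the camera slowly panning." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.jpg \
    --i2v-resolution 720p \
    --video-length 129 \
    --infer-steps 50 \
    --flow-reverse \
    --ulysses-degree 8 \
    --ring-degree 1 \
    --save-path ./results
```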
<thcolspan="4">Latency (Sec) for 1280x720 (129 frames 50 steps) on 8xGPU</th>
</tr>
<tr>
<th>1</th>
<th>2</th>
<th>4</th>
<th>8</th>
</tr>
</thead>
<tbody>
<tr>
<th>1904.08</th>
<th>934.09 (2.04x)</th>
<th>514.08 (3.70x)</th>
<th>337.58 (5.64x)</th>
</tr>
</tbody>
</table>
</p>
## 🔗 BibTeX
If you find [HunyuanVideo](https://arxiv.org/abs/2412.03603) useful for your research and applications, please cite using this BibTeX:
```BibTeX
@misc{kong2024hunyuanvideo,
title={HunyuanVideo: A Systematic Framework For Large Video Generative Models},
author={Weijie Kong and Qi Tian and Zijian Zhang and Rox Min and Zuozhuo Dai and Jin Zhou and Jiangfeng Xiong and Xin Li and Bo Wu and Jianwei Zhang and Kathrina Wu and Qin Lin and Aladdin Wang and Andong Wang and Changlin Li and Duojun Huang and Fang Yang and Hao Tan and Hongmei Wang and Jacob Song and Jiawang Bai and Jianbing Wu and Jinbao Xue and Joey Wang and Junkun Yuan and Kai Wang and Mengyang Liu and Pengyu Li and Shuai Li and Weiyan Wang and Wenqing Yu and Xinchi Deng and Yang Li and Yanxin Long and Yi Chen and Yutao Cui and Yuanbo Peng and Zhentao Yu and Zhiyu He and Zhiyong Xu and Zixiang Zhou and Zunnan Xu and Yangyu Tao and Qinglin Lu and Songtao Liu and Dax Zhou and Hongfa Wang and Yong Yang and Di Wang and Yuhong Liu and Jie Jiang and Caesar Zhong},
year={2024},
eprint={2412.03603},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.03603},
}
```
## Acknowledgements
We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers), and [HuggingFace](https://huggingface.co) repositories for their open research and exploration.
We also thank the Tencent Hunyuan Multimodal team for their help with the text encoder.