# HunyuanVideo-I2V

## Paper

`HunyuanVideo: A Systematic Framework For Large Video Generative Models`

* https://arxiv.org/abs/2412.03603

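## Model Architecture

The model follows the mainstream `Latent Diffusion` architecture. It consists of a `3D VAE` that compresses and reconstructs the data and a `DiT` denoising module; text is encoded by `CLIP` and a multimodal large language model (MLLM) encoder.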

![alt text](readme_imgs/arch.png)
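
To make the role of the `3D VAE` concrete, the sketch below computes the latent tensor shape that the `DiT` would operate on. The compression factors (4x temporal, 8x spatial, 16 latent channels) are assumptions taken from the HunyuanVideo paper rather than from this README, and the helper name `latent_shape` is hypothetical.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=8, channels=16):
    """Return (channels, latent_frames, latent_h, latent_w) for a causal 3D VAE.

    Assumption: the first frame is encoded on its own, so the frame count
    must have the form t_ratio * n + 1 (129 = 4 * 32 + 1).
    """
    if (frames - 1) % t_ratio != 0:
        raise ValueError("frames must be of the form t_ratio * n + 1")
    if height % s_ratio or width % s_ratio:
        raise ValueError("height and width must be multiples of s_ratio")
    return (channels, (frames - 1) // t_ratio + 1, height // s_ratio, width // s_ratio)


print(latent_shape(129, 720, 1280))  # (16, 33, 90, 160)
```

Under these assumptions, a 129-frame 720p clip is denoised as 33 latent frames of 90 x 160, which is why the frame count used later in this README is 129 rather than a round number.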

## Algorithm

The model is trained with the flow matching algorithm.

![alt text](readme_imgs/alg.png)
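
As a rough illustration of flow matching, the sketch below builds one training pair using a linear path between a data sample and Gaussian noise. This is the generic textbook formulation, not code from this repository; the time convention (t=0 is data, t=1 is noise) and the function names are assumptions.

```python
import random


def flow_matching_pair(x_data, x_noise, t):
    """Build one flow matching training pair on a linear path.

    Assumed convention: x_t = (1 - t) * x_data + t * x_noise, so the
    regression target is the constant velocity x_noise - x_data.
    """
    x_t = [(1.0 - t) * d + t * n for d, n in zip(x_data, x_noise)]
    velocity = [n - d for d, n in zip(x_data, x_noise)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x_data = [0.5, -1.0, 2.0]                      # stand-in for a latent sample
x_noise = [rng.gauss(0.0, 1.0) for _ in x_data]
t = rng.random()
x_t, v_target = flow_matching_pair(x_data, x_noise, t)
# A real model would predict the velocity from (x_t, t, conditioning);
# here a zero prediction is scored only to show how the loss is formed.
print(mse([0.0] * len(x_data), v_target))
```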

## Environment Setup

### Docker (Option 1)
    
    docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10

    docker run --shm-size 100g --network=host --name=hunyuanvideo_i2v --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute_path_to_project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash

    pip install -r requirements.txt

    pip install distvae yunchang==0.6.0

    pip install "xfuser==0.4.2" --no-deps torch

    bash modified/fix.sh

### Dockerfile (Option 2)

    docker build -t <IMAGE_NAME>:<TAG> .

    docker run --shm-size 100g --network=host --name=hunyuanvideo_i2v --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute_path_to_project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
    
    pip install -r requirements.txt

    pip install distvae yunchang==0.6.0

    pip install "xfuser==0.4.2" --no-deps torch

    bash modified/fix.sh

### Anaconda (Option 3)

1. The DCU-specific deep learning libraries required by this project can be downloaded from the 光合 developer community: https://developer.sourcefind.cn/tool/

```
DTK driver:dtk24.04.3
python:python3.10
torch:2.3.0
torchvision:0.18.1
torchaudio:2.1.2
triton:2.1.0
flash-attn:2.6.1
deepspeed:0.14.2
apex:1.3.0
xformers:0.0.25
transformers:4.48.0
```

2. Install the remaining, non-DCU-specific libraries from requirements.txt:

```
pip install -r requirements.txt

pip install distvae yunchang==0.6.0

pip install "xfuser==0.4.2" --no-deps torch

bash modified/fix.sh

# Refer to the commands in modified/fix.sh and modify the code at the corresponding locations
```
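
Before running inference, it can help to confirm that the environment matches the versions pinned in step 1. The following is a minimal sketch using the standard-library `importlib.metadata`; it assumes the distribution names match the names in that list, and it compares by prefix because DCU builds may append a local version suffix. The helper name `check_versions` is hypothetical.

```python
from importlib import metadata

# Versions copied from the list in step 1 of the Anaconda instructions.
EXPECTED = {
    "torch": "2.3.0",
    "torchvision": "0.18.1",
    "torchaudio": "2.1.2",
    "triton": "2.1.0",
    "flash-attn": "2.6.1",
    "deepspeed": "0.14.2",
    "apex": "1.3.0",
    "xformers": "0.0.25",
    "transformers": "4.48.0",
}


def check_versions(expected):
    """Return {name: (expected, installed_or_None, ok)} for each package."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Prefix match so that a build such as "2.3.0+<suffix>" still passes.
        ok = have is not None and have.startswith(want)
        report[name] = (want, have, ok)
    return report


for name, (want, have, ok) in check_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:13s} expected {want:8s} installed {have} {status}")
```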

## Dataset

None

## Training

None

## Inference

### Multi-GPU

```bash
torchrun --nproc_per_node=5 sample_image2video.py \
--model HYVideo-T/2 \
--prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
--i2v-mode \
--i2v-image-path ./assets/demo/i2v/imgs/0.jpg \
--i2v-resolution 720p \
--i2v-stability \
--infer-steps 50 \
--video-length 129 \
--flow-reverse \
--flow-shift 7.0 \
--seed 0 \
--embedded-cfg-scale 6.0 \
--save-path ./results \
--ulysses-degree 1 \
--ring-degree 5 \
--video-size 720 720 \
--xdit-adaptive-size
```

Supported parameter combinations:

|     --video-size     | --video-length | --ulysses-degree x --ring-degree | --nproc_per_node |
|----------------------|----------------|----------------------------------|------------------|
| 1280 720 or 720 1280 | 129            | 8x1,4x2,2x4,1x8                  | 8                |
| 1280 720 or 720 1280 | 129            | 1x5                              | 5                |
| 1280 720 or 720 1280 | 129            | 4x1,2x2,1x4                      | 4                |
| 1280 720 or 720 1280 | 129            | 3x1,1x3                          | 3                |
| 1280 720 or 720 1280 | 129            | 2x1,1x2                          | 2                |
| 1104 832 or 832 1104 | 129            | 4x1,2x2,1x4                      | 4                |
| 1104 832 or 832 1104 | 129            | 3x1,1x3                          | 3                |
| 1104 832 or 832 1104 | 129            | 2x1,1x2                          | 2                |
| 960 960              | 129            | 6x1,3x2,2x3,1x6                  | 6                |
| 960 960              | 129            | 4x1,2x2,1x4                      | 4                |
| 960 960              | 129            | 3x1,1x3                          | 3                |
| 960 960              | 129            | 1x2,2x1                          | 2                |
| 960 544 or 544 960   | 129            | 6x1,3x2,2x3,1x6                  | 6                |
| 960 544 or 544 960   | 129            | 4x1,2x2,1x4                      | 4                |
| 960 544 or 544 960   | 129            | 3x1,1x3                          | 3                |
| 960 544 or 544 960   | 129            | 1x2,2x1                          | 2                |
| 832 624 or 624 832   | 129            | 4x1,2x2,1x4                      | 4                |
| 832 624 or 624 832   | 129            | 3x1,1x3                          | 3                |
| 832 624 or 624 832   | 129            | 2x1,1x2                          | 2                |
| 720 720              | 129            | 1x5                              | 5                |
| 720 720              | 129            | 3x1,1x3                          | 3                |

### Single GPU (not recommended)
```bash
python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.jpg \
    --i2v-resolution 720p \
    --i2v-stability \
    --infer-steps 50 \
    --video-length 129 \
    --flow-reverse \
    --flow-shift 7.0 \
    --seed 0 \
    --embedded-cfg-scale 6.0 \
    --use-cpu-offload \
    --save-path ./results
```

Parameter reference:

|        Argument        |            Default            |                                                            Description                                                             |
|:----------------------:|:-----------------------------:|:----------------------------------------------------------------------------------------------------------------------------------:|
|       `--prompt`       |             None              | Text prompt for video generation. |
|       `--model`        |      HYVideo-T/2-cfgdistill   | Use HYVideo-T/2 for I2V mode and HYVideo-T/2-cfgdistill for T2V mode. |
|     `--i2v-mode`       |            False              | Whether to enable I2V mode. |
|  `--i2v-image-path`    | ./assets/demo/i2v/imgs/0.png  | Reference image for video generation. |
|  `--i2v-resolution`    |            720p               | Resolution of the generated video. |
|  `--i2v-stability`     |            False              | Whether to use stable mode for I2V inference. |
|    `--video-length`    |             129               | Length of the generated video, in frames. |
|    `--infer-steps`     |              50               | Number of sampling steps. |
|     `--flow-shift`     |             7.0               | Shift factor for the flow matching scheduler. We recommend 7 with `--i2v-stability` enabled for more stable video, and 17 with `--i2v-stability` disabled for more dynamic video. |
|   `--flow-reverse`     |            False              | If set, learn/sample from t=1 to t=0. |
|        `--seed`        |             None              | Random seed for video generation; if None, a random seed is initialized. |
|  `--use-cpu-offload`   |            False              | Offload model weights to the CPU to save memory; necessary for high-resolution video generation. |
|     `--save-path`      |         ./results             | Path where the generated video is saved. |
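
To give a feel for what `--flow-shift` does, the sketch below warps a uniform timestep grid with the time-shift mapping commonly used by SD3-style flow matching schedulers. The README does not state the exact formula used by this project, so treat the mapping and the function names as assumptions.

```python
def shift_time(t, shift):
    """Warp a timestep t in [0, 1]: shift * t / (1 + (shift - 1) * t)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def shifted_schedule(steps, shift):
    """Uniform grid from t=1 (noise) down to t=0 (data), then warped."""
    return [shift_time(1.0 - i / steps, shift) for i in range(steps + 1)]


for shift in (7.0, 17.0):
    sched = shifted_schedule(50, shift)
    # Show the first few timesteps and the midpoint of the schedule.
    print(shift, [round(t, 3) for t in sched[:3]], round(sched[25], 3))
```

Under this mapping, a larger shift keeps the sampler at high noise levels for more of its steps, while the endpoints t=1 and t=0 stay fixed.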

## Result

![](readme_imgs/hy_i2v.gif)

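### Accuracy

None

## Application Scenarios

### Algorithm Category

`Video generation`

### Key Application Industries

`E-commerce, education, broadcast media`

## Pretrained Weights

|model|save_path|link|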
|:---:|:---:|:---:|
|HunyuanVideo-I2V|ckpts/hunyuan-video-i2v-720p/|[hf](https://hf-mirror.com/tencent/HunyuanVideo-I2V) \| [SCNet]|
|llava-llama-3-8b-v1_1-transformers|ckpts/text_encoder_i2v/|[hf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) \| [SCNet] |
|clip-vit-large-patch14 |ckpts/text_encoder_2| [hf](https://huggingface.co/openai/clip-vit-large-patch14) \| [SCNet] |
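
Once the weights are downloaded, a quick check that each `save_path` from the table above exists and is non-empty can save a failed launch. This is a minimal sketch; the helper name `missing_weight_dirs` is hypothetical, and it does not validate the files inside each directory.

```python
from pathlib import Path

# Directories copied from the save_path column of the table above.
EXPECTED_DIRS = [
    "ckpts/hunyuan-video-i2v-720p",
    "ckpts/text_encoder_i2v",
    "ckpts/text_encoder_2",
]


def missing_weight_dirs(root="."):
    """Return expected checkpoint directories that are absent or empty under root."""
    missing = []
    for rel in EXPECTED_DIRS:
        path = Path(root) / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = missing_weight_dirs()
    print("all weight directories present" if not missing else f"missing: {missing}")
```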


## Source Repository and Issue Reporting

* https://developer.sourcefind.cn/codes/modelzoo/hunyuanvideo-i2v_pytorch

## References

* https://github.com/Tencent/HunyuanVideo-I2V