*.pkl
*.pt
*.mov
*.pth
*.npz
*.npy
*.obj
*.onnx
*.tar
*.bin
cache*
.DS_Store
*DS_Store
outputs/
workspace/experiments/
nohup*.txt
models/
i2vgen-xl
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
# i2vgen-xl
## Paper
**I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models**
* https://arxiv.org/abs/2311.04145
## Model Architecture
This is a two-stage video generation model whose main building blocks are `3D-UNet`s. The first stage generates a low-quality video; it includes `CLIP` for extracting high-level image information (such as semantic features), `D.Enc.` (the `Encoder` of a `VQGAN`) for image compression, and `G.Enc.` for extracting low-level features (such as fine details). The second stage produces the high-quality video: conditioned on the text, the output of the first stage is resized and fed into an LDM that performs the noising and denoising process, yielding the final high-definition video.
![Alt text](readme_imgs/image-1.png)
## Algorithm
The algorithm generates videos in a cascaded manner, splitting the task into two stages: one ensures the semantic coherence of the video, and the other enhances video detail and raises the resolution.
![alt text](readme_imgs/image-2.png)
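To make the cascade concrete, the following is a schematic sketch of the two-stage data flow only; every module, tensor shape, and variable name below is a placeholder for illustration and is not taken from the actual I2VGen-XL code.
```python
# Schematic sketch of the cascaded flow (placeholder modules, not the real model).
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Stand-in for a 3D-UNet diffusion stage; a single conv replaces iterative denoising."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, latent, cond):
        # The real model would run conditioned denoising here; `cond` is ignored in this sketch.
        return self.net(latent)

# Stage 1: low-quality video conditioned on image semantics (CLIP) and details (G.Enc.).
image_latent = torch.randn(1, 4, 1, 32, 32)      # placeholder for the D.Enc./VQGAN-encoded image
semantic_cond = torch.randn(1, 77, 1024)         # placeholder for CLIP features
low_res_video = Stage()(image_latent.repeat(1, 1, 16, 1, 1), semantic_cond)

# Stage 2: resize the stage-1 output and refine it with a text-conditioned LDM.
text_cond = torch.randn(1, 77, 1024)             # placeholder for the text embedding
resized = nn.functional.interpolate(low_res_video, scale_factor=(1, 2, 2), mode="trilinear")
high_res_video = Stage()(resized, text_cond)
print(high_res_video.shape)                      # e.g. torch.Size([1, 4, 16, 64, 64])
```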
## Environment Setup
### Docker (Option 1)
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
docker run --shm-size 10g --network=host --name=vgen --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to the project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl (from whl.zip)
pip install triton-2.1.0%2Bgit34f8189.abi0.dtk2310-cp38-cp38-manylinux2014_x86_64.whl (download from the developer community)
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh (from whl.zip)
# Install the following as needed
yum install epel-release -y
yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm -y
yum install ffmpeg ffmpeg-devel libsm6 libxext6 -y
### Docker (Option 2)
# Run this in the directory containing the Dockerfile
docker build -t <IMAGE_NAME>:<TAG> .
docker run --shm-size 10g --network=host --name=vgen --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to the project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl (from whl.zip)
pip install triton-2.1.0%2Bgit34f8189.abi0.dtk2310-cp38-cp38-manylinux2014_x86_64.whl (download from the developer community)
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh (from whl.zip)
# Install the following as needed
yum install epel-release -y
yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm -y
yum install ffmpeg ffmpeg-devel libsm6 libxext6 -y
### Anaconda (Option 3)
1. The special deep learning libraries required by this project for DCU GPUs can be downloaded and installed from the Guanghe developer community:
https://developer.hpccube.com/tool/
DTK driver: dtk23.10.1
python: 3.8
torch: 2.1.0
torchvision: 0.16.0
triton: 2.1.0
Note: the versions of the DTK driver, python, torch, and the other DCU-related tools above must correspond to each other exactly.
2. Install the remaining, non-special libraries according to requirements.txt:
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl (from whl.zip)
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh (from whl.zip)
# Install as needed
conda install -c conda-forge ffmpeg
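After installation, you can optionally verify that the environment matches the version table above; a minimal check, assuming the packages are importable under these names:
```python
# Optional sanity check of the toolchain versions listed above.
import torch
import torchvision
import triton

print("torch:", torch.__version__)              # expected 2.1.0
print("torchvision:", torchvision.__version__)  # expected 0.16.0
print("triton:", triton.__version__)            # expected 2.1.0
# python should be 3.8 and the DTK driver dtk23.10.1 (check with `python -V` and your DTK install).
```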
## Dataset
The authors have not released the training dataset, and the commonly used datasets cannot currently be downloaded.
## Inference
### Model Download
https://huggingface.co/ali-vilab/i2vgen-xl/tree/main
i2vgen-xl/
├── i2vgen_xl_00854500.pth
├── open_clip_pytorch_model.bin
├── stable_diffusion_image_key_temporal_attention_x1.json
└── v2-1_512-ema-pruned.ckpt
### Command Line
python inference.py --cfg configs/i2vgen_xl_infer.yaml
python inference.py --cfg configs/i2vgen_xl_infer.yaml test_list_path data/test_list_for_i2vgen.txt test_model i2vgen-xl/i2vgen_xl_00854500.pth
`test_list_path` specifies the input image paths and their corresponding captions; please follow the format and suggestions in the demo file data/test_list_for_i2vgen.txt. `test_model` is the path of the model to load.
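For reference, each line of the demo list pairs an image path with its caption using a `|||` separator, matching the demo data files shipped in this repository. The snippet below is a minimal sketch of reading such a list; adjust the parsing if your list uses a different layout.
```python
# Read a test list of the assumed form: <image_path>|||<caption>
with open("data/test_list_for_i2vgen.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        image_path, caption = line.split("|||", maxsplit=1)
        print(image_path, "->", caption)
```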
### Gradio App
python gradio_app.py
Note: the first time this command runs it downloads the default files; once the download finishes, you need to manually comment out the code in `~/.cache/modelscope/modelscope_modules/i2vgen-xl/ms_wrapper.py`.
![alt text](readme_imgs/image-3.png)
## Results
||Input|Output|
|:---|:---|:---|
|Image|![alt text](readme_imgs/img_0001.jpg)|![alt text](readme_imgs/r.gif)|
|Prompt|A green frog floats on the surface of the water on green lotus leaves, with several pink lotus flowers, in a Chinese painting style.||
### Accuracy
## Application Scenarios
### Algorithm Category
`Video Generation`
### Key Application Industries
`Media, Research, Education`
## Source Repository and Issue Feedback
* https://developer.hpccube.com/codes/modelzoo/i2vgen-xl_pytorch
## References
* https://github.com/ali-vilab/VGen
# VGen
![figure1](source/VGen.jpg "figure1")
VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models. This repository includes implementations of the following methods:
- [I2VGen-xl: High-quality image-to-video synthesis via cascaded diffusion models](https://i2vgen-xl.github.io)
- [VideoComposer: Compositional Video Synthesis with Motion Controllability](https://videocomposer.github.io)
- [Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation](https://higen-t2v.github.io)
- [A Recipe for Scaling up Text-to-Video Generation with Text-free Videos](https://tf-t2v.github.io)
- [InstructVideo: Instructing Video Diffusion Models with Human Feedback](https://instructvideo.github.io)
- [DreamVideo: Composing Your Dream Videos with Customized Subject and Motion](https://dreamvideo-t2v.github.io)
- [VideoLCM: Video Latent Consistency Model](https://arxiv.org/abs/2312.09109)
- [Modelscope text-to-video technical report](https://arxiv.org/abs/2308.06571)
VGen can produce high-quality videos from the input text, images, desired motion, desired subjects, and even the feedback signals provided. It also offers a variety of commonly used video generation tools such as visualization, sampling, training, inference, join training using images and videos, acceleration, and more.
<a href='https://i2vgen-xl.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2311.04145'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/damo-vilab/I2VGen-XL) [![Paper page](https://huggingface.co/datasets/huggingface/badges/resolve/main/paper-page-sm-dark.svg)](https://huggingface.co/papers/2311.04145)
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-a-discussion-sm-dark.svg)](https://huggingface.co/spaces/damo-vilab/I2VGen-XL/discussions) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/XUi0y7dxqEQ) <a href='https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441039979087.mp4'><img src='source/logo.png'></a>
[![Replicate](https://replicate.com/cjwbw/i2vgen-xl/badge)](https://replicate.com/cjwbw/i2vgen-xl/)
## 🔥News!!!
- __[2024.03]__ We release the code and model of HiGen!!
- __[2024.01]__ The Gradio demo of I2VGen-XL is now available on [HuggingFace](https://huggingface.co/spaces/damo-vilab/I2VGen-XL); thanks to our colleague @[Wenmeng Zhou](https://github.com/wenmengzhou) and @[AK](https://twitter.com/_akhaliq) for the support. Welcome to try it out.
- __[2024.01]__ We now support running the Gradio app locally; thanks to our colleague @[Wenmeng Zhou](https://github.com/wenmengzhou) for the support and @[AK](https://twitter.com/_akhaliq) for the suggestion. Welcome to give it a try.
- __[2024.01]__ Thanks @[Chenxi](https://chenxwh.github.io) for supporting the running of i2vgen-xl on [![Replicate](https://replicate.com/cjwbw/i2vgen-xl/badge)](https://replicate.com/cjwbw/i2vgen-xl/). Feel free to give it a try.
- __[2024.01]__ The Gradio demo of I2VGen-XL is now available on [Modelscope](https://modelscope.cn/studios/damo/I2VGen-XL/summary); welcome to try it out.
- __[2023.12]__ We have open-sourced the code and models for [DreamTalk](https://github.com/ali-vilab/dreamtalk), which can produce high-quality talking head videos across diverse speaking styles using diffusion models.
- __[2023.12]__ We release [TF-T2V](https://tf-t2v.github.io) that can scale up existing video generation techniques using text-free videos, significantly enhancing the performance of both [Modelscope-T2V](https://arxiv.org/abs/2308.06571) and [VideoComposer](https://videocomposer.github.io) at the same time.
- __[2023.12]__ We updated the codebase to support higher versions of xformers (0.0.22) and torch 2.0+, and removed the dependency on flash_attn.
- __[2023.12]__ We release [InstructVideo](https://instructvideo.github.io/), which can accept human feedback signals to improve the VLDM.
- __[2023.12]__ We release [DreamTalk](https://dreamtalk-project.github.io), a diffusion-based expressive talking-head generation method.
- __[2023.12]__ We release the high-efficiency video generation method [VideoLCM](https://arxiv.org/abs/2312.09109)
- __[2023.12]__ We release the code and model of [I2VGen-XL](https://i2vgen-xl.github.io) and the [ModelScope T2V](https://arxiv.org/abs/2308.06571)
- __[2023.12]__ We release the T2V method [HiGen](https://higen-t2v.github.io) and customizing T2V method [DreamVideo](https://dreamvideo-t2v.github.io).
- __[2023.12]__ We write an [introduction document](doc/introduction.pdf) for VGen and compare I2VGen-XL with SVD.
- __[2023.11]__ We release a high-quality I2VGen-XL model, please refer to the [Webpage](https://i2vgen-xl.github.io)
## TODO
- [x] Release the technical papers and webpage of [I2VGen-XL](doc/i2vgen-xl.md)
- [x] Release the code and pretrained models that can generate 1280x720 videos
- [x] Release the code and models of [DreamTalk](https://github.com/ali-vilab/dreamtalk) that can generate expressive talking heads
- [ ] Release the code and pretrained models of [HumanDiff]()
- [ ] Release models optimized specifically for the human body and faces
- [ ] Release an updated version that can fully maintain identity and capture large, accurate motions simultaneously
- [ ] Release other methods and the corresponding models
## Preparation
The main features of VGen are as follows:
- Expandability, allowing for easy management of your own experiments.
- Completeness, encompassing all common components for video generation.
- Excellent performance, featuring powerful pre-trained models in multiple tasks.
### Installation
```
conda create -n vgen python=3.8
conda activate vgen
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
You also need to ensure that your system has installed the `ffmpeg` command. If it is not installed, you can install it using the following command:
```
sudo apt-get update && sudo apt-get install ffmpeg libsm6 libxext6 -y
```
### Datasets
We have provided a **demo dataset** that includes images and videos, along with their lists in ``data``.
*Please note that the demo images used here are for testing purposes and were not included in the training.*
### Clone the code
```
git clone https://github.com/ali-vilab/VGen.git
cd VGen
```
## Getting Started with VGen
### (1) Train your text-to-video model
Enabling distributed training is as simple as executing the following command.
```
python train_net.py --cfg configs/t2v_train.yaml
```
In the `t2v_train.yaml` configuration file, you can specify the training data, adjust the video-to-image ratio using `frame_lens`, validate your ideas with different diffusion settings, and so on.
- Before training, you can download any of our open-source models for initialization. Our codebase supports custom initialization and `grad_scale` settings, all of which are included in the `Pretrain` item in the yaml file (see the sketch after this list).
- During training, you can view the saved models and intermediate inference results in the `workspace/experiments/t2v_train` directory.
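If you prefer to prepare such a configuration programmatically rather than editing the yaml by hand, the sketch below shows one possible way. It assumes the config parses as standard YAML with PyYAML (the project may use its own config loader), and the checkpoint path is only an example.
```python
# Minimal sketch: load t2v_train.yaml, point initialization at a downloaded checkpoint,
# and write out a custom copy. Paths are examples, not fixed project paths.
import yaml

with open("configs/t2v_train.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["Pretrain"]["resume_checkpoint"] = "models/my_downloaded_model.pth"  # example path
cfg["Pretrain"]["grad_scale"] = 0.5
cfg["frame_lens"] = [1, 16, 16, 16, 16, 32, 32, 32]  # video-to-image ratio (default values)

with open("configs/t2v_train_custom.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```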
After the training is completed, you can perform inference on the model using the following command.
```
python inference.py --cfg configs/t2v_infer.yaml
```
Then you can find the videos you generated in the `workspace/experiments/test_img_01` directory. For specific configurations such as data, models, seed, etc., please refer to the `t2v_infer.yaml` file.
*If you want to directly load our previously open-sourced [Modelscope T2V model](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main), please refer to [this link](https://github.com/damo-vilab/i2vgen-xl/issues/31).*
<!-- <table>
<center>
<tr>
<td ><center>
<video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441754174077.mp4"></video>
</center></td>
<td ><center>
<video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441138824052.mp4"></video>
</center></td>
</tr>
</center>
</table>
</center> -->
### (2) Run the I2VGen-XL model
(i) Download model and test data:
```
!pip install modelscope
from modelscope.hub.snapshot_download import snapshot_download
model_dir = snapshot_download('damo/I2VGen-XL', cache_dir='models/', revision='v1.0.0')
```
or you can also download it through HuggingFace (https://huggingface.co/damo-vilab/i2vgen-xl):
```
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/damo-vilab/i2vgen-xl
```
(ii) Run the following command:
```
python inference.py --cfg configs/i2vgen_xl_infer.yaml
```
or you can run:
```
python inference.py --cfg configs/i2vgen_xl_infer.yaml test_list_path data/test_list_for_i2vgen.txt test_model models/i2vgen_xl_00854500.pth
```
The `test_list_path` represents the input image path and its corresponding caption. Please refer to the specific format and suggestions within demo file `data/test_list_for_i2vgen.txt`. `test_model` is the path for loading the model. In a few minutes, you can retrieve the high-definition video you wish to create from the `workspace/experiments/test_list_for_i2vgen` directory. At present, we find that the current model performs inadequately on **anime images** and **images with a black background** due to the lack of relevant training data. We are consistently working to optimize it.
(iii) Run the gradio app locally:
```
python gradio_app.py
```
(iv) Run the model on ModelScope and HuggingFace:
- [Modelscope](https://modelscope.cn/studios/damo/I2VGen-XL/summary)
- [HuggingFace](https://huggingface.co/spaces/damo-vilab/I2VGen-XL)
<span style="color:red">Due to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.</span>
<center>
<table>
<center>
<tr>
<td ><center>
<image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01CCEq7K1ZeLpNQqrWu_!!6000000003219-0-tps-1280-720.jpg"></image>
</center></td>
<td ><center>
<!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4"></video> -->
<image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01hIQcvG1spmQMLqBo0_!!6000000005816-1-tps-1280-704.gif"></image>
</center></td>
</tr>
<tr>
<td ><center>
<p>Input Image</p>
</center></td>
<td ><center>
<p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442125067544.mp4">HERE</a> to view the generated video.</p>
</center></td>
</tr>
<tr>
<td ><center>
<image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01ZXY7UN23K8q4oQ3uG_!!6000000007236-2-tps-1280-720.png"></image>
</center></td>
<td ><center>
<!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4"></video> -->
<image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01iaSiiv1aJZURUEY53_!!6000000003309-1-tps-1280-704.gif"></image>
</center></td>
</tr>
<tr>
<td ><center>
<p>Input Image</p>
</center></td>
<td ><center>
<p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/441385957074.mp4">HERE</a> to view the generated video.</p>
</center></td>
</tr>
<tr>
<td ><center>
<image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01NHpVGl1oat4H54Hjf_!!6000000005242-2-tps-1280-720.png"></image>
</center></td>
<td ><center>
<!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4"></video> -->
<!-- <image muted="true" height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif"></image>
-->
<image height="260" src="https://img.alicdn.com/imgextra/i4/O1CN01DgLj1T240jfpzKoaQ_!!6000000007329-1-tps-1280-704.gif"></image>
</center></td>
</tr>
<tr>
<td ><center>
<p>Input Image</p>
</center></td>
<td ><center>
<p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442102706767.mp4">HERE</a> to view the generated video.</p>
</center></td>
</tr>
<tr>
<td ><center>
<image height="260" src="https://img.alicdn.com/imgextra/i1/O1CN01odS61s1WW9tXen21S_!!6000000002795-0-tps-1280-720.jpg"></image>
</center></td>
<td ><center>
<!-- <video muted="true" autoplay="true" loop="true" height="260" src="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4"></video> -->
<image height="260" src="https://img.alicdn.com/imgextra/i3/O1CN01Jyk1HT28JkZtpAtY6_!!6000000007912-1-tps-1280-704.gif"></image>
</center></td>
</tr>
<tr>
<td ><center>
<p>Input Image</p>
</center></td>
<td ><center>
<p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/442163934688.mp4">HERE</a> to view the generated video.</p>
</center></td>
</tr>
</center>
</table>
</center>
### (3) Run the HiGen model
(i) Download model:
```
!pip install modelscope
from modelscope.hub.snapshot_download import snapshot_download
model_dir = snapshot_download('iic/HiGen', cache_dir='models/')
```
Then you might need the following command to move the checkpoints to the "models/" directory:
```
mv ./models/iic/HiGen/* ./models/
```
(ii) Run the following command for text-to-video generation:
```
python inference.py --cfg configs/higen_infer.yaml
```
In a few minutes, you can retrieve the videos you wish to create from the `workspace/experiments/text_list_for_t2v_share` directory.
Then you can execute the following command to perform super-resolution on the generated videos:
```
python inference.py --cfg configs/sr600_infer.yaml
```
Finally, you can retrieve the high-definition video from the `workspace/experiments/text_list_for_t2v_share` directory.
<span style="color:red">Due to the compression of our video quality in GIF format, please click 'HERE' below to view the original video.</span>
<table>
<center>
<tr>
<td ><center>
<image height="260" src="source/duck.png"></image>
</center></td>
<td ><center>
<image height="260" src="source/bat_man.png"></image>
</center></td>
</tr>
<tr>
<td ><center>
<p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/452227605224.mp4">HERE</a> to view the generated video.</p>
</center></td>
<td ><center>
<p>Click <a href="https://cloud.video.taobao.com/play/u/null/p/1/e/6/t/1/452015792863.mp4">HERE</a> to view the generated video.</p>
</center></td>
</tr>
</center>
</table>
</center>
### (4) Other methods
In preparation!!
## Customize your own approach
Our codebase essentially supports all the commonly used components in video generation. You can manage your experiments flexibly by adding corresponding registration classes, including `ENGINE, MODEL, DATASETS, EMBEDDER, AUTO_ENCODER, VISUAL, DIFFUSION, PRETRAIN`, and can be compatible with all our open-source algorithms according to your own needs. If you have any questions, feel free to give us your feedback at any time.
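As an illustration of the registration mechanism described above, here is a hypothetical sketch of a string-keyed registry that resolves a config entry's `type` field to a class; the names are placeholders, not VGen's actual API.
```python
# Hypothetical registry sketch (placeholder names, not VGen's actual implementation).
MODEL = {}

def register_model(name):
    """Register a model class under `name` so configs can refer to it by string."""
    def wrapper(cls):
        MODEL[name] = cls
        return cls
    return wrapper

@register_model("UNetSD_MyVariant")
class UNetSDMyVariant:
    def __init__(self, in_dim=4, out_dim=4):
        self.in_dim, self.out_dim = in_dim, out_dim

# A config entry such as {'type': 'UNetSD_MyVariant', 'in_dim': 4, ...} is then
# resolved to an instance by looking up the registered class:
cfg = {"type": "UNetSD_MyVariant", "in_dim": 4, "out_dim": 4}
model = MODEL[cfg.pop("type")](**cfg)
print(type(model).__name__)
```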
## BibTeX
If this repo is useful to you, please cite our corresponding technical papers.
```bibtex
@article{2023videocomposer,
title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
author={Wang, Xiang and Yuan, Hangjie and Zhang, Shiwei and Chen, Dayou and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
journal={arXiv preprint arXiv:2306.02018},
year={2023}
}
@article{2023i2vgenxl,
title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},
journal={arXiv preprint arXiv:2311.04145},
year={2023}
}
@article{wang2023modelscope,
title={Modelscope text-to-video technical report},
author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
journal={arXiv preprint arXiv:2308.06571},
year={2023}
}
@article{dreamvideo,
title={DreamVideo: Composing Your Dream Videos with Customized Subject and Motion},
author={Wei, Yujie and Zhang, Shiwei and Qing, Zhiwu and Yuan, Hangjie and Liu, Zhiheng and Liu, Yu and Zhang, Yingya and Zhou, Jingren and Shan, Hongming},
journal={arXiv preprint arXiv:2312.04433},
year={2023}
}
@article{qing2023higen,
title={Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation},
author={Qing, Zhiwu and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Wei, Yujie and Zhang, Yingya and Gao, Changxin and Sang, Nong },
journal={arXiv preprint arXiv:2312.04483},
year={2023}
}
@article{wang2023videolcm,
title={VideoLCM: Video Latent Consistency Model},
author={Wang, Xiang and Zhang, Shiwei and Zhang, Han and Liu, Yu and Zhang, Yingya and Gao, Changxin and Sang, Nong },
journal={arXiv preprint arXiv:2312.09109},
year={2023}
}
@article{ma2023dreamtalk,
title={DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models},
author={Ma, Yifeng and Zhang, Shiwei and Wang, Jiayu and Wang, Xiang and Zhang, Yingya and Deng, Zhidong},
journal={arXiv preprint arXiv:2312.09767},
year={2023}
}
@article{2023InstructVideo,
title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},
author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},
journal={arXiv preprint arXiv:2312.12490},
year={2023}
}
@article{TFT2V,
title={A Recipe for Scaling up Text-to-Video Generation with Text-free Videos},
author={Wang, Xiang and Zhang, Shiwei and Yuan, Hangjie and Qing, Zhiwu and Gong, Biao and Zhang, Yingya and Shen, Yujun and Gao, Changxin and Sang, Nong},
journal={arXiv preprint arXiv:2312.15770},
year={2023}
}
```
## Acknowledgement
We would like to express our gratitude for the contributions of several previous works to the development of VGen. This includes, but is not limited to, [Composer](https://arxiv.org/abs/2302.09778), [ModelScopeT2V](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [Stable Diffusion](https://github.com/Stability-AI/stablediffusion), [OpenCLIP](https://github.com/mlfoundations/open_clip), [WebVid-10M](https://m-bain.github.io/webvid-dataset/), [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/), [Pidinet](https://github.com/zhuoinoulu/pidinet) and [MiDaS](https://github.com/isl-org/MiDaS). We are committed to building upon these foundations in a way that respects their original contributions.
## Disclaimer
This open-source model is trained using the [WebVid-10M](https://m-bain.github.io/webvid-dataset/) and [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) datasets and is intended for <strong>RESEARCH/NON-COMMERCIAL USE ONLY</strong>.
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
build:
gpu: true
system_packages:
- libgl1-mesa-glx
- libglib2.0-0
- ffmpeg
python_version: "3.11"
python_packages:
- torch==2.0.1
- torchvision==0.15.2
- easydict==1.10
- tokenizers==0.15.0
- ftfy==6.1.1
- transformers==4.36.2
- imageio==2.33.1
- fairscale==0.4.13
- open-clip-torch==2.23.0
- chardet==5.2.0
- torchdiffeq==0.2.3
- opencv-python==4.9.0.80
- opencv-python-headless==4.9.0.80
- torchsde==0.2.6
- simplejson==3.19.2
- scikit-learn==1.3.2
- scikit-image==0.22.0
- rotary-embedding-torch==0.5.3
- pynvml==11.5.0
- triton==2.0.0
- pytorch-lightning==2.1.3
- torchmetrics==1.2.1
- PyYAML==6.0.1
run:
- pip install -U xformers --index-url https://download.pytorch.org/whl/cu118
predict: "predict.py:Predictor"
ENABLE: true
DATASET: webvid10m
TASK_TYPE: inference_higen_entrance
use_fp16: True
guide_scale: 12.0
chunk_size: 2
decoder_bs: 2
max_frames: 32
target_fps: 16 # FPS Conditions, not the encoding fps
scale: 8
seed: 0
round: 1
batch_size: 1
# For important input
vldm_cfg: configs/higen_train.yaml
test_list_path: data/text_list_for_t2v_share.txt
test_model: models/cvpr2024.t2v.e003.non_ema_0725000.pth
motion_factor: 500
appearance_factor: 1.0
TASK_TYPE: train_t2v_higen_entrance
ENABLE: true
use_ema: true
num_workers: 6
frame_lens: [32, 32, 32, 32, 32, 32, 32, 32]
sample_fps: [8, 8, 8, 8, 8, 8, 8, 8]
resolution: [448, 256]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'models/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_HiGen',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'concat_dim': 4,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False,
'context_embedding_depth': 2,
'num_tokens': 16
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'linear_sd', # linear_sd
'schedule_param': {
'num_timesteps': 1000,
'zero_terminal_snr': True,
'init_beta': 0.00085,
'last_beta': 0.0120
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 256,
"4": 96,
"8": 48,
"16": 32,
"24": 24,
"32": 10
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'models/i2vgen_xl_00854500.pth',
'sd_keys_path': 'models/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
TASK_TYPE: inference_i2vgen_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions, not the encoding fps
scale: 8
seed: 8888
round: 4
batch_size: 1
use_zero_infer: True
# For important input
vldm_cfg: configs/i2vgen_xl_train.yaml
test_list_path: data/test_list_for_i2vgen.txt
test_model: i2vgen-xl/i2vgen_xl_00854500.pth
TASK_TYPE: inference_i2vgen_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions
scale: 8
batch_size: 1
use_zero_infer: True
# For important input
round: 4
seed: 0
data_root: workspace/test_imgs/test_img_01
# test_list_path: workspace/test_imgs/test_img_01.txt
test_list_path: workspace/test_imgs/test_img_02.txt
cap_dict_path: workspace/test_imgs/cap_dict_01.json
vldm_cfg: configs/i2vgen_xl_train.yaml
test_model: i2vgen-xl/i2vgen_xl_person_00854500.pth
TASK_TYPE: train_i2v_vs_img_text_entrance
ENABLE: true
use_ema: true
num_workers: 6
frame_lens: [16, 16, 16, 16, 16, 32, 32, 32]
sample_fps: [8, 8, 16, 16, 16, 8, 16, 16]
resolution: [1280, 704]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [1280, 704],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [1280, 704],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'i2vgen-xl/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_I2VGen',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'concat_dim': 4,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'cosine', # cosine
'schedule_param': {
'num_timesteps': 1000,
'cosine_s': 0.008,
'zero_terminal_snr': True,
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 32,
"4": 8,
"8": 4,
"16": 2,
"32": 1,
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'i2vgen-xl/i2vgen_xl_00854500.pth',
'sd_keys_path': 'i2vgen-xl/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
TASK_TYPE: inference_sr600_entrance
use_fp16: True
vldm_cfg: ''
round: 1
batch_size: 1
# For important input
test_list_path: data/text_list_for_t2v_share.txt
test_model: models/sr_step_110000_ema.pth
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'i2vgen-xl/models/open_clip_pytorch_model.bin',
'negative_prompt': 'worst quality, normal quality, low quality, low res, blurry, text, watermark, logo, banner, extra digits, cropped, jpeg artifacts, signature, username, error, sketch ,duplicate, ugly, monochrome, horror, geometry, mutation, disgusting',
'positive_prompt': ', cinematic, High Contrast, highly detailed, Unreal Engine 5, no blur, full length ultra-wide angle shot a cinematic scene, taken using a Canon EOS R camera, hyper detailed photo - realistic maximum detail, 32k, Color Grading, portrait Photography, ultra HD, extreme meticulous detailing, skin pore detailing, hyper sharpness, perfect without deformations, 4k render'
}
UNet: {
'type': 'UNetSD_SR600',
'in_dim': 4,
'dim': 320,
'y_dim': 1024,
'context_dim': 1024,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'head_dim': 64,
'num_res_blocks': 2,
'attn_scales' :[1, 0.5, 0.25],
'use_scale_shift_norm': True,
'dropout': 0.1,
'temporal_attn_times': 1,
'temporal_attention': True,
'use_checkpoint': True,
'use_image_dataset': False,
'use_sim_mask': False,
'inpainting': True
}
Diffusion: {
'type': 'DiffusionDDIMSR',
'reverse_diffusion': {
'schedule': 'cosine',
'mean_type': 'v',
'schedule_param':
{
'num_timesteps': 1000,
'zero_terminal_snr': True
}
},
'forward_diffusion': {
'schedule': 'logsnr_cosine_interp',
'mean_type': 'v',
'schedule_param':
{
'num_timesteps': 1000,
'zero_terminal_snr': True,
'scale_min': 2.0,
'scale_max': 4.0
}
}
}
batch_sizes: {
"1": 256,
"4": 96,
"8": 48,
"16": 32,
"24": 24,
"32": 10
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
total_noise_levels: 700
TASK_TYPE: inference_text2video_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions, not encoding fps
scale: 8
batch_size: 1
use_zero_infer: True
# For important input
round: 4
seed: 8888
test_list_path: data/text_img_for_t2v.txt
vldm_cfg: configs/t2v_train.yaml
test_model: workspace/model_bk/model_scope_0267000.pth
TASK_TYPE: train_t2v_entrance
ENABLE: true
use_ema: false
num_workers: 6
frame_lens: [1, 16, 16, 16, 16, 32, 32, 32]
sample_fps: [1, 8, 16, 16, 16, 8, 16, 16]
resolution: [448, 256]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'models/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_T2VBase',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'misc_dropout': 0.4,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'cosine', # cosine
'schedule_param': {
'num_timesteps': 1000,
'cosine_s': 0.008,
'zero_terminal_snr': True,
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 32,
"4": 8,
"8": 4,
"16": 4,
"32": 2
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
['y', 'fps'],
],
'use_offset_noise': False,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'workspace/model_bk/model_scope_0267000.pth',
'sd_keys_path': 'data/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.1
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 5 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 8888
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea