Commit ce34ef01 authored by mashun1

dynamicrafter
.DS_Store
*pyc
.vscode
__pycache__
*.egg-info
checkpoints
results
backup
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
# DynamiCrafter
## Paper
**DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors**
* https://arxiv.org/abs/2310.12190
## Model Architecture
This model extends Stable Diffusion so that it can generate videos. Training uses a dual-stream image injection mechanism, which inherits visual details and extracts features of the input image in a context-aware manner. The overall pipeline is as follows: the inputs are a video `x` and an image $`x^m`$ (a randomly selected frame of `x`). The video `x` is passed frame by frame through the encoder of the `VAE` to obtain $`z_0`$; the image $`x^m`$ is passed through the same encoder, repeated along the time axis (`Repeat`), and concatenated with $`z_t`$ (obtained by diffusing $`z_0`$) before entering the `Denoising U-Net`. In parallel, the condition obtained by passing $`x^m`$ through the `CLIP image encoder` and the `Query transformer` enters the `U-Net` together with the `FPS` and `Text` features during training.
![Alt text](readme_imgs/image-1.png)
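To make the two streams concrete, the PyTorch sketch below mirrors the data flow described above. Every module is a toy stand-in (plain `Conv2d`/`Linear` layers), and all names and shapes are illustrative assumptions, not the repository's actual code.
```python
# Minimal sketch of the dual-stream image injection flow (toy stand-ins only).
import torch
import torch.nn as nn

B, T, C, H, W = 1, 16, 3, 32, 32    # batch, frames, channels, height, width
Z = 4                                # latent channels of the VAE

vae_encoder = nn.Conv2d(C, Z, kernel_size=8, stride=8)   # stand-in for the VAE encoder
clip_image_encoder = nn.Linear(C * H * W, 768)            # stand-in for the CLIP image encoder
query_transformer = nn.Linear(768, 1024)                  # stand-in for the query transformer

x = torch.randn(B, T, C, H, W)       # input video x
m = torch.randint(0, T, (1,)).item()
x_m = x[:, m]                        # x^m: a randomly selected frame of x

# Stream 1: frame-wise VAE encoding -> z_0, noised to z_t, then concatenated
# channel-wise with the repeated latent of x^m.
z_0 = vae_encoder(x.flatten(0, 1)).unflatten(0, (B, T))   # (B, T, Z, h, w)
z_t = z_0 + torch.randn_like(z_0)                          # stand-in for the forward diffusion step
z_m = vae_encoder(x_m).unsqueeze(1).repeat(1, T, 1, 1, 1)  # repeat along the time axis
unet_input = torch.cat([z_t, z_m], dim=2)                  # (B, T, 2Z, h, w)

# Stream 2: context-aware image condition for cross-attention, used in the
# U-Net together with the text and FPS features.
img_ctx = query_transformer(clip_image_encoder(x_m.flatten(1)))

print(unet_input.shape, img_ctx.shape)
```
The channel-wise concatenation feeds the U-Net input, while `img_ctx` would be consumed via cross-attention alongside the text and FPS embeddings.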
## Algorithm
Building on text-to-video generation, the algorithm injects visual information from the input image, so that visual details are preserved throughout the video generation process.
![Alt text](readme_imgs/image-2.png)
## Environment Setup
### Docker (Method 1)
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
docker run --shm-size 10g --network=host --name=dynamicrafter --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to the project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
# The following wheels are provided in whl.zip:
pip install triton-2.1.0+git34f8189.abi0.dtk2310-cp38-cp38-linux_x86_64.whl
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh
```
### Docker (Method 2)
```bash
# Run in the directory containing the Dockerfile:
docker build -t <IMAGE_NAME>:<TAG> .
docker run --shm-size 10g --network=host --name=dynamicrafter --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to the project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
# The following wheels are provided in whl.zip:
pip install triton-2.1.0+git34f8189.abi0.dtk2310-cp38-cp38-linux_x86_64.whl
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh
```
### Anaconda (Method 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the developer community (光合开发者社区):
https://developer.hpccube.com/tool/
    - DTK driver: dtk23.10.1
    - Python: python3.8
    - torch: 2.1.0

    Note: the versions of the DTK driver, Python, torch, and the other DCU-related tools above must strictly correspond to one another.
2. Install the other, non-DCU-specific libraries according to requirements.txt:
```bash
# The following wheels are provided in whl.zip:
pip install torchvision-0.16.0+git267eff6.abi0.dtk2310.torch2.1.0-cp38-cp38-linux_x86_64.whl --no-deps
pip install triton-2.1.0+git34f8189.abi0.dtk2310-cp38-cp38-linux_x86_64.whl
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh
pip install -r requirements.txt
```
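After installation, a quick sanity check (a suggested addition, not part of the original instructions) can confirm that the DTK build of PyTorch sees the DCU; on ROCm/DCU builds the device is exposed through the `torch.cuda` interface:
```python
import torch

print(torch.__version__)             # expect a 2.1.0 DTK build
print(torch.cuda.is_available())     # True if the DCU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```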
## Dataset
## Inference
### Model Download
|Model|Resolution|GPU Mem. & Inference Time (A100, ddim 50steps)|Checkpoint|
|:---------|:---------|:--------|:--------|
|DynamiCrafter1024|576x1024|18.3GB & 75s (`perframe_ae=True`)|https://huggingface.co/Doubiiu/DynamiCrafter_1024/blob/main/model.ckpt|
|DynamiCrafter512|320x512|12.8GB & 20s (`perframe_ae=True`)|https://huggingface.co/Doubiiu/DynamiCrafter_512/blob/main/model.ckpt|
|DynamiCrafter256|256x256|11.9GB & 10s (`perframe_ae=False`)|https://huggingface.co/Doubiiu/DynamiCrafter/blob/main/model.ckpt|
Note: if `huggingface.co` is unreachable, use the `hf-mirror` mirror by replacing `huggingface.co` with `hf-mirror.com` in the URLs above, and run `export HF_ENDPOINT=https://hf-mirror.com` so that the other required models are also downloaded automatically through the mirror.
The checkpoint files are organized as follows:
```
checkpoints/
├── dynamicrafter_512_v1
│   └── model.ckpt
├── dynamicrafter_1024_v1
│   └── model.ckpt
└── dynamicrafter_256_v1
    └── model.ckpt
```
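As one possible way to fetch the checkpoints into this layout, the sketch below uses the `huggingface_hub` package (assumed to be available; the repo IDs come from the table above, and `HF_ENDPOINT` is the mirror variable from the note):
```python
import os

# Route downloads through the mirror when huggingface.co is unreachable.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from huggingface_hub import hf_hub_download  # imported after setting HF_ENDPOINT

for repo_id, local_dir in [
    ("Doubiiu/DynamiCrafter_1024", "checkpoints/dynamicrafter_1024_v1"),
    ("Doubiiu/DynamiCrafter_512", "checkpoints/dynamicrafter_512_v1"),
    ("Doubiiu/DynamiCrafter", "checkpoints/dynamicrafter_256_v1"),
]:
    hf_hub_download(repo_id=repo_id, filename="model.ckpt", local_dir=local_dir)
```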
### Command Line
```bash
# Run on a single GPU:
# Select the model based on the required resolution, i.e., 1024|512|256:
sh scripts/run.sh 512
# Run on multiple GPUs for parallel inference:
sh scripts/run_mp.sh 512
```
### Gradio Demo
```bash
python gradio_app.py --res 512
```
## Results
||Input|Output|
|:---|:---|:---|
|image|![alt text](readme_imgs/bloom01.png)|![Alt text](readme_imgs/image-3.gif)|
|prompt|time-lapse of a blooming flower with leaves and a stem||
### Accuracy
## Application Scenarios
### Algorithm Category
`AIGC`
### Key Application Industries
`Media, Research, Education`
## Source Repository & Issue Reporting
https://developer.hpccube.com/codes/modelzoo/dynamicrafter_pytorch
## References
* https://github.com/Doubiiu/DynamiCrafter
## ___***DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors***___
<!-- ![](./assets/logo_long.png#gh-light-mode-only){: width="50%"} -->
<!-- ![](./assets/logo_long_dark.png#gh-dark-mode-only=100x20) -->
<div align="center">
<img src='assets/logo_long.png' style="height:100px"></img>
<a href='https://arxiv.org/abs/2310.12190'><img src='https://img.shields.io/badge/arXiv-2310.12190-b31b1b.svg'></a> &nbsp;
<a href='https://doubiiu.github.io/projects/DynamiCrafter/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> &nbsp;
<a href='https://huggingface.co/spaces/Doubiiu/DynamiCrafter'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a> &nbsp;
<a href='https://youtu.be/0NfmIsNAg-g'><img src='https://img.shields.io/badge/Youtube-Video-b31b1b.svg'></a><br>
[![Open in OpenXLab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/JinboXING/DynamiCrafter)&nbsp;&nbsp;
<a href='https://replicate.com/camenduru/dynami-crafter-576x1024'><img src='https://img.shields.io/badge/replicate-Demo-blue'></a>&nbsp;&nbsp;
<a href='https://github.com/camenduru/DynamiCrafter-colab'><img src='https://img.shields.io/badge/Colab-Demo-Green'></a>&nbsp;<a href='https://huggingface.co/papers/2310.12190'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Page-blue'></a>
_**[Jinbo Xing](https://doubiiu.github.io/), [Menghan Xia*](https://menghanxia.github.io), [Yong Zhang](https://yzhang2016.github.io), [Haoxin Chen](), [Wangbo Yu](), <br>[Hanyuan Liu](https://github.com/hyliu), [Xintao Wang](https://xinntao.github.io/), [Tien-Tsin Wong*](https://www.cse.cuhk.edu.hk/~ttwong/myself.html), [Ying Shan](https://scholar.google.com/citations?hl=en&user=4oXBp9UAAAAJ&view_op=list_works&sortby=pubdate)**_
<br><br>
(* corresponding authors)
From CUHK and Tencent AI Lab.
</div>
## 🔆 Introduction
### 🔥🔥 New Update Rolls Out for DynamiCrafter! Better Dynamic, Higher Resolution, and Stronger Coherence! <br>
🤗 DynamiCrafter can animate open-domain still images based on a <strong>text prompt</strong> by leveraging pre-trained video diffusion priors. Please check our project page and paper for more information. <br>
😀 We will continue to improve the model's performance.
👀 Seeking comparisons with [Stable Video Diffusion](https://stability.ai/news/stable-video-diffusion-open-ai-video-model) and [PikaLabs](https://pika.art/)? Click the image below.
[![](https://img.youtube.com/vi/0NfmIsNAg-g/0.jpg)](https://www.youtube.com/watch?v=0NfmIsNAg-g)
### 1.1. Showcases (576x1024)
<table class="center">
<!-- <tr>
<td colspan="1">"fireworks display"</td>
<td colspan="1">"a robot is walking through a destroyed city"</td>
</tr> -->
<tr>
<td>
<img src=assets/showcase/firework03.gif width="340">
</td>
<td>
<img src=assets/showcase/robot01.gif width="340">
</td>
</tr>
<!-- <tr>
<td colspan="1">"riding a bike under a bridge"</td>
<td colspan="1">""</td>
</tr> -->
<tr>
<td>
<img src=assets/showcase/bike_chineseink.gif width="340">
</td>
<td>
<img src=assets/showcase/girl07.gif width="340">
</td>
</tr>
</table>
### 1.2. Showcases (320x512)
<table class="center">
<!-- <tr>
<td colspan="1">"fireworks display"</td>
<td colspan="1">"a robot is walking through a destroyed city"</td>
</tr> -->
<tr>
<td>
<img src=assets/showcase/bloom2.gif width="340">
</td>
<td>
<img src=assets/showcase/train_anime02.gif width="340">
</td>
</tr>
<!-- <tr>
<td colspan="1">"riding a bike under a bridge"</td>
<td colspan="1">""</td>
</tr> -->
<tr>
<td>
<img src=assets/showcase/pour_honey.gif width="340">
</td>
<td>
<img src=assets/showcase/lighthouse.gif width="340">
</td>
</tr>
</table>
### 1.3. Showcases (256x256)
<table class="center">
<tr>
<td colspan="2">"bear playing guitar happily, snowing"</td>
<td colspan="2">"boy walking on the street"</td>
</tr>
<tr>
<td>
<img src=assets/showcase/guitar0.jpeg_00.png width="170">
</td>
<td>
<img src=assets/showcase/guitar0.gif width="170">
</td>
<td>
<img src=assets/showcase/walk0.png_00.png width="170">
</td>
<td>
<img src=assets/showcase/walk0.gif width="170">
</td>
</tr>
<!-- <tr>
<td colspan="2">"two people dancing"</td>
<td colspan="2">"girl talking and blinking"</td>
</tr>
<tr>
<td>
<img src=assets/showcase/dance1.jpeg_00.png width="170">
</td>
<td>
<img src=assets/showcase/dance1.gif width="170">
</td>
<td>
<img src=assets/showcase/girl3.jpeg_00.png width="170">
</td>
<td>
<img src=assets/showcase/girl3.gif width="170">
</td>
</tr> -->
<!-- <tr>
<td colspan="2">"zoom-in, a landscape, springtime"</td>
<td colspan="2">"A blonde woman rides on top of a moving <br>washing machine into the sunset."</td>
</tr>
<tr>
<td>
<img src=assets/showcase/Upscaled_Aime_Tribolet_springtime_landscape_golden_hour_morning_pale_yel_e6946f8d-37c1-4ce8-bf62-6ba90d23bd93.mp4_00.png width="170">
</td>
<td>
<img src=assets/showcase/Upscaled_Aime_Tribolet_springtime_landscape_golden_hour_morning_pale_yel_e6946f8d-37c1-4ce8-bf62-6ba90d23bd93.gif width="170">
</td>
<td>
<img src=assets/showcase/Upscaled_Alex__State_Blonde_woman_riding_on_top_of_a_moving_washing_mach_c31acaa3-dd30-459f-a109-2d2eb4c00fe2.mp4_00.png width="170">
</td>
<td>
<img src=assets/showcase/Upscaled_Alex__State_Blonde_woman_riding_on_top_of_a_moving_washing_mach_c31acaa3-dd30-459f-a109-2d2eb4c00fe2.gif width="170">
</td>
</tr>
<tr>
<td colspan="2">"explode colorful smoke coming out"</td>
<td colspan="2">"a bird on the tree branch"</td>
</tr>
<tr>
<td>
<img src=assets/showcase/explode0.jpeg_00.png width="170">
</td>
<td>
<img src=assets/showcase/explode0.gif width="170">
</td>
<td>
<img src=assets/showcase/bird000.jpeg width="170">
</td>
<td>
<img src=assets/showcase/bird000.gif width="170">
</td>
</tr> -->
</table >
### 2. Applications
#### 2.1 Storytelling video generation (see project page for more details)
<table class="center">
<!-- <tr style="font-weight: bolder;text-align:center;">
<td>Input</td>
<td>Output</td>
<td>Input</td>
<td>Output</td>
</tr> -->
<tr>
<td colspan="4"><img src=assets/application/storytellingvideo.gif width="250"></td>
</tr>
</table >
#### 2.2 Looping video generation
<table class="center">
<tr>
<td>
<img src=assets/application/60.gif width="300">
</td>
<td>
<img src=assets/application/35.gif width="300">
</td>
<td>
<img src=assets/application/36.gif width="300">
</td>
</tr>
<!-- <tr>
<td>
<img src=assets/application/05.gif width="300">
</td>
<td>
<img src=assets/application/25.gif width="300">
</td>
<td>
<img src=assets/application/34.gif width="300">
</td>
</tr> -->
</table >
#### 2.3 Generative frame interpolation
<table class="center">
<tr style="font-weight: bolder;text-align:center;">
<td>Input starting frame</td>
<td>Input ending frame</td>
<td>Generated video</td>
</tr>
<tr>
<td>
<img src=assets/application/gkxX0kb8mE8_input_start.png width="250">
</td>
<td>
<img src=assets/application/gkxX0kb8mE8_input_end.png width="250">
</td>
<td>
<img src=assets/application/gkxX0kb8mE8.gif width="250">
</td>
</tr>
<!-- <tr>
<td>
<img src=assets/application/YwHJYWvv_dM_input_start.png width="250">
</td>
<td>
<img src=assets/application/YwHJYWvv_dM_input_end.png width="250">
</td>
<td>
<img src=assets/application/YwHJYWvv_dM.gif width="250">
</td>
</tr>
<tr>
<td>
<img src=assets/application/ypDLB52Ykk4_input_start.png width="250">
</td>
<td>
<img src=assets/application/ypDLB52Ykk4_input_end.png width="250">
</td>
<td>
<img src=assets/application/ypDLB52Ykk4.gif width="250">
</td>
</tr> -->
</table >
## 📝 Changelog
- __[2024.02.05]__: 🔥🔥 Release high-resolution models (320x512 & 576x1024).
- __[2023.12.02]__: Launch the local Gradio demo.
- __[2023.11.29]__: Release the main model at a resolution of 256x256.
- __[2023.11.27]__: Launch the project page and update the arXiv preprint.
<br>
## 🧰 Models
|Model|Resolution|GPU Mem. & Inference Time (A100, ddim 50steps)|Checkpoint|
|:---------|:---------|:--------|:--------|
|DynamiCrafter1024|576x1024|18.3GB & 75s (`perframe_ae=True`)|[Hugging Face](https://huggingface.co/Doubiiu/DynamiCrafter_1024/blob/main/model.ckpt)|
|DynamiCrafter512|320x512|12.8GB & 20s (`perframe_ae=True`)|[Hugging Face](https://huggingface.co/Doubiiu/DynamiCrafter_512/blob/main/model.ckpt)|
|DynamiCrafter256|256x256|11.9GB & 10s (`perframe_ae=False`)|[Hugging Face](https://huggingface.co/Doubiiu/DynamiCrafter/blob/main/model.ckpt)|
Currently, DynamiCrafter can generate videos of up to 16 frames at a resolution of 576x1024. The inference time can be reduced by using fewer DDIM steps.
GPU memory consumed on an RTX 4090, as reported by @noguchis on [Twitter](https://x.com/noguchis/status/1754488826016432341?s=20): 18.3GB (576x1024), 12.8GB (320x512), 11.9GB (256x256).
<!-- It takes approximately 10 seconds and requires a peak GPU memory of 20 GB to animate an image using a single NVIDIA A100 (40G) GPU. -->
## ⚙️ Setup
### Install Environment via Anaconda (Recommended)
```bash
conda create -n dynamicrafter python=3.8.5
conda activate dynamicrafter
pip install -r requirements.txt
```
## 💫 Inference
### 1. Command line
1) Download the pretrained models via Hugging Face, and put the `model.ckpt` for the required resolution at `checkpoints/dynamicrafter_[1024|512|256]_v1/model.ckpt`.
2) Run the commands below in a terminal, according to your device and needs.
```bash
# Run on a single GPU:
# Select the model based on the required resolution, i.e., 1024|512|256:
sh scripts/run.sh 1024
# Run on multiple GPUs for parallel inference:
sh scripts/run_mp.sh 1024
```
### 2. Local Gradio demo
1. Download the pretrained models and put them in the corresponding directory according to the previous guidelines.
2. Run the following command in a terminal (choose a model based on the required resolution: 1024, 512, or 256).
```bash
python gradio_app.py --res 1024
```
Community Extensions: [ComfyUI](https://github.com/chaojie/ComfyUI-DynamiCrafter) (Thanks to [chaojie](https://github.com/chaojie)).
## 👨‍👩‍👧‍👦 Crafter Family
[VideoCrafter1](https://github.com/AILab-CVC/VideoCrafter): Framework for high-quality video generation.
[ScaleCrafter](https://github.com/YingqingHe/ScaleCrafter): Tuning-free method for high-resolution image/video generation.
[TaleCrafter](https://github.com/AILab-CVC/TaleCrafter): An interactive story visualization tool that supports multiple characters.
[LongerCrafter](https://github.com/arthur-qiu/LongerCrafter): Tuning-free method for longer high-quality video generation.
[MakeYourVideo, might be a Crafter:)](https://doubiiu.github.io/projects/Make-Your-Video/): Video generation/editing with textual and structural guidance.
[StyleCrafter](https://gongyeliu.github.io/StyleCrafter.github.io/): Stylized-image-guided text-to-image and text-to-video generation.
## 😉 Citation
```bib
@article{xing2023dynamicrafter,
title={DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors},
author={Xing, Jinbo and Xia, Menghan and Zhang, Yong and Chen, Haoxin and Yu, Wangbo and Liu, Hanyuan and Wang, Xintao and Wong, Tien-Tsin and Shan, Ying},
journal={arXiv preprint arXiv:2310.12190},
year={2023}
}
```
## 🙏 Acknowledgements
We would like to thank [AK(@_akhaliq)](https://twitter.com/_akhaliq?lang=en) for helping set up the Hugging Face online demo, and [camenduru](https://twitter.com/camenduru) for providing the Replicate and Colab online demos.
## 📢 Disclaimer
We developed this repository for RESEARCH purposes, so it may only be used for personal/research/non-commercial purposes.