Commit 1ad55bb4 authored by mashun1

i2vgen-xl

*.pkl
*.pt
*.mov
*.pth
*.npz
*.npy
*.obj
*.onnx
*.tar
*.bin
cache*
.DS_Store
*DS_Store
outputs/
workspace/experiments/
nohup*.txt
models/
i2vgen-xl
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
# i2vgen-xl
## Paper
**I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models**
* https://arxiv.org/abs/2311.04145
## Model Architecture
This is a two-stage video generation model whose main backbone in both stages is a `3D-Unet`. The first stage generates low-quality video; it includes `CLIP` for extracting high-level image information (such as semantic features), `D.Enc.` (the `Encoder` from `VQGAN`) for image compression, and `G.Enc.` for extracting low-level features (such as fine details). The second stage generates high-quality video: conditioned on the text, the first stage's output is resized and fed into an LDM, which runs the noising/denoising process to produce the final high-definition video.
![Alt text](readme_imgs/image-1.png)
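The cascaded flow above can be sketched in a few lines. This is a minimal, shape-only sketch: `stage1_base` and `stage2_refine` are hypothetical stand-ins for the real 3D-UNet diffusion models, and only the tensor shapes (frames, H, W, C) reflect the resolutions actually used in this repo's configs (448x256 base, 1280x704 refined).

```python
import numpy as np

def stage1_base(image, text):
    """Stage 1 stand-in: image + text -> 16 low-resolution frames (448x256)."""
    frames, h, w = 16, 256, 448
    return np.zeros((frames, h, w, 3))

def resize_video(video, size):
    """Nearest-neighbour per-frame resize (placeholder for real interpolation)."""
    h, w = size
    f, h0, w0, c = video.shape
    ys = np.arange(h) * h0 // h
    xs = np.arange(w) * w0 // w
    return video[:, ys][:, :, xs]

def stage2_refine(low_res_video, text):
    """Stage 2 stand-in: resized video is noised/denoised by an LDM at 1280x704."""
    return resize_video(low_res_video, (704, 1280))

prompt = "a green frog on a lotus leaf"
video = stage1_base(image=np.zeros((704, 1280, 3)), text=prompt)
hd_video = stage2_refine(video, text=prompt)
print(hd_video.shape)  # (16, 704, 1280, 3)
```

The point of the sketch is the hand-off: stage 2 never sees the raw image, only the resized stage-1 output plus the text condition.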
## Algorithm
The algorithm generates video in a cascaded fashion, splitting the task into two processes: one ensures the semantic coherence of the video, and the other enhances detail and increases resolution.
![alt text](readme_imgs/image-2.png)
## Environment Setup
### Docker (Method 1)
```shell
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
docker run --shm-size 10g --network=host --name=vgen --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl   # from whl.zip
pip install triton-2.1.0%2Bgit34f8189.abi0.dtk2310-cp38-cp38-manylinux2014_x86_64.whl   # download from the developer community
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh   # from whl.zip
# Install the following as needed
yum install epel-release -y
yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm -y
yum install ffmpeg ffmpeg-devel libSM libXext -y
```
### Docker (Method 2)
```shell
# Build from the directory containing the Dockerfile
docker build -t <IMAGE_NAME>:<TAG> .
docker run --shm-size 10g --network=host --name=vgen --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl   # from whl.zip
pip install triton-2.1.0%2Bgit34f8189.abi0.dtk2310-cp38-cp38-manylinux2014_x86_64.whl   # download from the developer community
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh   # from whl.zip
# Install the following as needed
yum install epel-release -y
yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm -y
yum install ffmpeg ffmpeg-devel libSM libXext -y
```
### Anaconda (Method 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the HPC developer community: https://developer.hpccube.com/tool/
    * DTK driver: dtk23.10.1
    * Python: 3.8
    * torch: 2.1.0
    * torchvision: 0.16.0
    * triton: 2.1.0

    Tip: the versions of the DTK driver, Python, torch, and the other DCU-related tools above must match each other exactly.
2. Install the remaining standard libraries per requirements.txt:
```shell
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl   # from whl.zip
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh   # from whl.zip
# Install as needed
conda install -c conda-forge ffmpeg
```
## Dataset
The authors have not released their training dataset, and the commonly used public datasets are currently unavailable for download.
## Inference
### Model Download
Download the weights from https://huggingface.co/ali-vilab/i2vgen-xl/tree/main and place them as follows:
```
i2vgen-xl/
├── i2vgen_xl_00854500.pth
├── open_clip_pytorch_model.bin
├── stable_diffusion_image_key_temporal_attention_x1.json
└── v2-1_512-ema-pruned.ckpt
```
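Before running inference it can save time to confirm the four weight files are all in place. This helper is not part of the repo; it is a small sketch that checks the directory layout shown above.

```python
import os

# The four weight files from the Hugging Face page above.
REQUIRED = [
    "i2vgen_xl_00854500.pth",
    "open_clip_pytorch_model.bin",
    "stable_diffusion_image_key_temporal_attention_x1.json",
    "v2-1_512-ema-pruned.ckpt",
]

def missing_files(model_dir="i2vgen-xl"):
    """Return the required files that are not present under model_dir."""
    return [f for f in REQUIRED if not os.path.isfile(os.path.join(model_dir, f))]

if __name__ == "__main__":
    missing = missing_files()
    print("all weights present" if not missing else f"missing: {missing}")
```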
### Command Line
```shell
python inference.py --cfg configs/i2vgen_xl_infer.yaml
python inference.py --cfg configs/i2vgen_xl_infer.yaml test_list_path data/test_list_for_i2vgen.txt test_model i2vgen-xl/i2vgen_xl_00854500.pth
```
`test_list_path` specifies the input image paths and their corresponding captions; follow the format demonstrated in data/test_list_for_i2vgen.txt. `test_model` is the path of the model checkpoint to load.
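As the demo list in this repo shows, each line pairs an image path with its caption using `|||` as the separator. A minimal parser sketch (`parse_test_list` is a hypothetical helper for illustration, not a function in the repo):

```python
def parse_test_list(lines):
    """Split 'path|||caption' lines into (path, caption) pairs, skipping blanks."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, caption = line.split("|||", 1)
        pairs.append((path, caption))
    return pairs

demo = [
    "img_0001.jpg|||A green frog floats on green lotus leaves, in a Chinese painting style.",
]
print(parse_test_list(demo)[0][0])  # img_0001.jpg
```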
### Gradio Page
```shell
python gradio_app.py
```
Note: the first run of this command downloads the default files. Once they have finished downloading, manually comment out the relevant code in `~/.cache/modelscope/modelscope_modules/i2vgen-xl/ms_wrapper.py`.
![alt text](readme_imgs/image-3.png)
## Results
||Input|Output|
|:---|:---|:---|
|Image|![alt text](readme_imgs/img_0001.jpg)|![alt text](readme_imgs/r.gif)|
|Prompt|A green frog floats on the surface of the water on green lotus leaves, with several pink lotus flowers, in a Chinese painting style.||
### Accuracy
## Application Scenarios
### Algorithm Category
`Video Generation`
### Key Application Industries
`Media, Research, Education`
## Source Repository & Issue Feedback
* https://developer.hpccube.com/codes/modelzoo/i2vgen-xl_pytorch
## References
* https://github.com/ali-vilab/VGen
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
build:
gpu: true
system_packages:
- libgl1-mesa-glx
- libglib2.0-0
- ffmpeg
python_version: "3.11"
python_packages:
- torch==2.0.1
- torchvision==0.15.2
- easydict==1.10
- tokenizers==0.15.0
- ftfy==6.1.1
- transformers==4.36.2
- imageio==2.33.1
- fairscale==0.4.13
- open-clip-torch==2.23.0
- chardet==5.2.0
- torchdiffeq==0.2.3
- opencv-python==4.9.0.80
- opencv-python-headless==4.9.0.80
- torchsde==0.2.6
- simplejson==3.19.2
- scikit-learn==1.3.2
- scikit-image==0.22.0
- rotary-embedding-torch==0.5.3
- pynvml==11.5.0
- triton==2.0.0
- pytorch-lightning==2.1.3
- torchmetrics==1.2.1
- PyYAML==6.0.1
run:
- pip install -U xformers --index-url https://download.pytorch.org/whl/cu118
predict: "predict.py:Predictor"
ENABLE: true
DATASET: webvid10m
TASK_TYPE: inference_higen_entrance
use_fp16: True
guide_scale: 12.0
chunk_size: 2
decoder_bs: 2
max_frames: 32
target_fps: 16 # FPS Conditions, not the encoding fps
scale: 8
seed: 0
round: 1
batch_size: 1
# For important input
vldm_cfg: configs/higen_train.yaml
test_list_path: data/text_list_for_t2v_share.txt
test_model: models/cvpr2024.t2v.e003.non_ema_0725000.pth
motion_factor: 500
appearance_factor: 1.0
TASK_TYPE: train_t2v_higen_entrance
ENABLE: true
use_ema: true
num_workers: 6
frame_lens: [32, 32, 32, 32, 32, 32, 32, 32]
sample_fps: [8, 8, 8, 8, 8, 8, 8, 8]
resolution: [448, 256]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'models/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_HiGen',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'concat_dim': 4,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False,
'context_embedding_depth': 2,
'num_tokens': 16
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'linear_sd', # linear_sd
'schedule_param': {
'num_timesteps': 1000,
'zero_terminal_snr': True,
'init_beta': 0.00085,
'last_beta': 0.0120
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 256,
"4": 96,
"8": 48,
"16": 32,
"24": 24,
"32": 10
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'models/i2vgen_xl_00854500.pth',
'sd_keys_path': 'models/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
TASK_TYPE: inference_i2vgen_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions, not the encoding fps
scale: 8
seed: 8888
round: 4
batch_size: 1
use_zero_infer: True
# For important input
vldm_cfg: configs/i2vgen_xl_train.yaml
test_list_path: data/test_list_for_i2vgen.txt
test_model: i2vgen-xl/i2vgen_xl_00854500.pth
TASK_TYPE: inference_i2vgen_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions
scale: 8
batch_size: 1
use_zero_infer: True
# For important input
round: 4
seed: 0
data_root: workspace/test_imgs/test_img_01
# test_list_path: workspace/test_imgs/test_img_01.txt
test_list_path: workspace/test_imgs/test_img_02.txt
cap_dict_path: workspace/test_imgs/cap_dict_01.json
vldm_cfg: configs/i2vgen_xl_train.yaml
test_model: i2vgen-xl/i2vgen_xl_person_00854500.pth
TASK_TYPE: train_i2v_vs_img_text_entrance
ENABLE: true
use_ema: true
num_workers: 6
frame_lens: [16, 16, 16, 16, 16, 32, 32, 32]
sample_fps: [8, 8, 16, 16, 16, 8, 16, 16]
resolution: [1280, 704]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [1280, 704],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [1280, 704],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'i2vgen-xl/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_I2VGen',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'concat_dim': 4,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'cosine', # cosine
'schedule_param': {
'num_timesteps': 1000,
'cosine_s': 0.008,
'zero_terminal_snr': True,
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 32,
"4": 8,
"8": 4,
"16": 2,
  "32": 1
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'i2vgen-xl/i2vgen_xl_00854500.pth',
'sd_keys_path': 'i2vgen-xl/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
TASK_TYPE: inference_sr600_entrance
use_fp16: True
vldm_cfg: ''
round: 1
batch_size: 1
# For important input
test_list_path: data/text_list_for_t2v_share.txt
test_model: models/sr_step_110000_ema.pth
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'i2vgen-xl/models/open_clip_pytorch_model.bin',
'negative_prompt': 'worst quality, normal quality, low quality, low res, blurry, text, watermark, logo, banner, extra digits, cropped, jpeg artifacts, signature, username, error, sketch ,duplicate, ugly, monochrome, horror, geometry, mutation, disgusting',
'positive_prompt': ', cinematic, High Contrast, highly detailed, Unreal Engine 5, no blur, full length ultra-wide angle shot a cinematic scene, taken using a Canon EOS R camera, hyper detailed photo - realistic maximum detail, 32k, Color Grading, portrait Photography, ultra HD, extreme meticulous detailing, skin pore detailing, hyper sharpness, perfect without deformations, 4k render'
}
UNet: {
'type': 'UNetSD_SR600',
'in_dim': 4,
'dim': 320,
'y_dim': 1024,
'context_dim': 1024,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'head_dim': 64,
'num_res_blocks': 2,
'attn_scales' :[1, 0.5, 0.25],
'use_scale_shift_norm': True,
'dropout': 0.1,
'temporal_attn_times': 1,
'temporal_attention': True,
'use_checkpoint': True,
'use_image_dataset': False,
'use_sim_mask': False,
'inpainting': True
}
Diffusion: {
'type': 'DiffusionDDIMSR',
'reverse_diffusion': {
'schedule': 'cosine',
'mean_type': 'v',
'schedule_param':
{
'num_timesteps': 1000,
'zero_terminal_snr': True
}
},
'forward_diffusion': {
'schedule': 'logsnr_cosine_interp',
'mean_type': 'v',
'schedule_param':
{
'num_timesteps': 1000,
'zero_terminal_snr': True,
'scale_min': 2.0,
'scale_max': 4.0
}
}
}
batch_sizes: {
"1": 256,
"4": 96,
"8": 48,
"16": 32,
"24": 24,
"32": 10
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
total_noise_levels: 700
TASK_TYPE: inference_text2video_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions, not encoding fps
scale: 8
batch_size: 1
use_zero_infer: True
# For important input
round: 4
seed: 8888
test_list_path: data/text_img_for_t2v.txt
vldm_cfg: configs/t2v_train.yaml
test_model: workspace/model_bk/model_scope_0267000.pth
TASK_TYPE: train_t2v_entrance
ENABLE: true
use_ema: false
num_workers: 6
frame_lens: [1, 16, 16, 16, 16, 32, 32, 32]
sample_fps: [1, 8, 16, 16, 16, 8, 16, 16]
resolution: [448, 256]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'models/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_T2VBase',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'misc_dropout': 0.4,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'cosine', # cosine
'schedule_param': {
'num_timesteps': 1000,
'cosine_s': 0.008,
'zero_terminal_snr': True,
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 32,
"4": 8,
"8": 4,
"16": 4,
"32": 2
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
['y', 'fps'],
],
'use_offset_noise': False,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'workspace/model_bk/model_scope_0267000.pth',
'sd_keys_path': 'data/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.1
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 5 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 8888
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea