Commit 1ad55bb4 authored by mashun1

i2vgen-xl

*.pkl
*.pt
*.mov
*.pth
*.npz
*.npy
*.obj
*.onnx
*.tar
*.bin
cache*
.DS_Store
*DS_Store
outputs/
workspace/experiments/
nohup*.txt
models/
i2vgen-xl
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
# i2vgen-xl
## Paper
**I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models**
* https://arxiv.org/abs/2311.04145
## Model Architecture
This is a two-stage video generation model whose main backbone in both stages is a `3D-Unet`. The first stage generates low-quality video; it includes `CLIP` for extracting high-level image information (such as semantic features), `D.Enc.` (the `Encoder` from `VQGAN`) for image compression, and `G.Enc.` for extracting low-level features (such as fine details). The second stage generates high-quality video: conditioned on the text, the first stage's output is resized and fed into an LDM, which runs the noising/denoising process to produce the final high-definition video.
![Alt text](readme_imgs/image-1.png)
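The cascaded flow above can be sketched in a few lines. This is a minimal, shape-only sketch: `stage1_base` and `stage2_refine` are hypothetical stand-ins for the real 3D-UNet diffusion models, and only the tensor shapes (frames, H, W, C) reflect the resolutions actually used in this repo's configs (448x256 base, 1280x704 refined).

```python
import numpy as np

def stage1_base(image, text):
    """Stage 1 stand-in: image + text -> 16 low-resolution frames (448x256)."""
    frames, h, w = 16, 256, 448
    return np.zeros((frames, h, w, 3))

def resize_video(video, size):
    """Nearest-neighbour per-frame resize (placeholder for real interpolation)."""
    h, w = size
    f, h0, w0, c = video.shape
    ys = np.arange(h) * h0 // h
    xs = np.arange(w) * w0 // w
    return video[:, ys][:, :, xs]

def stage2_refine(low_res_video, text):
    """Stage 2 stand-in: resized video is noised/denoised by an LDM at 1280x704."""
    return resize_video(low_res_video, (704, 1280))

prompt = "a green frog on a lotus leaf"
video = stage1_base(image=np.zeros((704, 1280, 3)), text=prompt)
hd_video = stage2_refine(video, text=prompt)
print(hd_video.shape)  # (16, 704, 1280, 3)
```

The point of the sketch is the hand-off: stage 2 never sees the raw image, only the resized stage-1 output plus the text condition.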
## Algorithm
The algorithm generates video in a cascaded fashion, splitting the task into two processes: one ensures the semantic coherence of the video, and the other enhances detail and increases resolution.
![alt text](readme_imgs/image-2.png)
## Environment Setup
### Docker (Method 1)
```shell
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py38
docker run --shm-size 10g --network=host --name=vgen --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl   # from whl.zip
pip install triton-2.1.0%2Bgit34f8189.abi0.dtk2310-cp38-cp38-manylinux2014_x86_64.whl   # download from the developer community
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh   # from whl.zip
# Install the following as needed
yum install epel-release -y
yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm -y
yum install ffmpeg ffmpeg-devel libSM libXext -y
```
### Docker (Method 2)
```shell
# Build from the directory containing the Dockerfile
docker build -t <IMAGE_NAME>:<TAG> .
docker run --shm-size 10g --network=host --name=vgen --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path to this project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl   # from whl.zip
pip install triton-2.1.0%2Bgit34f8189.abi0.dtk2310-cp38-cp38-manylinux2014_x86_64.whl   # download from the developer community
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh   # from whl.zip
# Install the following as needed
yum install epel-release -y
yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm -y
yum install ffmpeg ffmpeg-devel libSM libXext -y
```
### Anaconda (Method 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the HPC developer community: https://developer.hpccube.com/tool/
    * DTK driver: dtk23.10.1
    * Python: 3.8
    * torch: 2.1.0
    * torchvision: 0.16.0
    * triton: 2.1.0

    Tip: the versions of the DTK driver, Python, torch, and the other DCU-related tools above must match each other exactly.
2. Install the remaining standard libraries per requirements.txt:
```shell
pip install -r requirements.txt
pip install flash_attn-2.0.4_torch2.1_dtk2310-cp38-cp38-linux_x86_64.whl   # from whl.zip
cd xformers && pip install xformers==0.0.23 --no-deps && bash patch_xformers.rocm.sh   # from whl.zip
# Install as needed
conda install -c conda-forge ffmpeg
```
## Dataset
The authors have not released their training dataset, and the commonly used public datasets are currently unavailable for download.
## Inference
### Model Download
Download the weights from https://huggingface.co/ali-vilab/i2vgen-xl/tree/main and place them as follows:
```
i2vgen-xl/
├── i2vgen_xl_00854500.pth
├── open_clip_pytorch_model.bin
├── stable_diffusion_image_key_temporal_attention_x1.json
└── v2-1_512-ema-pruned.ckpt
```
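Before running inference it can save time to confirm the four weight files are all in place. This helper is not part of the repo; it is a small sketch that checks the directory layout shown above.

```python
import os

# The four weight files from the Hugging Face page above.
REQUIRED = [
    "i2vgen_xl_00854500.pth",
    "open_clip_pytorch_model.bin",
    "stable_diffusion_image_key_temporal_attention_x1.json",
    "v2-1_512-ema-pruned.ckpt",
]

def missing_files(model_dir="i2vgen-xl"):
    """Return the required files that are not present under model_dir."""
    return [f for f in REQUIRED if not os.path.isfile(os.path.join(model_dir, f))]

if __name__ == "__main__":
    missing = missing_files()
    print("all weights present" if not missing else f"missing: {missing}")
```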
### Command Line
```shell
python inference.py --cfg configs/i2vgen_xl_infer.yaml
python inference.py --cfg configs/i2vgen_xl_infer.yaml test_list_path data/test_list_for_i2vgen.txt test_model i2vgen-xl/i2vgen_xl_00854500.pth
```
`test_list_path` specifies the input image paths and their corresponding captions; follow the format demonstrated in data/test_list_for_i2vgen.txt. `test_model` is the path of the model checkpoint to load.
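As the demo list in this repo shows, each line pairs an image path with its caption using `|||` as the separator. A minimal parser sketch (`parse_test_list` is a hypothetical helper for illustration, not a function in the repo):

```python
def parse_test_list(lines):
    """Split 'path|||caption' lines into (path, caption) pairs, skipping blanks."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, caption = line.split("|||", 1)
        pairs.append((path, caption))
    return pairs

demo = [
    "img_0001.jpg|||A green frog floats on green lotus leaves, in a Chinese painting style.",
]
print(parse_test_list(demo)[0][0])  # img_0001.jpg
```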
### Gradio Page
```shell
python gradio_app.py
```
Note: the first run of this command downloads the default files. Once they have finished downloading, manually comment out the relevant code in `~/.cache/modelscope/modelscope_modules/i2vgen-xl/ms_wrapper.py`.
![alt text](readme_imgs/image-3.png)
## Results
||Input|Output|
|:---|:---|:---|
|Image|![alt text](readme_imgs/img_0001.jpg)|![alt text](readme_imgs/r.gif)|
|Prompt|A green frog floats on the surface of the water on green lotus leaves, with several pink lotus flowers, in a Chinese painting style.||
### Accuracy
## Application Scenarios
### Algorithm Category
`Video Generation`
### Key Application Industries
`Media, Research, Education`
## Source Repository & Issue Feedback
* https://developer.hpccube.com/codes/modelzoo/i2vgen-xl_pytorch
## References
* https://github.com/ali-vilab/VGen
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
build:
gpu: true
system_packages:
- libgl1-mesa-glx
- libglib2.0-0
- ffmpeg
python_version: "3.11"
python_packages:
- torch==2.0.1
- torchvision==0.15.2
- easydict==1.10
- tokenizers==0.15.0
- ftfy==6.1.1
- transformers==4.36.2
- imageio==2.33.1
- fairscale==0.4.13
- open-clip-torch==2.23.0
- chardet==5.2.0
- torchdiffeq==0.2.3
- opencv-python==4.9.0.80
- opencv-python-headless==4.9.0.80
- torchsde==0.2.6
- simplejson==3.19.2
- scikit-learn==1.3.2
- scikit-image==0.22.0
- rotary-embedding-torch==0.5.3
- pynvml==11.5.0
- triton==2.0.0
- pytorch-lightning==2.1.3
- torchmetrics==1.2.1
- PyYAML==6.0.1
run:
- pip install -U xformers --index-url https://download.pytorch.org/whl/cu118
predict: "predict.py:Predictor"
ENABLE: true
DATASET: webvid10m
TASK_TYPE: inference_higen_entrance
use_fp16: True
guide_scale: 12.0
chunk_size: 2
decoder_bs: 2
max_frames: 32
target_fps: 16 # FPS Conditions, not the encoding fps
scale: 8
seed: 0
round: 1
batch_size: 1
# For important input
vldm_cfg: configs/higen_train.yaml
test_list_path: data/text_list_for_t2v_share.txt
test_model: models/cvpr2024.t2v.e003.non_ema_0725000.pth
motion_factor: 500
appearance_factor: 1.0
TASK_TYPE: train_t2v_higen_entrance
ENABLE: true
use_ema: true
num_workers: 6
frame_lens: [32, 32, 32, 32, 32, 32, 32, 32]
sample_fps: [8, 8, 8, 8, 8, 8, 8, 8]
resolution: [448, 256]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'models/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_HiGen',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'concat_dim': 4,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False,
'context_embedding_depth': 2,
'num_tokens': 16
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'linear_sd', # linear_sd
'schedule_param': {
'num_timesteps': 1000,
'zero_terminal_snr': True,
'init_beta': 0.00085,
'last_beta': 0.0120
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 256,
"4": 96,
"8": 48,
"16": 32,
"24": 24,
"32": 10
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'models/i2vgen_xl_00854500.pth',
'sd_keys_path': 'models/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
TASK_TYPE: inference_i2vgen_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions, not the encoding fps
scale: 8
seed: 8888
round: 4
batch_size: 1
use_zero_infer: True
# For important input
vldm_cfg: configs/i2vgen_xl_train.yaml
test_list_path: data/test_list_for_i2vgen.txt
test_model: i2vgen-xl/i2vgen_xl_00854500.pth
TASK_TYPE: inference_i2vgen_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions
scale: 8
batch_size: 1
use_zero_infer: True
# For important input
round: 4
seed: 0
data_root: workspace/test_imgs/test_img_01
# test_list_path: workspace/test_imgs/test_img_01.txt
test_list_path: workspace/test_imgs/test_img_02.txt
cap_dict_path: workspace/test_imgs/cap_dict_01.json
vldm_cfg: configs/i2vgen_xl_train.yaml
test_model: i2vgen-xl/i2vgen_xl_person_00854500.pth
TASK_TYPE: train_i2v_vs_img_text_entrance
ENABLE: true
use_ema: true
num_workers: 6
frame_lens: [16, 16, 16, 16, 16, 32, 32, 32]
sample_fps: [8, 8, 16, 16, 16, 8, 16, 16]
resolution: [1280, 704]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [1280, 704],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [1280, 704],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'i2vgen-xl/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_I2VGen',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'concat_dim': 4,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'cosine', # cosine
'schedule_param': {
'num_timesteps': 1000,
'cosine_s': 0.008,
'zero_terminal_snr': True,
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 32,
"4": 8,
"8": 4,
"16": 2,
  "32": 1
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'i2vgen-xl/i2vgen_xl_00854500.pth',
'sd_keys_path': 'i2vgen-xl/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
TASK_TYPE: inference_sr600_entrance
use_fp16: True
vldm_cfg: ''
round: 1
batch_size: 1
# For important input
test_list_path: data/text_list_for_t2v_share.txt
test_model: models/sr_step_110000_ema.pth
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'i2vgen-xl/models/open_clip_pytorch_model.bin',
'negative_prompt': 'worst quality, normal quality, low quality, low res, blurry, text, watermark, logo, banner, extra digits, cropped, jpeg artifacts, signature, username, error, sketch ,duplicate, ugly, monochrome, horror, geometry, mutation, disgusting',
'positive_prompt': ', cinematic, High Contrast, highly detailed, Unreal Engine 5, no blur, full length ultra-wide angle shot a cinematic scene, taken using a Canon EOS R camera, hyper detailed photo - realistic maximum detail, 32k, Color Grading, portrait Photography, ultra HD, extreme meticulous detailing, skin pore detailing, hyper sharpness, perfect without deformations, 4k render'
}
UNet: {
'type': 'UNetSD_SR600',
'in_dim': 4,
'dim': 320,
'y_dim': 1024,
'context_dim': 1024,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'head_dim': 64,
'num_res_blocks': 2,
'attn_scales' :[1, 0.5, 0.25],
'use_scale_shift_norm': True,
'dropout': 0.1,
'temporal_attn_times': 1,
'temporal_attention': True,
'use_checkpoint': True,
'use_image_dataset': False,
'use_sim_mask': False,
'inpainting': True
}
Diffusion: {
'type': 'DiffusionDDIMSR',
'reverse_diffusion': {
'schedule': 'cosine',
'mean_type': 'v',
'schedule_param':
{
'num_timesteps': 1000,
'zero_terminal_snr': True
}
},
'forward_diffusion': {
'schedule': 'logsnr_cosine_interp',
'mean_type': 'v',
'schedule_param':
{
'num_timesteps': 1000,
'zero_terminal_snr': True,
'scale_min': 2.0,
'scale_max': 4.0
}
}
}
batch_sizes: {
"1": 256,
"4": 96,
"8": 48,
"16": 32,
"24": 24,
"32": 10
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
# ['y', 'local_image', 'fps'],
# ['image', 'local_image', 'fps'],
['y', 'image', 'local_image', 'fps']
],
'use_offset_noise': True,
'guide_scale': 9.0,
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.0
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 50 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 6666
total_noise_levels: 700
TASK_TYPE: inference_text2video_entrance
use_fp16: True
guide_scale: 9.0
chunk_size: 2
decoder_bs: 2
max_frames: 16
target_fps: 16 # FPS Conditions, not encoding fps
scale: 8
batch_size: 1
use_zero_infer: True
# For important input
round: 4
seed: 8888
test_list_path: data/text_img_for_t2v.txt
vldm_cfg: configs/t2v_train.yaml
test_model: workspace/model_bk/model_scope_0267000.pth
TASK_TYPE: train_t2v_entrance
ENABLE: true
use_ema: false
num_workers: 6
frame_lens: [1, 16, 16, 16, 16, 32, 32, 32]
sample_fps: [1, 8, 16, 16, 16, 8, 16, 16]
resolution: [448, 256]
vit_resolution: [224, 224]
vid_dataset: {
'type': 'VideoDataset',
'data_list': ['data/vid_list.txt', ],
'data_dir_list': ['data/videos/', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'get_first_frame': True,
'max_words': 1000,
}
img_dataset: {
'type': 'ImageDataset',
'data_list': ['data/img_list.txt', ],
'data_dir_list': ['data/images', ],
'vit_resolution': [224, 224],
'resolution': [448, 256],
'max_words': 1000
}
embedder: {
'type': 'FrozenOpenCLIPTextVisualEmbedder',
'layer': 'penultimate',
'vit_resolution': [224, 224],
'pretrained': 'models/open_clip_pytorch_model.bin'
}
UNet: {
'type': 'UNetSD_T2VBase',
'in_dim': 4,
'y_dim': 1024,
'upper_len': 128,
'context_dim': 1024,
'out_dim': 4,
'dim_mult': [1, 2, 4, 4],
'num_heads': 8,
'default_fps': 8,
'head_dim': 64,
'num_res_blocks': 2,
'dropout': 0.1,
'misc_dropout': 0.4,
'temporal_attention': True,
'temporal_attn_times': 1,
'use_checkpoint': True,
'use_fps_condition': False,
'use_sim_mask': False
}
Diffusion: {
'type': 'DiffusionDDIM',
'schedule': 'cosine', # cosine
'schedule_param': {
'num_timesteps': 1000,
'cosine_s': 0.008,
'zero_terminal_snr': True,
},
'mean_type': 'v',
'loss_type': 'mse',
'var_type': 'fixed_small',
'rescale_timesteps': False,
'noise_strength': 0.1
}
batch_sizes: {
"1": 32,
"4": 8,
"8": 4,
"16": 4,
"32": 2
}
visual_train: {
'type': 'VisualTrainTextImageToVideo',
'partial_keys': [
['y', 'fps'],
],
'use_offset_noise': False,
'guide_scale': 9.0,
}
Pretrain: {
'type': pretrain_specific_strategies,
'fix_weight': False,
'grad_scale': 0.5,
'resume_checkpoint': 'workspace/model_bk/model_scope_0267000.pth',
'sd_keys_path': 'data/stable_diffusion_image_key_temporal_attention_x1.json',
}
chunk_size: 4
decoder_bs: 4
lr: 0.00003
noise_strength: 0.1
# classifier-free guidance
p_zero: 0.1
guide_scale: 3.0
num_steps: 1000000
use_zero_infer: True
viz_interval: 5 # 200
save_ckp_interval: 50 # 500
# Log
log_dir: "workspace/experiments"
log_interval: 1
seed: 8888
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea
s09_009187_091873942.jpg|||FOTON 4x2 4x4 Right Hand Drive Mobile Outdoor Waterproof LED Advertising Truck Manufacturer
s09_006882_068827514.jpg|||China Electric Propulsion Outboards 6HP 10HP 20HP for high
s09_003750_037507367.jpg|||Fish Farming Use HDPE Net Cage in The Sea