<div align="center">
<h1> InstructVideo: Instructing Video Diffusion Models<br>with Human Feedback
</h1>
<div>
<a href='https://jacobyuan7.github.io/' target='_blank'>Hangjie Yuan</a>&emsp;
<a href='https://scholar.google.com/citations?user=ZO3OQ-8AAAAJ&hl=en&oi=ao' target='_blank'>Shiwei Zhang</a>&emsp;
<a href='https://scholar.google.com/citations?user=cQbXvkcAAAAJ&hl=en' target='_blank'>Xiang Wang</a>&emsp;
<a href='https://scholar.google.com/citations?hl=zh-CN&user=grn93WcAAAAJ' target='_blank'>Yujie Wei</a>&emsp;
<a href='https://scholar.google.com/citations?user=JT8hRbgAAAAJ&hl=en' target='_blank'>Tao Feng</a>&emsp;<br>
<!-- Yining Pan&emsp;<br> -->
<a href='https://pynsigrid.github.io/' target='_blank'>Yining Pan</a>&emsp;
<a href='https://scholar.google.com/citations?user=16RDSEUAAAAJ&hl=en' target='_blank'>Yingya Zhang</a>&emsp;
<a href='https://liuziwei7.github.io/' target='_blank'>Ziwei Liu</a>&emsp;
<a href='https://samuelalbanie.com/' target='_blank'>Samuel Albanie</a>&emsp;
<a href='https://scholar.google.com/citations?user=boUZ-jwAAAAJ&hl=en' target='_blank'>Dong Ni</a>&emsp;
</div>
<br>
[![arXiv](https://img.shields.io/badge/arXiv-InstructVideo-<COLOR>.svg)](https://arxiv.org/abs/2312.12490)
[![Project Page](https://img.shields.io/badge/Project_Page-InstructVideo-<COLOR>.svg)](https://instructvideo.github.io/)
[![GitHub Stars](https://img.shields.io/github/stars/damo-vilab/i2vgen-xl?style=social)](https://github.com/damo-vilab/i2vgen-xl)
[![GitHub Forks](https://img.shields.io/github/forks/damo-vilab/i2vgen-xl)](https://github.com/damo-vilab/i2vgen-xl)
[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fdamo-vilab%2Fi2vgen-xl&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com)
</div>
> Abstract:
> Diffusion models have emerged as the de facto paradigm for video generation.
> However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts.
> To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning.
> InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2.
> To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning.
> Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities.
> Code and models will be made publicly available at this repo.
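For intuition, the sketch below illustrates the two reward ingredients described above: frames are scored sparsely within temporal segments (Segmental Video Reward) and the per-segment scores are combined with a decaying weight (Temporally Attenuated Reward). This is a minimal, hypothetical sketch under assumed tensor shapes; `segmental_sparse_reward`, `reward_fn`, `num_segments`, and `gamma` are placeholder names and do not correspond to the code that will be released here.
```python
# Hypothetical sketch only: not the released InstructVideo implementation.
import torch


def segmental_sparse_reward(frames, reward_fn, num_segments=4, gamma=0.9):
    """frames: decoded video of shape [B, F, C, H, W];
    reward_fn: an image reward model (e.g., HPSv2) mapping [B, C, H, W] -> per-sample scores [B]."""
    b, f, c, h, w = frames.shape
    # Segmental sparse sampling: split the temporal axis into segments and score one random frame per segment.
    bounds = torch.linspace(0, f, num_segments + 1).long()
    weighted_scores = []
    for i in range(num_segments):
        lo, hi = bounds[i].item(), bounds[i + 1].item()
        idx = torch.randint(lo, max(hi, lo + 1), (1,)).item()
        score = reward_fn(frames[:, idx])           # [B]
        weighted_scores.append(gamma ** i * score)  # temporally attenuated weight
    # The fine-tuning objective would maximize this weighted reward (i.e., minimize its negative).
    return torch.stack(weighted_scores, dim=0).mean()
```
In reward fine-tuning recast as editing, such a reward would be computed on videos obtained through only a partial DDIM sampling chain rather than full sampling.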
## Todo list
Note: if you cannot access the links provided below, try another browser, contact me by e-mail, or raise an issue.
- [ ] 🕘 Release code for fine-tuning and inference.
- [ ] 🕘 Release pre-training and fine-tuning data list (should be obtained from WebVid10M).
- [ ] 🕘 Release pre-training and fine-tuned checkpoints.
## Configure the Environment
Please refer to the main [README](https://github.com/damo-vilab/i2vgen-xl/blob/main/README.MD) to configure the environment.
## Fine-tuning and Inference
Pre-trained models and details on InstructVideo fine-tuning and inference are coming soon. Stay tuned!
## Citation
```bibtex
@article{2023InstructVideo,
    title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},
    author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},
    journal={arXiv preprint arXiv:2312.12490},
    year={2023}
}
```
# I2VGen-XL
Official repo for [I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://arxiv.org/abs/2311.04145).
Please see [Project Page](https://i2vgen-xl.github.io) for more examples.
![method](../source/i2vgen_fig_02.jpg "method")
I2VGen-XL generates high-quality, realistically animated, and temporally coherent high-definition videos from a single static image and a user-provided text description.
*Our initial version has already been open-sourced on [Modelscope](https://modelscope.cn/models/damo/Image-to-Video/summary). This project focuses on improving that version, especially in terms of motion and semantics.*
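As a minimal sketch of running that ModelScope version (mirroring the Gradio demo included in this repo); the image path and caption below are placeholders:
```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Same task, model id, and revision as the Gradio demo in this repo.
image_to_video_pipe = pipeline(
    task="image-to-video", model="damo/i2vgen-xl", model_revision="v1.1.3", device="cuda:0"
)

# Pass a local image path plus an English caption; the pipeline returns the path of the generated video.
output = image_to_video_pipe("example_input.jpg", caption="A red panda walking through the forest")
print(output[OutputKeys.OUTPUT_VIDEO])
```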
## Examples
![figure2](../source/i2vgen_fig_04.png "figure2")
import os
# os.system('pip install "modelscope" --upgrade -f https://pypi.org/project/modelscope/')
# os.system('pip install "gradio==3.39.0"')
import gradio as gr
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
image_to_video_pipe = pipeline(task="image-to-video", model='damo/i2vgen-xl', model_revision='v1.1.3', device='cuda:0')
def upload_file(file):
    return file.name


def image_to_video(image_in, text_in):
    if image_in is None:
        raise gr.Error('请上传图片或等待图片上传完成')
    print(image_in)
    output_video_path = image_to_video_pipe(image_in, caption=text_in)[OutputKeys.OUTPUT_VIDEO]
    print(output_video_path)
    return output_video_path


with gr.Blocks() as demo:
    gr.Markdown(
        """<center><font size=7>I2VGen-XL</center>
        <left><font size=3>I2VGen-XL可以根据用户输入的静态图像和文本生成目标接近、语义相同的视频,生成的视频具高清(1280 * 720)、宽屏(16:9)、时序连贯、质感好等特点。</left>
        <left><font size=3>I2VGen-XL can generate videos with similar contents and semantics based on user input static images and text. The generated videos have characteristics such as high-definition (1280 * 720), widescreen (16:9), coherent timing, and good texture.</left>
        """
    )
    with gr.Box():
        gr.Markdown(
            """<left><font size=3>选择合适的图片进行上传,并补充对视频内容的英文文本描述,然后点击“生成视频”。</left>
            <left><font size=3>Please choose the image to upload (we recommend the image size be 1280 * 720), provide the English text description of the video you wish to create, and then click on "Generate Video" to receive the generated video.</left>"""
        )
        with gr.Row():
            with gr.Column():
                text_in = gr.Textbox(label="文本描述", lines=2, elem_id="text-in")
                image_in = gr.Image(label="图片输入", type="filepath", interactive=False, elem_id="image-in", height=300)
                with gr.Row():
                    upload_image = gr.UploadButton("上传图片", file_types=["image"], file_count="single")
                    image_submit = gr.Button("生成视频🎬")
            with gr.Column():
                video_out_1 = gr.Video(label='生成的视频', elem_id='video-out_1', interactive=False, height=300)
    gr.Markdown("<left><font size=2>注:如果生成的视频无法播放,请尝试升级浏览器或使用chrome浏览器。</left>")
    upload_image.upload(upload_file, upload_image, image_in, queue=False)
    image_submit.click(fn=image_to_video, inputs=[image_in, text_in], outputs=[video_out_1])

demo.queue(status_update_rate=1, api_open=False).launch(share=False, show_error=True, server_name="0.0.0.0")
import os
import sys
import copy
import json
import math
import random
import logging
import itertools
import numpy as np
from utils.config import Config
from utils.registry_class import INFER_ENGINE
from tools import *
if __name__ == '__main__':
    cfg_update = Config(load=True)
    INFER_ENGINE.build(dict(type=cfg_update.TASK_TYPE), cfg_update=cfg_update.cfg_dict)
# Unique model identifier
modelCode=568
# Model name
modelName=i2vgen-xl_pytorch
# Model description
modelDescription=i2vgen-xl can turn a static image into a high-definition animated video
# Application scenarios
appScenario=inference, video generation, media, research, education
# Framework type
frameType=pytorch
# Prediction interface for Cog ⚙️
# https://github.com/replicate/cog/blob/main/docs/python.md
import os
import yaml
import pynvml
from PIL import Image
import torch.distributed as dist
import torch
import torch.cuda.amp as amp
from torch.nn.parallel import DistributedDataParallel
from einops import rearrange
from cog import BasePredictor, Input, Path
from tools.modules.config import cfg
from utils.multi_port import find_free_port
from utils.seed import setup_seed
from utils.video_op import save_i2vgen_video, save_i2vgen_video_safe
from utils.assign_cfg import assign_signle_cfg
from utils.registry_class import MODEL, EMBEDDER, AUTO_ENCODER, DIFFUSION
import utils.transforms as data
class Predictor(BasePredictor):
    def setup(self) -> None:
        """Load the model into memory to make running multiple predictions efficient"""
        with open("configs/i2vgen_xl_infer.yaml", "r") as file:
            config = yaml.safe_load(file)
        self.cfg = assign_signle_cfg(cfg, config, "vldm_cfg")
        for k, v in config.items():
            if isinstance(v, dict) and k in self.cfg:
                self.cfg[k].update(v)
            else:
                self.cfg[k] = v

        if "MASTER_ADDR" not in os.environ:
            os.environ["MASTER_ADDR"] = "localhost"
            os.environ["MASTER_PORT"] = find_free_port()
        self.cfg.gpu = 0
        self.cfg.pmi_rank = int(os.getenv("RANK", 0))
        self.cfg.pmi_world_size = int(os.getenv("WORLD_SIZE", 1))
        self.cfg.gpus_per_machine = torch.cuda.device_count()
        self.cfg.world_size = self.cfg.pmi_world_size * self.cfg.gpus_per_machine
        torch.cuda.set_device(self.cfg.gpu)
        torch.backends.cudnn.benchmark = True
        self.cfg.rank = self.cfg.pmi_rank * self.cfg.gpus_per_machine + self.cfg.gpu
        dist.init_process_group(
            backend="nccl", world_size=self.cfg.world_size, rank=self.cfg.rank
        )

        # [Diffusion]
        self.diffusion = DIFFUSION.build(self.cfg.Diffusion)

        # [Model] embedder
        self.clip_encoder = EMBEDDER.build(self.cfg.embedder)
        self.clip_encoder.model.to(self.cfg.gpu)
        _, _, zero_y_negative = self.clip_encoder(text=self.cfg.negative_prompt)
        self.zero_y_negative = zero_y_negative.detach()
        self.black_image_feature = torch.zeros([1, 1, self.cfg.UNet.y_dim]).cuda()

        # [Model] autoencoder
        self.autoencoder = AUTO_ENCODER.build(self.cfg.auto_encoder)
        self.autoencoder.eval()  # freeze
        for param in self.autoencoder.parameters():
            param.requires_grad = False
        self.autoencoder.cuda()

        # [Model] UNet
        self.model = MODEL.build(self.cfg.UNet)
        checkpoint_dict = torch.load(self.cfg.test_model, map_location="cpu")
        state_dict = checkpoint_dict["state_dict"]
        status = self.model.load_state_dict(state_dict, strict=True)
        print("Load model from {} with status {}".format(self.cfg.test_model, status))
        self.model = self.model.to(self.cfg.gpu)
        self.model.eval()
        self.model = DistributedDataParallel(self.model, device_ids=[self.cfg.gpu])
        torch.cuda.empty_cache()
        print("Models loaded!")

    def predict(
        self,
        image: Path = Input(description="Input image."),
        prompt: str = Input(description="Describe the input image."),
        max_frames: int = Input(
            description="Number of frames in the output", default=16, ge=2
        ),
        num_inference_steps: int = Input(
            description="Number of denoising steps", ge=1, le=500, default=50
        ),
        guidance_scale: float = Input(
            description="Scale for classifier-free guidance", ge=1, le=20, default=9
        ),
        seed: int = Input(
            description="Random seed. Leave blank to randomize the seed", default=None
        ),
    ) -> Path:
        """Run a single prediction on the model"""
        image = Image.open(str(image)).convert("RGB")
        if seed is None:
            seed = int.from_bytes(os.urandom(2), "big")
        print(f"Using seed: {seed}")
        setup_seed(seed)

        # [Data] Data Transform
        train_trans = data.Compose(
            [
                data.CenterCropWide(size=self.cfg.resolution),
                data.ToTensor(),
                data.Normalize(mean=self.cfg.mean, std=self.cfg.std),
            ]
        )
        vit_trans = data.Compose(
            [
                data.CenterCropWide(
                    size=(self.cfg.resolution[0], self.cfg.resolution[0])
                ),
                data.Resize(self.cfg.vit_resolution),
                data.ToTensor(),
                data.Normalize(mean=self.cfg.vit_mean, std=self.cfg.vit_std),
            ]
        )

        captions = [prompt]
        with torch.no_grad():
            image_tensor = vit_trans(image)
            image_tensor = image_tensor.unsqueeze(0)
            y_visual, y_text, y_words = self.clip_encoder(
                image=image_tensor, text=captions
            )
            y_visual = y_visual.unsqueeze(1)

        fps_tensor = torch.tensor(
            [self.cfg.target_fps], dtype=torch.long, device=self.cfg.gpu
        )
        image_id_tensor = train_trans([image]).to(self.cfg.gpu)
        local_image = self.autoencoder.encode_firsr_stage(
            image_id_tensor, self.cfg.scale_factor
        ).detach()
        local_image = local_image.unsqueeze(2).repeat_interleave(
            repeats=max_frames, dim=2
        )

        with torch.no_grad():
            pynvml.nvmlInit()
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU Memory used {meminfo.used / (1024 ** 3):.2f} GB")
            # Sample images
            with amp.autocast(enabled=self.cfg.use_fp16):
                noise = torch.randn(
                    [
                        1,
                        4,
                        max_frames,
                        int(self.cfg.resolution[1] / self.cfg.scale),
                        int(self.cfg.resolution[0] / self.cfg.scale),
                    ]
                )
                noise = noise.to(self.cfg.gpu)
                infer_img = (
                    self.black_image_feature if self.cfg.use_zero_infer else None
                )
                model_kwargs = [
                    {
                        "y": y_words,
                        "image": y_visual,
                        "local_image": local_image,
                        "fps": fps_tensor,
                    },
                    {
                        "y": self.zero_y_negative,
                        "image": infer_img,
                        "local_image": local_image,
                        "fps": fps_tensor,
                    },
                ]
                video_data = self.diffusion.ddim_sample_loop(
                    noise=noise,
                    model=self.model.eval(),
                    model_kwargs=model_kwargs,
                    guide_scale=guidance_scale,
                    ddim_timesteps=num_inference_steps,
                    eta=0.0,
                )

                video_data = 1.0 / self.cfg.scale_factor * video_data  # [1, 4, 32, 46]
                video_data = rearrange(video_data, "b c f h w -> (b f) c h w")
                chunk_size = min(self.cfg.decoder_bs, video_data.shape[0])
                video_data_list = torch.chunk(
                    video_data, video_data.shape[0] // chunk_size, dim=0
                )
                decode_data = []
                for vd_data in video_data_list:
                    gen_frames = self.autoencoder.decode(vd_data)
                    decode_data.append(gen_frames)
                video_data = torch.cat(decode_data, dim=0)
                video_data = rearrange(
                    video_data, "(b f) c h w -> b c f h w", b=self.cfg.batch_size
                )

        text_size = cfg.resolution[-1]
        out_path = "/tmp/out.mp4"
        try:
            save_i2vgen_video_safe(
                out_path,
                video_data.cpu(),
                captions,
                self.cfg.mean,
                self.cfg.std,
                text_size,
            )
        except Exception as e:
            print(f"Step: save text or video error with {e}")

        torch.cuda.synchronize()
        dist.barrier()
        return Path(out_path)
easydict==1.10
tokenizers
numpy>=1.19.2
ftfy==6.1.1
transformers==4.38.2
imageio==2.15.0
fairscale==0.4.6
ipdb
open-clip-torch==2.0.2
# xformers==0.0.13
chardet==5.1.0
torchdiffeq==0.2.3
opencv-python
# opencv-python-headless==4.7.0.68
torchsde==0.2.6
simplejson==3.18.4
# motion-vector-extractor==1.0.6
scikit-learn
scikit-image
rotary-embedding-torch==0.2.1
pynvml==11.5.0
# triton==2.0.0.dev20221120
pytorch-lightning
torchmetrics==0.6.0
gradio==3.39.0
imageio-ffmpeg