<div align="center">
<h1> InstructVideo: Instructing Video Diffusion Models<br>with Human Feedback
</h1>
<div>
<a href='https://jacobyuan7.github.io/' target='_blank'>Hangjie Yuan</a>&emsp;
<a href='https://scholar.google.com/citations?user=ZO3OQ-8AAAAJ&hl=en&oi=ao' target='_blank'>Shiwei Zhang</a>&emsp;
<a href='https://scholar.google.com/citations?user=cQbXvkcAAAAJ&hl=en' target='_blank'>Xiang Wang</a>&emsp;
<a href='https://scholar.google.com/citations?hl=zh-CN&user=grn93WcAAAAJ' target='_blank'>Yujie Wei</a>&emsp;
<a href='https://scholar.google.com/citations?user=JT8hRbgAAAAJ&hl=en' target='_blank'>Tao Feng</a>&emsp;<br>
<!-- Yining Pan&emsp;<br> -->
<a href='https://pynsigrid.github.io/' target='_blank'>Yining Pan</a>&emsp;
<a href='https://scholar.google.com/citations?user=16RDSEUAAAAJ&hl=en' target='_blank'>Yingya Zhang</a>&emsp;
<a href='https://liuziwei7.github.io/' target='_blank'>Ziwei Liu</a>&emsp;
<a href='https://samuelalbanie.com/' target='_blank'>Samuel Albanie</a>&emsp;
<a href='https://scholar.google.com/citations?user=boUZ-jwAAAAJ&hl=en' target='_blank'>Dong Ni</a>&emsp;
</div>
<br>
[![arXiv](https://img.shields.io/badge/arXiv-InstructVideo-<COLOR>.svg)](https://arxiv.org/abs/2312.12490)
[![Project Page](https://img.shields.io/badge/Project_Page-InstructVideo-<COLOR>.svg)](https://instructvideo.github.io/)
[![GitHub Stars](https://img.shields.io/github/stars/damo-vilab/i2vgen-xl?style=social)](https://github.com/damo-vilab/i2vgen-xl)
[![GitHub Forks](https://img.shields.io/github/forks/damo-vilab/i2vgen-xl)](https://github.com/damo-vilab/i2vgen-xl)
[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fdamo-vilab%2Fi2vgen-xl&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com)
</div>
> Abstract:
> Diffusion models have emerged as the de facto paradigm for video generation.
> However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts.
> To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning.
> InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2.
> To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning.
> Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities.
> Code and models will be made publicly available at this repo.
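For intuition, the sketch below illustrates the two reward ingredients described above: frames are scored sparsely within temporal segments (Segmental Video Reward) and the per-segment scores are combined with a decaying weight (Temporally Attenuated Reward). This is a minimal, hypothetical sketch under assumed tensor shapes; `segmental_sparse_reward`, `reward_fn`, `num_segments`, and `gamma` are placeholder names and do not correspond to the code that will be released here.
```python
# Hypothetical sketch only: not the released InstructVideo implementation.
import torch


def segmental_sparse_reward(frames, reward_fn, num_segments=4, gamma=0.9):
    """frames: decoded video of shape [B, F, C, H, W];
    reward_fn: an image reward model (e.g., HPSv2) mapping [B, C, H, W] -> per-sample scores [B]."""
    b, f, c, h, w = frames.shape
    # Segmental sparse sampling: split the temporal axis into segments and score one random frame per segment.
    bounds = torch.linspace(0, f, num_segments + 1).long()
    weighted_scores = []
    for i in range(num_segments):
        lo, hi = bounds[i].item(), bounds[i + 1].item()
        idx = torch.randint(lo, max(hi, lo + 1), (1,)).item()
        score = reward_fn(frames[:, idx])           # [B]
        weighted_scores.append(gamma ** i * score)  # temporally attenuated weight
    # The fine-tuning objective would maximize this weighted reward (i.e., minimize its negative).
    return torch.stack(weighted_scores, dim=0).mean()
```
In reward fine-tuning recast as editing, such a reward would be computed on videos obtained through only a partial DDIM sampling chain rather than full sampling.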
## Todo list
Note: if you cannot access the links provided below, try another browser, contact me by e-mail, or raise an issue.
- [ ] 🕘 Release code for fine-tuning and inference.
- [ ] 🕘 Release pre-training and fine-tuning data list (should be obtained from WebVid10M).
- [ ] 🕘 Release pre-training and fine-tuned checkpoints.
## Configure the Environment
Please refer to the main [README](https://github.com/damo-vilab/i2vgen-xl/blob/main/README.MD) to configure the environment.
## Fine-tuning and Inference
Pre-trained models and details on InstructVideo fine-tuning and inference are coming soon. Stay tuned!
## Citation
```bibtex
@article{2023InstructVideo,
    title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},
    author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},
    journal={arXiv preprint arXiv:2312.12490},
    year={2023}
}
```
# I2VGen-XL
Official repo for [I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://arxiv.org/abs/2311.04145).
Please see [Project Page](https://i2vgen-xl.github.io) for more examples.
![method](../source/i2vgen_fig_02.jpg "method")
I2VGen-XL generates high-quality, realistically animated, and temporally coherent high-definition videos from a single static image and a user-provided text description.
*Our initial version has already been open-sourced on [Modelscope](https://modelscope.cn/models/damo/Image-to-Video/summary). This project focuses on improving that version, especially in terms of motion and semantics.*
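As a minimal sketch of running that ModelScope version (mirroring the Gradio demo included in this repo); the image path and caption below are placeholders:
```python
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Same task, model id, and revision as the Gradio demo in this repo.
image_to_video_pipe = pipeline(
    task="image-to-video", model="damo/i2vgen-xl", model_revision="v1.1.3", device="cuda:0"
)

# Pass a local image path plus an English caption; the pipeline returns the path of the generated video.
output = image_to_video_pipe("example_input.jpg", caption="A red panda walking through the forest")
print(output[OutputKeys.OUTPUT_VIDEO])
```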
## Examples
![figure2](../source/i2vgen_fig_04.png "figure2")
import os
# os.system('pip install "modelscope" --upgrade -f https://pypi.org/project/modelscope/')
# os.system('pip install "gradio==3.39.0"')
import gradio as gr
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
image_to_video_pipe = pipeline(task="image-to-video", model='damo/i2vgen-xl', model_revision='v1.1.3', device='cuda:0')
def upload_file(file):
    return file.name


def image_to_video(image_in, text_in):
    if image_in is None:
        raise gr.Error('请上传图片或等待图片上传完成')
    print(image_in)
    output_video_path = image_to_video_pipe(image_in, caption=text_in)[OutputKeys.OUTPUT_VIDEO]
    print(output_video_path)
    return output_video_path


with gr.Blocks() as demo:
    gr.Markdown(
        """<center><font size=7>I2VGen-XL</center>
        <left><font size=3>I2VGen-XL可以根据用户输入的静态图像和文本生成目标接近、语义相同的视频,生成的视频具高清(1280 * 720)、宽屏(16:9)、时序连贯、质感好等特点。</left>
        <left><font size=3>I2VGen-XL can generate videos with similar contents and semantics based on user input static images and text. The generated videos have characteristics such as high-definition (1280 * 720), widescreen (16:9), coherent timing, and good texture.</left>
        """
    )
    with gr.Box():
        gr.Markdown(
            """<left><font size=3>选择合适的图片进行上传,并补充对视频内容的英文文本描述,然后点击“生成视频”。</left>
            <left><font size=3>Please choose the image to upload (we recommend the image size be 1280 * 720), provide the English text description of the video you wish to create, and then click on "Generate Video" to receive the generated video.</left>"""
        )
        with gr.Row():
            with gr.Column():
                text_in = gr.Textbox(label="文本描述", lines=2, elem_id="text-in")
                image_in = gr.Image(label="图片输入", type="filepath", interactive=False, elem_id="image-in", height=300)
                with gr.Row():
                    upload_image = gr.UploadButton("上传图片", file_types=["image"], file_count="single")
                    image_submit = gr.Button("生成视频🎬")
            with gr.Column():
                video_out_1 = gr.Video(label='生成的视频', elem_id='video-out_1', interactive=False, height=300)
    gr.Markdown("<left><font size=2>注:如果生成的视频无法播放,请尝试升级浏览器或使用chrome浏览器。</left>")
    upload_image.upload(upload_file, upload_image, image_in, queue=False)
    image_submit.click(fn=image_to_video, inputs=[image_in, text_in], outputs=[video_out_1])

demo.queue(status_update_rate=1, api_open=False).launch(share=False, show_error=True, server_name="0.0.0.0")
import os
import sys
import copy
import json
import math
import random
import logging
import itertools
import numpy as np
from utils.config import Config
from utils.registry_class import INFER_ENGINE
from tools import *
if __name__ == '__main__':
    cfg_update = Config(load=True)
    INFER_ENGINE.build(dict(type=cfg_update.TASK_TYPE), cfg_update=cfg_update.cfg_dict)
# Unique model identifier
modelCode=568
# Model name
modelName=i2vgen-xl_pytorch
# Model description
modelDescription=i2vgen-xl can turn a static image into a high-definition animated video
# Application scenarios
appScenario=inference, video generation, media, research, education
# Framework type
frameType=pytorch
# Prediction interface for Cog ⚙️
# https://github.com/replicate/cog/blob/main/docs/python.md
import os
import yaml
import pynvml
from PIL import Image
import torch.distributed as dist
import torch
import torch.cuda.amp as amp
from torch.nn.parallel import DistributedDataParallel
from einops import rearrange
from cog import BasePredictor, Input, Path
from tools.modules.config import cfg
from utils.multi_port import find_free_port
from utils.seed import setup_seed
from utils.video_op import save_i2vgen_video, save_i2vgen_video_safe
from utils.assign_cfg import assign_signle_cfg
from utils.registry_class import MODEL, EMBEDDER, AUTO_ENCODER, DIFFUSION
import utils.transforms as data
class Predictor(BasePredictor):
    def setup(self) -> None:
        """Load the model into memory to make running multiple predictions efficient"""
        with open("configs/i2vgen_xl_infer.yaml", "r") as file:
            config = yaml.safe_load(file)
        self.cfg = assign_signle_cfg(cfg, config, "vldm_cfg")
        for k, v in config.items():
            if isinstance(v, dict) and k in self.cfg:
                self.cfg[k].update(v)
            else:
                self.cfg[k] = v

        if "MASTER_ADDR" not in os.environ:
            os.environ["MASTER_ADDR"] = "localhost"
            os.environ["MASTER_PORT"] = find_free_port()
        self.cfg.gpu = 0
        self.cfg.pmi_rank = int(os.getenv("RANK", 0))
        self.cfg.pmi_world_size = int(os.getenv("WORLD_SIZE", 1))
        self.cfg.gpus_per_machine = torch.cuda.device_count()
        self.cfg.world_size = self.cfg.pmi_world_size * self.cfg.gpus_per_machine
        torch.cuda.set_device(self.cfg.gpu)
        torch.backends.cudnn.benchmark = True
        self.cfg.rank = self.cfg.pmi_rank * self.cfg.gpus_per_machine + self.cfg.gpu
        dist.init_process_group(
            backend="nccl", world_size=self.cfg.world_size, rank=self.cfg.rank
        )

        # [Diffusion]
        self.diffusion = DIFFUSION.build(self.cfg.Diffusion)

        # [Model] embedder
        self.clip_encoder = EMBEDDER.build(self.cfg.embedder)
        self.clip_encoder.model.to(self.cfg.gpu)
        _, _, zero_y_negative = self.clip_encoder(text=self.cfg.negative_prompt)
        self.zero_y_negative = zero_y_negative.detach()
        self.black_image_feature = torch.zeros([1, 1, self.cfg.UNet.y_dim]).cuda()

        # [Model] autoencoder
        self.autoencoder = AUTO_ENCODER.build(self.cfg.auto_encoder)
        self.autoencoder.eval()  # freeze
        for param in self.autoencoder.parameters():
            param.requires_grad = False
        self.autoencoder.cuda()

        # [Model] UNet
        self.model = MODEL.build(self.cfg.UNet)
        checkpoint_dict = torch.load(self.cfg.test_model, map_location="cpu")
        state_dict = checkpoint_dict["state_dict"]
        status = self.model.load_state_dict(state_dict, strict=True)
        print("Load model from {} with status {}".format(self.cfg.test_model, status))
        self.model = self.model.to(self.cfg.gpu)
        self.model.eval()
        self.model = DistributedDataParallel(self.model, device_ids=[self.cfg.gpu])
        torch.cuda.empty_cache()
        print("Models loaded!")

    def predict(
        self,
        image: Path = Input(description="Input image."),
        prompt: str = Input(description="Describe the input image."),
        max_frames: int = Input(
            description="Number of frames in the output", default=16, ge=2
        ),
        num_inference_steps: int = Input(
            description="Number of denoising steps", ge=1, le=500, default=50
        ),
        guidance_scale: float = Input(
            description="Scale for classifier-free guidance", ge=1, le=20, default=9
        ),
        seed: int = Input(
            description="Random seed. Leave blank to randomize the seed", default=None
        ),
    ) -> Path:
        """Run a single prediction on the model"""
        image = Image.open(str(image)).convert("RGB")
        if seed is None:
            seed = int.from_bytes(os.urandom(2), "big")
        print(f"Using seed: {seed}")
        setup_seed(seed)

        # [Data] Data Transform
        train_trans = data.Compose(
            [
                data.CenterCropWide(size=self.cfg.resolution),
                data.ToTensor(),
                data.Normalize(mean=self.cfg.mean, std=self.cfg.std),
            ]
        )
        vit_trans = data.Compose(
            [
                data.CenterCropWide(
                    size=(self.cfg.resolution[0], self.cfg.resolution[0])
                ),
                data.Resize(self.cfg.vit_resolution),
                data.ToTensor(),
                data.Normalize(mean=self.cfg.vit_mean, std=self.cfg.vit_std),
            ]
        )

        captions = [prompt]
        with torch.no_grad():
            image_tensor = vit_trans(image)
            image_tensor = image_tensor.unsqueeze(0)
            y_visual, y_text, y_words = self.clip_encoder(
                image=image_tensor, text=captions
            )
            y_visual = y_visual.unsqueeze(1)

        fps_tensor = torch.tensor(
            [self.cfg.target_fps], dtype=torch.long, device=self.cfg.gpu
        )
        image_id_tensor = train_trans([image]).to(self.cfg.gpu)
        local_image = self.autoencoder.encode_firsr_stage(
            image_id_tensor, self.cfg.scale_factor
        ).detach()
        local_image = local_image.unsqueeze(2).repeat_interleave(
            repeats=max_frames, dim=2
        )

        with torch.no_grad():
            pynvml.nvmlInit()
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU Memory used {meminfo.used / (1024 ** 3):.2f} GB")
            # Sample images
            with amp.autocast(enabled=self.cfg.use_fp16):
                noise = torch.randn(
                    [
                        1,
                        4,
                        max_frames,
                        int(self.cfg.resolution[1] / self.cfg.scale),
                        int(self.cfg.resolution[0] / self.cfg.scale),
                    ]
                )
                noise = noise.to(self.cfg.gpu)
                infer_img = (
                    self.black_image_feature if self.cfg.use_zero_infer else None
                )
                model_kwargs = [
                    {
                        "y": y_words,
                        "image": y_visual,
                        "local_image": local_image,
                        "fps": fps_tensor,
                    },
                    {
                        "y": self.zero_y_negative,
                        "image": infer_img,
                        "local_image": local_image,
                        "fps": fps_tensor,
                    },
                ]
                video_data = self.diffusion.ddim_sample_loop(
                    noise=noise,
                    model=self.model.eval(),
                    model_kwargs=model_kwargs,
                    guide_scale=guidance_scale,
                    ddim_timesteps=num_inference_steps,
                    eta=0.0,
                )

                video_data = 1.0 / self.cfg.scale_factor * video_data  # [1, 4, 32, 46]
                video_data = rearrange(video_data, "b c f h w -> (b f) c h w")
                chunk_size = min(self.cfg.decoder_bs, video_data.shape[0])
                video_data_list = torch.chunk(
                    video_data, video_data.shape[0] // chunk_size, dim=0
                )
                decode_data = []
                for vd_data in video_data_list:
                    gen_frames = self.autoencoder.decode(vd_data)
                    decode_data.append(gen_frames)
                video_data = torch.cat(decode_data, dim=0)
                video_data = rearrange(
                    video_data, "(b f) c h w -> b c f h w", b=self.cfg.batch_size
                )

        text_size = cfg.resolution[-1]
        out_path = "/tmp/out.mp4"
        try:
            save_i2vgen_video_safe(
                out_path,
                video_data.cpu(),
                captions,
                self.cfg.mean,
                self.cfg.std,
                text_size,
            )
        except Exception as e:
            print(f"Step: save text or video error with {e}")

        torch.cuda.synchronize()
        dist.barrier()
        return Path(out_path)
easydict==1.10
tokenizers
numpy>=1.19.2
ftfy==6.1.1
transformers==4.38.2
imageio==2.15.0
fairscale==0.4.6
ipdb
open-clip-torch==2.0.2
# xformers==0.0.13
chardet==5.1.0
torchdiffeq==0.2.3
opencv-python
# opencv-python-headless==4.7.0.68
torchsde==0.2.6
simplejson==3.18.4
# motion-vector-extractor==1.0.6
scikit-learn
scikit-image
rotary-embedding-torch==0.2.1
pynvml==11.5.0
# triton==2.0.0.dev20221120
pytorch-lightning
torchmetrics==0.6.0
gradio==3.39.0
imageio-ffmpeg