InstructVideo: Instructing Video Diffusion Models
with Human Feedback

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng,
Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

[![arXiv](https://img.shields.io/badge/arXiv-InstructVideo-.svg)](https://arxiv.org/abs/2312.12490) [![Project Page](https://img.shields.io/badge/Project_Page-InstructVideo-.svg)](https://instructvideo.github.io/) [![GitHub Stars](https://img.shields.io/github/stars/damo-vilab/i2vgen-xl?style=social)](https://github.com/damo-vilab/i2vgen-xl) [![GitHub Forks](https://img.shields.io/github/forks/damo-vilab/i2vgen-xl)](https://github.com/damo-vilab/i2vgen-xl) [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fdamo-vilab%2Fi2vgen-xl&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com)
> **Abstract:** Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo, which instructs text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available at this repo.

## Todo list

If you cannot access the links below, try another browser, contact me by e-mail, or raise an issue.

- [ ] 🕘 Release code for fine-tuning and inference.
- [ ] 🕘 Release the pre-training and fine-tuning data lists (to be obtained from WebVid10M).
- [ ] 🕘 Release pre-trained and fine-tuned checkpoints.

## Configure the Environment

Please refer to the main [README](https://github.com/damo-vilab/i2vgen-xl/blob/main/README.MD) to configure the environment.

## Fine-tuning and Inference

Pre-trained models and details on InstructVideo fine-tuning and inference are coming soon. Stay tuned! In the meantime, an unofficial sketch of the reward computation described in the abstract is provided after the citation below.

## Citation

```bibtex
@article{2023InstructVideo,
  title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},
  author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},
  journal={arXiv preprint arXiv:2312.12490},
  year={2023}
}
```
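## Reward Computation Sketch (Unofficial)

Until the official code is released, the snippet below is a minimal sketch of the idea described in the abstract: an image reward model (e.g., an HPSv2-style scorer) is applied to sparsely sampled frames from each temporal segment (Segmental Video Reward), and the per-frame scores are combined with decaying weights (Temporally Attenuated Reward). All names and parameters here (`image_reward_model`, `num_segments`, `attenuation`, the geometric weighting) are illustrative assumptions, not the released implementation.

```python
import torch


def segmental_video_reward(frames, image_reward_model, num_segments=4, attenuation=0.9):
    """Sketch: score a video with an image reward model.

    frames: tensor of shape (T, C, H, W) holding decoded video frames.
    image_reward_model: callable mapping a batch of frames to per-frame scores
        (assumed interface, not the official HPSv2 API).
    num_segments: number of temporal segments to sample sparsely from.
    attenuation: factor < 1 that down-weights later segments, mimicking the
        Temporally Attenuated Reward described in the abstract.
    """
    T = frames.shape[0]

    # Segmental sparse sampling: pick one representative frame per segment.
    boundaries = torch.linspace(0, T, num_segments + 1).long()
    idx = torch.stack(
        [(boundaries[i] + boundaries[i + 1]) // 2 for i in range(num_segments)]
    )
    sampled = frames[idx]                      # (num_segments, C, H, W)

    scores = image_reward_model(sampled)       # (num_segments,) per-frame rewards

    # Temporal attenuation: geometrically decaying weights over segments.
    weights = attenuation ** torch.arange(num_segments, dtype=scores.dtype)
    weights = weights / weights.sum()
    return (weights * scores).sum()            # scalar reward for the whole video
```

This scalar reward would then be maximized during reward fine-tuning; in the paper's formulation, fine-tuning is recast as editing, so only a partial DDIM sampling chain is run before the reward is evaluated.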