# Text-to-Video Generation with AnimateDiff

## Overview

[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang*, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai.

The abstract of the paper is the following:

With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at this https URL.

## Available Pipelines

| Pipeline | Tasks | Demo |
|---|---|:---:|
| [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* | |

## Usage example

AnimateDiff works with a MotionAdapter checkpoint and a Stable Diffusion model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet.

The following example demonstrates how to use a *MotionAdapter* checkpoint with Diffusers for inference based on Stable Diffusion 1.4/1.5.

```python
import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# Load a SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    steps_offset=1,
)
pipe.scheduler = scheduler

# Enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves, seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
```

Here are some sample outputs:
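The example above loads every component in full precision. If GPU memory is tight, the same checkpoints can also be loaded in half precision. The sketch below is a minimal variant under that assumption: it reuses the same adapter and model as above, loads them with `torch_dtype=torch.float16`, and builds the scheduler from the already-loaded config instead of fetching it again; the short prompt and output filename are purely illustrative.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Load the motion adapter and the SD 1.5 based model in half precision
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
)

# Rebuild the scheduler from the config that is already loaded in the pipeline
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config,
    clip_sample=False,
    timestep_spacing="linspace",
    steps_offset=1,
)

# Same memory savings as in the example above
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

# Illustrative prompt; any text-to-video prompt works here
output = pipe(
    prompt="sunset over the ocean, fishing boats, golden hour, gentle waves",
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
export_to_gif(output.frames[0], "animation_fp16.gif")
```

Half precision roughly halves the weight memory footprint and is usually sufficient for inference quality; keep the full-precision version if you observe artifacts on your hardware.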