# Text-to-video synthesis

Text-to-video synthesis from [ModelScope](https://modelscope.cn/) shares the same structure as Stable Diffusion, but extends it from static images to videos. In other words, the system generates a video from a natural language text prompt.

From the [model summary](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis):

*This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description. Only English input is supported.*

Resources:

* [Website](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)
* [GitHub repository](https://github.com/modelscope/modelscope/)
* [Spaces] (TODO)

## Available Pipelines:

| Pipeline | Tasks | Demo |
|---|---|:---:|
| [DiffusionPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [Spaces] (TODO) |

## Usage example

Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to("cuda")

prompt = "Spiderman is surfing"
video_frames = pipe(prompt).frames
video_path = export_to_video(video_frames)
video_path
```

Diffusers supports different optimization techniques to improve the latency and memory footprint of a pipeline. Since videos are often more memory-intensive than images, we can enable CPU offloading and VAE slicing to keep the memory footprint at bay.

Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.enable_model_cpu_offload()

# memory optimization
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=64).frames
video_path = export_to_video(video_frames)
video_path
```

It takes just **7 GB of GPU memory** to generate the 64 video frames using PyTorch 2.0, `fp16` precision, and the techniques mentioned above.

We can also use a different scheduler easily, using the same method we'd use for Stable Diffusion:

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)
video_path
```
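As with other Diffusers pipelines, results can be made reproducible by passing a seeded `torch.Generator` to the pipeline call. The snippet below is a minimal sketch building on the scheduler example above; the seed value is arbitrary.

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# A fixed seed makes repeated runs with the same prompt produce the same frames.
generator = torch.Generator(device="cuda").manual_seed(42)

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25, generator=generator).frames
video_path = export_to_video(video_frames)
video_path
```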
Here are some sample outputs:

* An astronaut riding a horse.
* Darth Vader surfing in waves.
## Available checkpoints

* [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/)
* [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy)

## DiffusionPipeline
[[autodoc]] DiffusionPipeline
	- all
	- __call__
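Either checkpoint can be loaded with the same `from_pretrained` call used throughout this page. The sketch below loads the legacy checkpoint; it is an illustrative example only, and it loads the default weights in half precision since the docs above do not state whether an `fp16` variant exists for the legacy repository.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Only the repository id changes compared to the earlier examples;
# torch_dtype=torch.float16 casts the weights to half precision on load.
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b-legacy", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt).frames
video_path = export_to_video(video_frames)
video_path
```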