Unverified Commit 3a7e4816 authored by Steven Liu's avatar Steven Liu Committed by GitHub
Browse files

[docs] Video generation (#6701)

* first draft

* fix path

* fix path

* i2vgen-xl

* review

* modelscopet2v

* feedback
parent d649d6c6
...@@ -52,6 +52,8 @@ ...@@ -52,6 +52,8 @@
title: Image-to-image title: Image-to-image
- local: using-diffusers/inpaint - local: using-diffusers/inpaint
title: Inpainting title: Inpainting
- local: using-diffusers/text-img2vid
title: Text or image-to-video
- local: using-diffusers/depth2img - local: using-diffusers/depth2img
title: Depth-to-image title: Depth-to-image
title: Tasks title: Tasks
...@@ -323,6 +325,8 @@ ...@@ -323,6 +325,8 @@
title: Text-to-image title: Text-to-image
- local: api/pipelines/stable_diffusion/img2img - local: api/pipelines/stable_diffusion/img2img
title: Image-to-image title: Image-to-image
- local: api/pipelines/stable_diffusion/svd
title: Image-to-video
- local: api/pipelines/stable_diffusion/inpaint - local: api/pipelines/stable_diffusion/inpaint
title: Inpainting title: Inpainting
- local: api/pipelines/stable_diffusion/depth2img - local: api/pipelines/stable_diffusion/depth2img
......
...@@ -20,14 +20,14 @@ An attention processor is a class for applying different types of attention mech ...@@ -20,14 +20,14 @@ An attention processor is a class for applying different types of attention mech
## AttnProcessor2_0 ## AttnProcessor2_0
[[autodoc]] models.attention_processor.AttnProcessor2_0 [[autodoc]] models.attention_processor.AttnProcessor2_0
## FusedAttnProcessor2_0 ## AttnAddedKVProcessor
[[autodoc]] models.attention_processor.FusedAttnProcessor2_0 [[autodoc]] models.attention_processor.AttnAddedKVProcessor
## LoRAAttnProcessor ## AttnAddedKVProcessor2_0
[[autodoc]] models.attention_processor.LoRAAttnProcessor [[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0
## LoRAAttnProcessor2_0 ## CrossFrameAttnProcessor
[[autodoc]] models.attention_processor.LoRAAttnProcessor2_0 [[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor
## CustomDiffusionAttnProcessor ## CustomDiffusionAttnProcessor
[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor
...@@ -35,26 +35,29 @@ An attention processor is a class for applying different types of attention mech ...@@ -35,26 +35,29 @@ An attention processor is a class for applying different types of attention mech
## CustomDiffusionAttnProcessor2_0 ## CustomDiffusionAttnProcessor2_0
[[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0 [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0
## AttnAddedKVProcessor ## CustomDiffusionXFormersAttnProcessor
[[autodoc]] models.attention_processor.AttnAddedKVProcessor [[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor
## AttnAddedKVProcessor2_0 ## FusedAttnProcessor2_0
[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0 [[autodoc]] models.attention_processor.FusedAttnProcessor2_0
## LoRAAttnProcessor
[[autodoc]] models.attention_processor.LoRAAttnProcessor
## LoRAAttnProcessor2_0
[[autodoc]] models.attention_processor.LoRAAttnProcessor2_0
## LoRAAttnAddedKVProcessor ## LoRAAttnAddedKVProcessor
[[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor [[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor
## XFormersAttnProcessor
[[autodoc]] models.attention_processor.XFormersAttnProcessor
## LoRAXFormersAttnProcessor ## LoRAXFormersAttnProcessor
[[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor [[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor
## CustomDiffusionXFormersAttnProcessor
[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor
## SlicedAttnProcessor ## SlicedAttnProcessor
[[autodoc]] models.attention_processor.SlicedAttnProcessor [[autodoc]] models.attention_processor.SlicedAttnProcessor
## SlicedAttnAddedKVProcessor ## SlicedAttnAddedKVProcessor
[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor [[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
## XFormersAttnProcessor
[[autodoc]] models.attention_processor.XFormersAttnProcessor
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Video Diffusion
Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach.
The abstract from the paper is:
*We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.*
<Tip>
To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd) guide.
<br>
Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints!
</Tip>
## Tips
Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.
Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
## StableVideoDiffusionPipeline
[[autodoc]] StableVideoDiffusionPipeline
## StableVideoDiffusionPipelineOutput
[[autodoc]] pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput
...@@ -167,6 +167,12 @@ Here are some sample outputs: ...@@ -167,6 +167,12 @@ Here are some sample outputs:
</tr> </tr>
</table> </table>
## Tips
Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient.
Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
<Tip> <Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
......
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment