[docs] Video generation (#6701)

* first draft
* fix path
* fix path
* i2vgen-xl
* review
* modelscopet2v
* feedback

Unverified Commit 3a7e4816 authored Feb 16, 2024 by Steven Liu Committed by GitHub Feb 16, 2024
5 changed files
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -52,6 +52,8 @@
      title: Image-to-image
    - local: using-diffusers/inpaint
      title: Inpainting
+    - local: using-diffusers/text-img2vid
+      title: Text or image-to-video
    - local: using-diffusers/depth2img
      title: Depth-to-image
    title: Tasks
@@ -323,6 +325,8 @@
        title: Text-to-image
      - local: api/pipelines/stable_diffusion/img2img
        title: Image-to-image
+      - local: api/pipelines/stable_diffusion/svd
+        title: Image-to-video
      - local: api/pipelines/stable_diffusion/inpaint
        title: Inpainting
      - local: api/pipelines/stable_diffusion/depth2img

--- a/docs/source/en/api/attnprocessor.md
+++ b/docs/source/en/api/attnprocessor.md
@@ -20,14 +20,14 @@ An attention processor is a class for applying different types of attention mech
 ## AttnProcessor2_0
 [[autodoc]] models.attention_processor.AttnProcessor2_0
-## FusedAttnProcessor2_0
+## AttnAddedKVProcessor
-[[autodoc]] models.attention_processor.FusedAttnProcessor2_0
+[[autodoc]] models.attention_processor.AttnAddedKVProcessor
-## LoRAAttnProcessor
+## AttnAddedKVProcessor2_0
-[[autodoc]] models.attention_processor.LoRAAttnProcessor
+[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0
-## LoRAAttnProcessor2_0
+## CrossFrameAttnProcessor
-[[autodoc]] models.attention_processor.LoRAAttnProcessor2_0
+[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor
 ## CustomDiffusionAttnProcessor
 [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor
@@ -35,26 +35,29 @@ An attention processor is a class for applying different types of attention mech
 ## CustomDiffusionAttnProcessor2_0
 [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0
-## AttnAddedKVProcessor
+## CustomDiffusionXFormersAttnProcessor
-[[autodoc]] models.attention_processor.AttnAddedKVProcessor
+[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor
-## AttnAddedKVProcessor2_0
+## FusedAttnProcessor2_0
-[[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0
+[[autodoc]] models.attention_processor.FusedAttnProcessor2_0
+## LoRAAttnProcessor
+[[autodoc]] models.attention_processor.LoRAAttnProcessor
+## LoRAAttnProcessor2_0
+[[autodoc]] models.attention_processor.LoRAAttnProcessor2_0
 ## LoRAAttnAddedKVProcessor
 [[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor
-## XFormersAttnProcessor
-[[autodoc]] models.attention_processor.XFormersAttnProcessor
 ## LoRAXFormersAttnProcessor
 [[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor
-## CustomDiffusionXFormersAttnProcessor
-[[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor
 ## SlicedAttnProcessor
 [[autodoc]] models.attention_processor.SlicedAttnProcessor
 ## SlicedAttnAddedKVProcessor
 [[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor
+## XFormersAttnProcessor
+[[autodoc]] models.attention_processor.XFormersAttnProcessor
--- a/docs/source/en/api/pipelines/stable_diffusion/svd.md
+++ b/docs/source/en/api/pipelines/stable_diffusion/svd.md
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+# Stable Video Diffusion
+Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach.
+The abstract from the paper is:
+*We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.*
+<Tip>
+To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd) guide.
+<br>
+Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints!
+</Tip>
+## Tips
+Video generation is memory-intensive. One way to reduce memory usage is to call `enable_forward_chunking` on the pipeline's UNet so the feedforward layer processes its input in chunks, in a loop, instead of all at once. Processing in chunks lowers peak memory usage.
+Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
+## StableVideoDiffusionPipeline
+[[autodoc]] StableVideoDiffusionPipeline
+## StableVideoDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput
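
Review note on the Tips text added above: the idea behind forward chunking can be shown with a minimal, library-free sketch. It assumes a toy position-wise layer (`feed_forward`) and a hypothetical helper (`chunked_feed_forward`); neither name comes from Diffusers, and the real UNet implementation may differ in detail. Because the layer treats every row independently, running it over slices in a loop gives the same result as one large call while keeping fewer rows in flight at a time.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def feed_forward(rows: Sequence[Vector]) -> List[Vector]:
    """Toy position-wise layer (illustrative only): each row is transformed independently."""
    return [[2.0 * x + 1.0 for x in row] for row in rows]


def chunked_feed_forward(
    ff: Callable[[Sequence[Vector]], List[Vector]],
    rows: Sequence[Vector],
    chunk_size: int,
) -> List[Vector]:
    """Hypothetical helper: apply `ff` to `rows` in slices of `chunk_size` and concatenate."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[Vector] = []
    for start in range(0, len(rows), chunk_size):
        # Only `chunk_size` rows are passed to the layer per iteration.
        out.extend(ff(rows[start:start + chunk_size]))
    return out


# Seven "frames" with two features each (made-up data).
frames = [[float(i), float(i) + 0.5] for i in range(7)]

full = feed_forward(frames)
chunked = chunked_feed_forward(feed_forward, frames, chunk_size=2)

# Chunking changes how much is processed at once, not the result.
assert chunked == full
```

In the pipeline itself no such helper is needed; as the tip says, chunking is enabled on the pipeline's UNet.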
--- a/docs/source/en/api/pipelines/text_to_video.md
+++ b/docs/source/en/api/pipelines/text_to_video.md
@@ -167,6 +167,12 @@ Here are some sample outputs:
    </tr>
 </table>
+## Tips
+Video generation is memory-intensive. One way to reduce memory usage is to call `enable_forward_chunking` on the pipeline's UNet so the feedforward layer processes its input in chunks, in a loop, instead of all at once. Processing in chunks lowers peak memory usage.
+Check out the [Text or image-to-video](text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage.
 <Tip>
 Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.

--- a/docs/source/en/using-diffusers/text-img2vid.md
+++ b/docs/source/en/using-diffusers/text-img2vid.md