Unverified Commit a69754bb authored by Steven Liu, committed by GitHub

[docs] Clean up pipeline apis (#3905)

* start with stable diffusion

* fix

* finish stable diffusion pipelines

* fix path to pipeline output

* fix flax paths

* fix copies

* add up to score sde ve

* finish first pass of pipelines

* fix copies

* second review

* align doc titles

* more review fixes

* final review
parent bcc570b9
...@@ -12,27 +12,19 @@ specific language governing permissions and limitations under the License.
# Stable unCLIP
Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion/stable_diffusion_2) checkpoints to condition on CLIP image embeddings.
Stable unCLIP still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used
for text-guided image variation. When combined with an unCLIP prior, it can also be used for full text-to-image generation.
The abstract from [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) is:
*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
## Tips
Stable unCLIP takes `noise_level` as input during inference, which determines how much noise is added
to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default,
we do not add any additional noise to the image embeddings (`noise_level = 0`).
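For concreteness, here is a minimal image-variation sketch that passes `noise_level` to [`StableUnCLIPImg2ImgPipeline`]; the checkpoint, input image, and the value `500` are only illustrative choices:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
)

# A higher `noise_level` adds more noise to the CLIP image embedding and increases variation.
image = pipe(init_image, noise_level=500).images[0]
image.save("noisy_variation.png")
```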
### Available checkpoints:
* Image variation
* [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip)
* [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
* Text-to-image
* [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
### Text-to-Image Generation
Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha)
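The sketch below compresses that pipelining: the Karlo prior predicts a CLIP image embedding from the prompt, and the stable unCLIP decoder turns it into an image. Component names (`prior`, `prior_text_encoder`, `prior_tokenizer`, `prior_scheduler`) and checkpoints follow the documented example but should be verified against the full usage section:

```python
import torch
from diffusers import DDPMScheduler, StableUnCLIPPipeline, UnCLIPScheduler
from diffusers.models import PriorTransformer
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

# Karlo prior: maps a text prompt to a CLIP image embedding.
prior_model_id = "kakaobrain/karlo-v1-alpha"
prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=torch.float16)
prior_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prior_text_model = CLIPTextModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14", torch_dtype=torch.float16
)
prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler")
prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config)

# Stable unCLIP decoder conditioned on the predicted image embedding.
pipe = StableUnCLIPPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip-small",
    torch_dtype=torch.float16,
    prior_tokenizer=prior_tokenizer,
    prior_text_encoder=prior_text_model,
    prior=prior,
    prior_scheduler=prior_scheduler,
).to("cuda")

image = pipe("A fantasy landscape, trending on artstation").images[0]
image.save("fantasy_landscape.png")
```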
...@@ -104,51 +96,7 @@ prompt = "A fantasy landscape, trending on artstation"
images = pipe(init_image, prompt=prompt).images
images[0].save("variation_image_two.png")
```
## StableUnCLIPPipeline
### Memory optimization
If you are short on GPU memory, you can enable smart CPU offloading so that models that are not needed
immediately for a computation can be offloaded to CPU:
```python
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image
import torch
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variation="fp16"
)
# Offload to CPU.
pipe.enable_model_cpu_offload()
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)
images = pipe(init_image).images
images[0]
```
Further memory optimizations are possible by enabling VAE slicing on the pipeline:
```python
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image
import torch
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variation="fp16"
)
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)
images = pipe(init_image).images
images[0]
```
### StableUnCLIPPipeline
[[autodoc]] StableUnCLIPPipeline
- all
...@@ -161,7 +109,7 @@ images[0]
- disable_xformers_memory_efficient_attention
## StableUnCLIPImg2ImgPipeline
[[autodoc]] StableUnCLIPImg2ImgPipeline
- all
...@@ -172,4 +120,6 @@ images[0]
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput
\ No newline at end of file
...@@ -12,25 +12,22 @@ specific language governing permissions and limitations under the License.
# Stochastic Karras VE
[Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) is by Tero Karras, Miika Aittala, Timo Aila and Samuli Laine. This pipeline implements the stochastic sampling tailored to variance expanding (VE) models.
The abstract from the paper is:
*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.*
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
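As a rough usage sketch, the pipeline can be assembled from an unconditional [`UNet2DModel`] and a [`KarrasVeScheduler`]; the `google/ncsnpp-celebahq-256` UNet below is an assumption and any compatible VE checkpoint can be substituted:

```python
import torch
from diffusers import KarrasVePipeline, KarrasVeScheduler, UNet2DModel

# Unconditional UNet trained with a variance-expanding objective (assumed checkpoint).
unet = UNet2DModel.from_pretrained("google/ncsnpp-celebahq-256", subfolder="unet")
scheduler = KarrasVeScheduler()

pipe = KarrasVePipeline(unet=unet, scheduler=scheduler).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(num_inference_steps=50, generator=generator).images[0]
image.save("karras_ve_sample.png")
```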
## KarrasVePipeline
[[autodoc]] KarrasVePipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput
\ No newline at end of file
...@@ -12,32 +12,19 @@ specific language governing permissions and limitations under the License.
<Tip warning={true}>
🧪 This pipeline is for research purposes only.
</Tip>
# Text-to-video
[VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation](https://huggingface.co/papers/2303.08320) is by Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan.
The abstract from the paper is:
*A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distribution. Despite its recent success in image synthesis, applying DPMs to video generation is still challenging due to high-dimensional data spaces. Previous methods usually adopt a standard diffusion process, where frames in the same video clip are destroyed with independent noises, ignoring the content redundancy and temporal correlation. This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly-learned networks to match the noise decomposition accordingly. Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and well-support text-conditioned video creation.*
You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense).
## Usage example
...@@ -179,35 +166,6 @@ Here are some sample outputs:
</tr>
</table>
### Memory optimizations
Text-guided video generation with [`~TextToVideoSDPipeline`] and [`~VideoToVideoSDPipeline`] is very memory intensive, both
when denoising with [`~UNet3DConditionModel`] and when decoding with [`~AutoencoderKL`]. It is possible to reduce
memory usage at the cost of increased runtime and still get exactly the same result. To do so, it is recommended to enable
**forward chunking** and **VAE slicing**:
Forward chunking via [`~UNet3DConditionModel.enable_forward_chunking`] is explained in [this blog post](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers) and
significantly reduces the memory required by the UNet. You can chunk the feed-forward layer over the `num_frames`
dimension by doing:
```py
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
```
VAE slicing via [`~TextToVideoSDPipeline.enable_vae_slicing`] and [`~VideoToVideoSDPipeline.enable_vae_slicing`] also
gives significant memory savings, since both pipelines otherwise decode all image frames at once.
```py
pipe.enable_vae_slicing()
```
## Available checkpoints
* [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/)
* [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy)
* [cerspense/zeroscope_v2_576w](https://huggingface.co/cerspense/zeroscope_v2_576w)
* [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL)
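Putting the pieces together, the sketch below loads one of the checkpoints above and applies both memory optimizations; the prompt, frame count, and file handling are arbitrary:

```python
import torch
from diffusers import TextToVideoSDPipeline
from diffusers.utils import export_to_video

pipe = TextToVideoSDPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

# Memory optimizations described above.
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames

video_path = export_to_video(video_frames)
print(video_path)
```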
## TextToVideoSDPipeline
[[autodoc]] TextToVideoSDPipeline
- all
...@@ -217,3 +175,6 @@ pipe.enable_vae_slicing()
[[autodoc]] VideoToVideoSDPipeline
- all
- __call__
## TextToVideoSDPipelineOutput
[[autodoc]] pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput
\ No newline at end of file
...@@ -10,49 +10,32 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Text2Video-Zero
[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by
Levon Khachatryan,
Andranik Movsisyan,
Vahram Tadevosyan,
Roberto Henschel,
[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).
Text2Video-Zero enables zero-shot video generation using either:
1. A textual prompt
2. A prompt combined with guidance from poses or edges
3. Video Instruct-Pix2Pix (instruction-guided video editing)
Results are temporally consistent and closely follow the guidance and textual prompts.
![teaser-img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/t2v_zero_teaser.png)
The abstract from the paper is:
*Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain.
Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object.
Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.
As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.*
You can find additional information about Text-to-Video Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero).
## Usage example
...@@ -268,8 +251,10 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below
You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth).
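For orientation, here is a bare-bones text-to-video sketch with [`TextToVideoZeroPipeline`]; the base checkpoint, prompt, and use of `imageio` are illustrative, and a DreamBooth-finetuned model ID can be swapped in for the base model:

```python
import imageio
import torch
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # or a DreamBooth-trained checkpoint
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images

# Frames are returned as float arrays in [0, 1]; convert them before writing the video.
frames = [(frame * 255).astype("uint8") for frame in result]
imageio.mimsave("video.mp4", frames, fps=4)
```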
## TextToVideoZeroPipeline
[[autodoc]] TextToVideoZeroPipeline
- all
- __call__
## TextToVideoPipelineOutput
[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
\ No newline at end of file
...@@ -7,31 +7,31 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# UnCLIP
[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The UnCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo).
The abstract from the paper is:
*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
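A minimal text-to-image sketch with [`UnCLIPPipeline`] and the Karlo checkpoint (the prompt and dtype are arbitrary choices):

```python
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16).to("cuda")

prompt = "a high-resolution photograph of a big red frog on a green leaf"
image = pipe(prompt).images[0]
image.save("frog.png")
```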
## UnCLIPPipeline
[[autodoc]] UnCLIPPipeline
- all
- __call__
## UnCLIPImageVariationPipeline
[[autodoc]] UnCLIPImageVariationPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput
\ No newline at end of file
...@@ -12,32 +12,19 @@ specific language governing permissions and limitations under the License.
# UniDiffuser
The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.
The abstract from the [paper](https://arxiv.org/abs/2303.06555) is:
*This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).*
You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml).
This pipeline was contributed by [dg845](https://github.com/dg845). ❤️
Available Checkpoints are:
- *UniDiffuser-v0 (512x512 resolution)* [thu-ml/unidiffuser-v0](https://huggingface.co/thu-ml/unidiffuser-v0)
- *UniDiffuser-v1 (512x512 resolution)* [thu-ml/unidiffuser-v1](https://huggingface.co/thu-ml/unidiffuser-v1)
## Available Pipelines:
| Pipeline | Tasks | Demo | Colab |
|:---:|:---:|:---:|:---:|
| [UniDiffuserPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_unidiffuser.py) | *Joint Image-Text Gen*, *Text-to-Image*, *Image-to-Text*,<br> *Image Gen*, *Text Gen*, *Image Variation*, *Text Variation* | [🤗 Spaces](https://huggingface.co/spaces/thu-ml/unidiffuser) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/unidiffuser.ipynb) |
## Usage Examples
Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks:
### Unconditional Image and Text Generation
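For instance, a minimal sketch of unconditional joint (image, text) generation; the checkpoint, step count, and guidance scale are illustrative, and the joint mode is inferred from the absence of `prompt` and `image` inputs:

```python
import torch
from diffusers import UniDiffuserPipeline

pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1", torch_dtype=torch.float16).to("cuda")

# With no prompt or image supplied, the pipeline samples an (image, text) pair jointly.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample.png")
print(text)
```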
...@@ -202,3 +189,6 @@ print(final_prompt)
[[autodoc]] UniDiffuserPipeline
- all
- __call__
## ImageTextPipelineOutput
[[autodoc]] pipelines.ImageTextPipelineOutput
\ No newline at end of file
...@@ -10,46 +10,30 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Versatile Diffusion
Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi.
The abstract from the paper is:
*The recent advances in diffusion models have set an impressive milestone in many generation tasks. Trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest in academia and industry. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-flow network, dubbed Versatile Diffusion (VD), that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model. Moreover, we generalize VD to a unified multi-flow multimodal diffusion framework with grouped layers, swappable streams, and other propositions that can process modalities beyond images and text. Through our experiments, we demonstrate that VD and its underlying framework have the following merits: a) VD handles all subtasks with competitive quality; b) VD initiates novel extensions and applications such as disentanglement of style and semantic, image-text dual-guided generation, etc.; c) Through these experiments and applications, VD provides more semantic insights of the generated outputs.*
## Tips
You can load the more memory intensive "all-in-one" [`VersatileDiffusionPipeline`] that supports all the tasks, or use the individual pipelines, which are more memory efficient:
| **Pipeline** | **Supported tasks** |
|------------------------------------------------------|-----------------------------------|
| [`VersatileDiffusionPipeline`] | all of the below |
| [`VersatileDiffusionTextToImagePipeline`] | text-to-image |
| [`VersatileDiffusionImageVariationPipeline`] | image variation |
| [`VersatileDiffusionDualGuidedPipeline`] | image-text dual guided generation |
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
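As a quick illustration of one of the lighter individual pipelines, here is a text-to-image sketch; the checkpoint and prompt are examples, and `remove_unused_weights` drops components not needed for this task:

```python
import torch
from diffusers import VersatileDiffusionTextToImagePipeline

pipe = VersatileDiffusionTextToImagePipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
)
pipe.remove_unused_weights()
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe("an astronaut riding a horse on mars", generator=generator).images[0]
image.save("astronaut.png")
```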
### How to load and use different schedulers
The Versatile Diffusion pipeline uses the [`DDIMScheduler`] by default, but `diffusers` provides many other schedulers that can be used instead, such as [`PNDMScheduler`], [`LMSDiscreteScheduler`], [`EulerDiscreteScheduler`], [`EulerAncestralDiscreteScheduler`], etc.
To use a different scheduler, you can either change it via the [`ConfigMixin.from_config`] method or pass the `scheduler` argument to the `from_pretrained` method of the pipeline. For example, to use the [`EulerDiscreteScheduler`], you can do the following:
```python
>>> from diffusers import VersatileDiffusionPipeline, EulerDiscreteScheduler
>>> pipeline = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion")
>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
>>> # or
>>> euler_scheduler = EulerDiscreteScheduler.from_pretrained("shi-labs/versatile-diffusion", subfolder="scheduler")
>>> pipeline = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", scheduler=euler_scheduler)
```
## VersatileDiffusionPipeline
[[autodoc]] VersatileDiffusionPipeline
...
...@@ -10,26 +10,26 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# VQ Diffusion
[Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://huggingface.co/papers/2111.14822) is by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo.
The abstract from the paper is:
*We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.*
The original codebase can be found at [microsoft/VQ-Diffusion](https://github.com/microsoft/VQ-Diffusion).
<Tip>
Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## VQDiffusionPipeline
[[autodoc]] VQDiffusionPipeline
- all
- __call__
## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput
...@@ -816,7 +816,8 @@ class LoraLoaderMixin:
def load_lora_weights(self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], **kwargs):
"""
Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into `self.unet` and
`self.text_encoder`.
All kwargs are forwarded to `self.lora_state_dict`.
...@@ -831,8 +832,7 @@ class LoraLoaderMixin:
Parameters:
pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`):
See [`~loaders.LoraLoaderMixin.lora_state_dict`].
kwargs (`dict`, *optional*):
See [`~loaders.LoraLoaderMixin.lora_state_dict`].
"""
state_dict, network_alpha = self.lora_state_dict(pretrained_model_name_or_path_or_dict, **kwargs)
...@@ -1171,10 +1171,10 @@ class LoraLoaderMixin:
save_directory (`str` or `os.PathLike`):
Directory to save LoRA parameters to. Will be created if it doesn't exist.
unet_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`):
State dict of the LoRA layers corresponding to the `unet`.
text_encoder_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`):
State dict of the LoRA layers corresponding to the `text_encoder`. Must explicitly pass the text
encoder LoRA state dict because it comes from 🤗 Transformers.
is_main_process (`bool`, *optional*, defaults to `True`):
Whether the process calling this is the main process or not. Useful during distributed training and you
need to call this function on all processes. In this case, set `is_main_process=True` only on the main
...@@ -1353,7 +1353,7 @@ class FromSingleFileMixin:
A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
local_files_only (`bool`, *optional*, defaults to `False`):
Whether to only load local model weights and configuration files or not. If set to `True`, the model
won't be downloaded from the Hub.
use_auth_token (`str` or *bool*, *optional*):
The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
...@@ -1367,7 +1367,7 @@ class FromSingleFileMixin:
weights. If set to `False`, safetensors weights are not loaded.
extract_ema (`bool`, *optional*, defaults to `False`):
Whether to extract the EMA weights or not. Pass `True` to extract the EMA weights which usually yield
higher quality images for inference. Non-EMA weights are usually better for continuing finetuning.
upcast_attention (`bool`, *optional*, defaults to `None`):
Whether the attention computation should always be upcasted.
image_size (`int`, *optional*, defaults to 512):
...@@ -1377,23 +1377,19 @@ class FromSingleFileMixin:
The prediction type the model was trained on. Use `'epsilon'` for all Stable Diffusion v1 models and
the Stable Diffusion v2 base model. Use `'v_prediction'` for Stable Diffusion v2.
num_in_channels (`int`, *optional*, defaults to `None`):
The number of input channels. If `None`, it is automatically inferred.
scheduler_type (`str`, *optional*, defaults to `"pndm"`):
Type of scheduler to use. Should be one of `["pndm", "lms", "heun", "euler", "euler-ancestral", "dpm",
"ddim"]`.
load_safety_checker (`bool`, *optional*, defaults to `True`):
Whether to load the safety checker or not.
text_encoder ([`~transformers.CLIPTextModel`], *optional*, defaults to `None`):
An instance of `CLIPTextModel` to use, specifically the
[clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. If this
parameter is `None`, the function loads a new instance of `CLIPTextModel` by itself if needed.
tokenizer ([`~transformers.CLIPTokenizer`], *optional*, defaults to `None`):
An instance of `CLIPTokenizer` to use. If this parameter is `None`, the function loads a new instance
of `CLIPTokenizer` by itself if needed.
kwargs (remaining dictionary of keyword arguments, *optional*):
Can be used to overwrite load and saveable variables (for example the pipeline components of the
specific pipeline class). The overwritten components are directly passed to the pipelines `__init__`
...
...@@ -16,11 +16,11 @@ class AltDiffusionPipelineOutput(BaseOutput):
Args:
images (`List[PIL.Image.Image]` or `np.ndarray`)
List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
num_channels)`.
nsfw_content_detected (`List[bool]`)
List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or
`None` if safety checking could not be performed.
"""
images: Union[List[PIL.Image.Image], np.ndarray]
...
...@@ -71,36 +71,33 @@ class AltDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
r"""
Pipeline for text-to-image generation using Alt Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
The pipeline also inherits the following loading methods:
- [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
- [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
- [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files
Args:
vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`~transformers.RobertaSeriesModelWithTransformation`]):
Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
tokenizer ([`~transformers.XLMRobertaTokenizer`]):
A `XLMRobertaTokenizer` to tokenize text.
unet ([`UNet2DConditionModel`]):
A `UNet2DConditionModel` to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful.
Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
about a model's potential harms.
feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
...@@ -196,42 +193,39 @@ class AltDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL ...@@ -196,42 +193,39 @@ class AltDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
def enable_vae_tiling(self): def enable_vae_tiling(self):
r""" r"""
Enable tiled VAE decoding. Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in processing larger images.
several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
""" """
self.vae.enable_tiling() self.vae.enable_tiling()
def disable_vae_tiling(self): def disable_vae_tiling(self):
r""" r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_tiling() self.vae.disable_tiling()
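To make the memory trade-off concrete, a small usage sketch (checkpoint id assumed as above; any pipeline exposing a VAE behaves the same way):

```py
import torch
from diffusers import AltDiffusionPipeline

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16).to("cuda")

# Decode the latent batch slice by slice to lower peak memory during decoding.
pipe.enable_vae_slicing()
# Additionally decode/encode in spatial tiles, which helps with very large images.
pipe.enable_vae_tiling()

images = pipe("a watercolor painting of a lighthouse", num_images_per_prompt=4).images

# Both switches can be reverted once memory pressure is gone.
pipe.disable_vae_slicing()
pipe.disable_vae_tiling()
```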
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
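A short sketch of the offloading behavior described above (requires `accelerate` >= 0.17.0; checkpoint id assumed):

```py
import torch
from diffusers import AltDiffusionPipeline

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16)

# Do not call pipe.to("cuda"); each sub-model is moved to the GPU only while its
# forward pass runs and is returned to the CPU afterwards.
pipe.enable_model_cpu_offload()

image = pipe("a photo of an astronaut riding a horse").images[0]
```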
...@@ -542,78 +536,69 @@ class AltDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL ...@@ -542,78 +536,69 @@ class AltDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
guidance_rescale: float = 0.0, guidance_rescale: float = 0.0,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale <= 1`).
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
guidance_rescale (`float`, *optional*, defaults to 0.7): guidance_rescale (`float`, *optional*, defaults to 0.0):
Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
Flawed](https://arxiv.org/pdf/2305.08891.pdf) `guidance_scale` is defined as `φ` in equation 16. of Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
[Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). using zero terminal SNR.
Guidance rescale factor should fix overexposure when using zero terminal SNR.
Examples: Examples:
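A minimal text-to-image sketch (the `BAAI/AltDiffusion-m9` checkpoint is an assumption; any Alt Diffusion checkpoint should work the same way):
```py
import torch
from diffusers import AltDiffusionPipeline

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a fantasy landscape with a castle on a hill, highly detailed"
image = pipe(prompt, num_inference_steps=25, guidance_scale=7.5).images[0]
image.save("alt_diffusion.png")
```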
Returns: Returns:
[`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
...@@ -99,38 +99,35 @@ class AltDiffusionImg2ImgPipeline( ...@@ -99,38 +99,35 @@ class AltDiffusionImg2ImgPipeline(
DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin
): ):
r""" r"""
Pipeline for text-guided image to image generation using Alt Diffusion. Pipeline for text-guided image-to-image generation using Alt Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods: The pipeline also inherits the following loading methods:
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
- *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
- [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files
as well as the following saving methods:
- *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`]
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`RobertaSeriesModelWithTransformation`]): text_encoder ([`RobertaSeriesModelWithTransformation`]):
Frozen text-encoder. Alt Diffusion uses the text portion of Frozen text-encoder (a [`RobertaSeriesModelWithTransformation`] based on XLM-RoBERTa).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.RobertaSeriesModelWithTransformation), tokenizer ([`~transformers.XLMRobertaTokenizer`]):
specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. An `XLMRobertaTokenizer` to tokenize text.
tokenizer (`XLMRobertaTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[XLMRobertaTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.XLMRobertaTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
...@@ -226,10 +223,10 @@ class AltDiffusionImg2ImgPipeline( ...@@ -226,10 +223,10 @@ class AltDiffusionImg2ImgPipeline(
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
...@@ -587,74 +584,66 @@ class AltDiffusionImg2ImgPipeline( ...@@ -587,74 +584,66 @@ class AltDiffusionImg2ImgPipeline(
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, or tensor representing an image batch, that will be used as the starting point for the `Image` or tensor representing an image batch to be used as the starting point. Can also accept image
process. Can also accpet image latents as `image`, if passing latents directly, it will not be encoded latents as `image`, but if latents are passed directly they are not encoded again.
again.
strength (`float`, *optional*, defaults to 0.8): strength (`float`, *optional*, defaults to 0.8):
Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` Indicates the extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a
will be used as a starting point, adding more noise to it the larger the `strength`. The number of starting point and more noise is added the higher the `strength`. The number of denoising steps depends
denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
be maximum and the denoising process will run for the full number of iterations specified in process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
`num_inference_steps`. A value of 1, therefore, essentially ignores `image`. essentially ignores `image`.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. This parameter will be modulated by `strength`. expense of slower inference. This parameter is modulated by `strength`.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale <= 1`).
is less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
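A rough image-to-image sketch (the checkpoint id and image URL are assumptions; any RGB image can serve as the starting point):
```py
import torch
import requests
from io import BytesIO
from PIL import Image
from diffusers import AltDiffusionImg2ImgPipeline

pipe = AltDiffusionImg2ImgPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB").resize((768, 512))

prompt = "a fantasy landscape, trending on artstation"
# strength controls how far to move away from init_image (1.0 ignores it entirely).
image = pipe(prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("fantasy_landscape.png")
```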
Returns: Returns:
[`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.AltDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 1. Check inputs. Raise error if not correct # 1. Check inputs. Raise error if not correct
self.check_inputs(prompt, strength, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds) self.check_inputs(prompt, strength, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds)
......
...@@ -37,13 +37,20 @@ from PIL import Image # noqa: E402 ...@@ -37,13 +37,20 @@ from PIL import Image # noqa: E402
class Mel(ConfigMixin, SchedulerMixin): class Mel(ConfigMixin, SchedulerMixin):
""" """
Parameters: Parameters:
x_res (`int`): x resolution of spectrogram (time) x_res (`int`):
y_res (`int`): y resolution of spectrogram (frequency bins) x resolution of spectrogram (time).
sample_rate (`int`): sample rate of audio y_res (`int`):
n_fft (`int`): number of Fast Fourier Transforms y resolution of spectrogram (frequency bins).
hop_length (`int`): hop length (a higher number is recommended for lower than 256 y_res) sample_rate (`int`):
top_db (`int`): loudest in decibels Sample rate of audio.
n_iter (`int`): number of iterations for Griffin Linn mel inversion n_fft (`int`):
Length of the FFT window.
hop_length (`int`):
Hop length (a higher number is recommended if `y_res` < 256).
top_db (`int`):
Loudest decibel value.
n_iter (`int`):
Number of iterations for Griffin-Lim Mel inversion.
""" """
config_name = "mel_config.json" config_name = "mel_config.json"
...@@ -74,8 +81,10 @@ class Mel(ConfigMixin, SchedulerMixin): ...@@ -74,8 +81,10 @@ class Mel(ConfigMixin, SchedulerMixin):
"""Set resolution. """Set resolution.
Args: Args:
x_res (`int`): x resolution of spectrogram (time) x_res (`int`):
y_res (`int`): y resolution of spectrogram (frequency bins) x resolution of spectrogram (time).
y_res (`int`):
y resolution of spectrogram (frequency bins).
""" """
self.x_res = x_res self.x_res = x_res
self.y_res = y_res self.y_res = y_res
...@@ -86,8 +95,10 @@ class Mel(ConfigMixin, SchedulerMixin): ...@@ -86,8 +95,10 @@ class Mel(ConfigMixin, SchedulerMixin):
"""Load audio. """Load audio.
Args: Args:
audio_file (`str`): must be a file on disk due to Librosa limitation or audio_file (`str`):
raw_audio (`np.ndarray`): audio as numpy array An audio file that must be on disk due to a [Librosa](https://librosa.org/) limitation.
raw_audio (`np.ndarray`):
The raw audio as a NumPy array.
""" """
if audio_file is not None: if audio_file is not None:
self.audio, _ = librosa.load(audio_file, mono=True, sr=self.sr) self.audio, _ = librosa.load(audio_file, mono=True, sr=self.sr)
...@@ -102,7 +113,8 @@ class Mel(ConfigMixin, SchedulerMixin): ...@@ -102,7 +113,8 @@ class Mel(ConfigMixin, SchedulerMixin):
"""Get number of slices in audio. """Get number of slices in audio.
Returns: Returns:
`int`: number of spectograms audio can be sliced into `int`:
Number of spectrograms the audio can be sliced into.
""" """
return len(self.audio) // self.slice_size return len(self.audio) // self.slice_size
...@@ -110,18 +122,21 @@ class Mel(ConfigMixin, SchedulerMixin): ...@@ -110,18 +122,21 @@ class Mel(ConfigMixin, SchedulerMixin):
"""Get slice of audio. """Get slice of audio.
Args: Args:
slice (`int`): slice number of audio (out of get_number_of_slices()) slice (`int`):
Slice number of audio (out of `get_number_of_slices()`).
Returns: Returns:
`np.ndarray`: audio as numpy array `np.ndarray`:
The audio slice as a NumPy array.
""" """
return self.audio[self.slice_size * slice : self.slice_size * (slice + 1)] return self.audio[self.slice_size * slice : self.slice_size * (slice + 1)]
def get_sample_rate(self) -> int: def get_sample_rate(self) -> int:
"""Get sample rate: """Get sample rate.
Returns: Returns:
`int`: sample rate of audio `int`:
Sample rate of audio.
""" """
return self.sr return self.sr
...@@ -129,10 +144,12 @@ class Mel(ConfigMixin, SchedulerMixin): ...@@ -129,10 +144,12 @@ class Mel(ConfigMixin, SchedulerMixin):
"""Convert slice of audio to spectrogram. """Convert slice of audio to spectrogram.
Args: Args:
slice (`int`): slice number of audio to convert (out of get_number_of_slices()) slice (`int`):
Slice number of audio to convert (out of `get_number_of_slices()`).
Returns: Returns:
`PIL Image`: grayscale image of x_res x y_res `PIL Image`:
A grayscale image of `x_res x y_res`.
""" """
S = librosa.feature.melspectrogram( S = librosa.feature.melspectrogram(
y=self.get_audio_slice(slice), sr=self.sr, n_fft=self.n_fft, hop_length=self.hop_length, n_mels=self.n_mels y=self.get_audio_slice(slice), sr=self.sr, n_fft=self.n_fft, hop_length=self.hop_length, n_mels=self.n_mels
...@@ -146,10 +163,12 @@ class Mel(ConfigMixin, SchedulerMixin): ...@@ -146,10 +163,12 @@ class Mel(ConfigMixin, SchedulerMixin):
"""Converts spectrogram to audio. """Converts spectrogram to audio.
Args: Args:
image (`PIL Image`): x_res x y_res grayscale image image (`PIL Image`):
A grayscale image of size `x_res x y_res`.
Returns: Returns:
audio (`np.ndarray`): raw audio audio (`np.ndarray`):
The audio as a NumPy array.
""" """
bytedata = np.frombuffer(image.tobytes(), dtype="uint8").reshape((image.height, image.width)) bytedata = np.frombuffer(image.tobytes(), dtype="uint8").reshape((image.height, image.width))
log_S = bytedata.astype("float") * self.top_db / 255 - self.top_db log_S = bytedata.astype("float") * self.top_db / 255 - self.top_db
......
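To make the spectrogram round trip above concrete, a small sketch of the `Mel` helper (the constructor values shown are assumptions matching the documented parameters, and `test.wav` is a placeholder file):

```py
from diffusers import Mel  # assumed top-level export; otherwise import from diffusers.pipelines.audio_diffusion

mel = Mel(x_res=256, y_res=256, sample_rate=22050, n_fft=2048, hop_length=512)

# Either load a file from disk (Librosa requirement) ...
mel.load_audio(audio_file="test.wav")  # placeholder path
# ... or pass a raw waveform directly via raw_audio=<np.ndarray>.

print(mel.get_number_of_slices(), "slices at", mel.get_sample_rate(), "Hz")

image = mel.audio_slice_to_image(0)    # grayscale PIL image of x_res x y_res
roundtrip = mel.image_to_audio(image)  # back to a raw waveform (np.ndarray)
```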
...@@ -29,14 +29,21 @@ from .mel import Mel ...@@ -29,14 +29,21 @@ from .mel import Mel
class AudioDiffusionPipeline(DiffusionPipeline): class AudioDiffusionPipeline(DiffusionPipeline):
""" """
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the Pipeline for audio diffusion.
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Parameters: Parameters:
vqae ([`AutoencoderKL`]): Variational AutoEncoder for Latent Audio Diffusion or None vqvae ([`AutoencoderKL`]):
unet ([`UNet2DConditionModel`]): UNET model Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations, or `None` for non-latent audio diffusion.
mel ([`Mel`]): transform audio <-> spectrogram unet ([`UNet2DConditionModel`]):
scheduler ([`DDIMScheduler` or `DDPMScheduler`]): de-noising scheduler A `UNet2DConditionModel` to denoise the encoded image latents.
mel ([`Mel`]):
Transform audio into a spectrogram.
scheduler ([`DDIMScheduler`] or [`DDPMScheduler`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`] or [`DDPMScheduler`].
""" """
_optional_components = ["vqvae"] _optional_components = ["vqvae"]
...@@ -52,10 +59,11 @@ class AudioDiffusionPipeline(DiffusionPipeline): ...@@ -52,10 +59,11 @@ class AudioDiffusionPipeline(DiffusionPipeline):
self.register_modules(unet=unet, scheduler=scheduler, mel=mel, vqvae=vqvae) self.register_modules(unet=unet, scheduler=scheduler, mel=mel, vqvae=vqvae)
def get_default_steps(self) -> int: def get_default_steps(self) -> int:
"""Returns default number of steps recommended for inference """Returns default number of steps recommended for inference.
Returns: Returns:
`int`: number of steps `int`:
The number of steps.
""" """
return 50 if isinstance(self.scheduler, DDIMScheduler) else 1000 return 50 if isinstance(self.scheduler, DDIMScheduler) else 1000
...@@ -80,26 +88,90 @@ class AudioDiffusionPipeline(DiffusionPipeline): ...@@ -80,26 +88,90 @@ class AudioDiffusionPipeline(DiffusionPipeline):
Union[AudioPipelineOutput, ImagePipelineOutput], Union[AudioPipelineOutput, ImagePipelineOutput],
Tuple[List[Image.Image], Tuple[int, List[np.ndarray]]], Tuple[List[Image.Image], Tuple[int, List[np.ndarray]]],
]: ]:
"""Generate random mel spectrogram from audio input and convert to audio. """
The call function to the pipeline for generation.
Args: Args:
batch_size (`int`): number of samples to generate batch_size (`int`):
audio_file (`str`): must be a file on disk due to Librosa limitation or Number of samples to generate.
raw_audio (`np.ndarray`): audio as numpy array audio_file (`str`):
slice (`int`): slice number of audio to convert An audio file that must be on disk due to a [Librosa](https://librosa.org/) limitation.
start_step (int): step to start from raw_audio (`np.ndarray`):
steps (`int`): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM) The raw audio as a NumPy array.
generator (`torch.Generator`): random number generator or None slice (`int`):
mask_start_secs (`float`): number of seconds of audio to mask (not generate) at start Slice number of audio to convert.
mask_end_secs (`float`): number of seconds of audio to mask (not generate) at end start_step (int):
step_generator (`torch.Generator`): random number generator used to de-noise or None Step to start diffusion from.
eta (`float`): parameter between 0 and 1 used with DDIM scheduler steps (`int`):
noise (`torch.Tensor`): noise tensor of shape (batch_size, 1, height, width) or None Number of denoising steps (defaults to `50` for DDIM and `1000` for DDPM).
encoding (`torch.Tensor`): for UNet2DConditionModel shape (batch_size, seq_length, cross_attention_dim) generator (`torch.Generator`):
return_dict (`bool`): if True return AudioPipelineOutput, ImagePipelineOutput else Tuple A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
generation deterministic.
mask_start_secs (`float`):
Number of seconds of audio to mask (not generate) at start.
mask_end_secs (`float`):
Number of seconds of audio to mask (not generate) at end.
step_generator (`torch.Generator`):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) used to denoise.
If not provided, `generator` is used instead.
eta (`float`):
Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
noise (`torch.Tensor`):
A noise tensor of shape `(batch_size, 1, height, width)` or `None`.
encoding (`torch.Tensor`):
A tensor for [`UNet2DConditionModel`] of shape `(batch_size, seq_length, cross_attention_dim)`.
return_dict (`bool`):
Whether or not to return an [`AudioPipelineOutput`], [`ImagePipelineOutput`], or a plain tuple.
Examples:
For audio diffusion:
```py
import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
For latent audio diffusion:
```py
import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
For other tasks like variation, inpainting, outpainting, etc:
```py
output = pipe(
raw_audio=output.audios[0, 0],
start_step=int(pipe.get_default_steps() / 2),
mask_start_secs=1,
mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```
Returns: Returns:
`List[PIL Image]`: mel spectrograms (`float`, `List[np.ndarray]`): sample rate and raw audios `List[PIL Image]`:
A list of Mel spectrogram images and a tuple of the sample rate and the raw audio (`int`, `List[np.ndarray]`),
or an [`AudioPipelineOutput`]/[`ImagePipelineOutput`] when `return_dict` is `True`.
""" """
steps = steps or self.get_default_steps() steps = steps or self.get_default_steps()
...@@ -197,14 +269,18 @@ class AudioDiffusionPipeline(DiffusionPipeline): ...@@ -197,14 +269,18 @@ class AudioDiffusionPipeline(DiffusionPipeline):
@torch.no_grad() @torch.no_grad()
def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray: def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray:
"""Reverse step process: recover noisy image from generated image. """
Reverse the denoising step process to recover a noisy image from the generated image.
Args: Args:
images (`List[PIL Image]`): list of images to encode images (`List[PIL Image]`):
steps (`int`): number of encoding steps to perform (defaults to 50) List of images to encode.
steps (`int`):
Number of encoding steps to perform (defaults to `50`).
Returns: Returns:
`np.ndarray`: noise tensor of shape (batch_size, 1, height, width) `np.ndarray`:
A noise tensor of shape `(batch_size, 1, height, width)`.
""" """
# Only works with DDIM as this method is deterministic # Only works with DDIM as this method is deterministic
...@@ -234,15 +310,19 @@ class AudioDiffusionPipeline(DiffusionPipeline): ...@@ -234,15 +310,19 @@ class AudioDiffusionPipeline(DiffusionPipeline):
@staticmethod @staticmethod
def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor: def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float) -> torch.Tensor:
"""Spherical Linear intERPolation """Spherical Linear intERPolation.
Args: Args:
x0 (`torch.Tensor`): first tensor to interpolate between x0 (`torch.Tensor`):
x1 (`torch.Tensor`): seconds tensor to interpolate between The first tensor to interpolate between.
alpha (`float`): interpolation between 0 and 1 x1 (`torch.Tensor`):
Second tensor to interpolate between.
alpha (`float`):
Interpolation factor between 0 and 1.
Returns: Returns:
`torch.Tensor`: interpolated tensor `torch.Tensor`:
The interpolated tensor.
""" """
theta = acos(torch.dot(torch.flatten(x0), torch.flatten(x1)) / torch.norm(x0) / torch.norm(x1)) theta = acos(torch.dot(torch.flatten(x0), torch.flatten(x1)) / torch.norm(x0) / torch.norm(x1))
......
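Putting `encode` and `slerp` together gives the usual noise-space interpolation recipe. A sketch, not verified end to end, assuming a DDIM-based checkpoint such as `teticio/audio-diffusion-ddim-256` (`encode` is only deterministic with DDIM):

```py
import torch
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to(device)

# Generate two samples and map each one back to its noise tensor.
first = pipe(steps=50)
second = pipe(steps=50)
noise_a = torch.as_tensor(pipe.encode(first.images, steps=50)).to(device)
noise_b = torch.as_tensor(pipe.encode(second.images, steps=50)).to(device)

# Walk halfway between the two points in noise space and decode again.
noise_mid = pipe.slerp(noise_a, noise_b, 0.5)
blended = pipe(noise=noise_mid, steps=50)
blended.images[0].save("interpolated_spectrogram.png")
```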
...@@ -31,14 +31,19 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -31,14 +31,19 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """ EXAMPLE_DOC_STRING = """
Examples: Examples:
```py ```py
>>> import torch
>>> from diffusers import AudioLDMPipeline >>> from diffusers import AudioLDMPipeline
>>> import torch
>>> import scipy
>>> pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm", torch_dtype=torch.float16) >>> repo_id = "cvssp/audioldm-s-full-v2"
>>> pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda") >>> pipe = pipe.to("cuda")
>>> prompt = "A hammer hitting a wooden surface" >>> prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
>>> audio = pipe(prompt).audios[0] >>> audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
>>> # save the audio sample as a .wav file
>>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
``` ```
""" """
...@@ -47,26 +52,24 @@ class AudioLDMPipeline(DiffusionPipeline): ...@@ -47,26 +52,24 @@ class AudioLDMPipeline(DiffusionPipeline):
r""" r"""
Pipeline for text-to-audio generation using AudioLDM. Pipeline for text-to-audio generation using AudioLDM.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode audios to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`ClapTextModelWithProjection`]): text_encoder ([`~transformers.ClapTextModelWithProjection`]):
Frozen text-encoder. AudioLDM uses the text portion of Frozen text-encoder (`ClapTextModelWithProjection`, specifically the
[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap#transformers.ClapTextModelWithProjection), [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused) variant).
specifically the [RoBERTa HSTAT-unfused](https://huggingface.co/laion/clap-htsat-unfused) variant.
tokenizer ([`PreTrainedTokenizer`]): tokenizer ([`PreTrainedTokenizer`]):
Tokenizer of class A [`~transformers.RobertaTokenizer`] to tokenize text.
[RobertaTokenizer](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer). unet ([`UNet2DConditionModel`]):
unet ([`UNet2DConditionModel`]): U-Net architecture to denoise the encoded audio latents. A `UNet2DConditionModel` to denoise the encoded audio latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
vocoder ([`SpeechT5HifiGan`]): vocoder ([`~transformers.SpeechT5HifiGan`]):
Vocoder of class Vocoder of class `SpeechT5HifiGan`.
[SpeechT5HifiGan](https://huggingface.co/docs/transformers/main/en/model_doc/speecht5#transformers.SpeechT5HifiGan).
""" """
def __init__( def __init__(
...@@ -93,17 +96,15 @@ class AudioLDMPipeline(DiffusionPipeline): ...@@ -93,17 +96,15 @@ class AudioLDMPipeline(DiffusionPipeline):
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -382,70 +383,62 @@ class AudioLDMPipeline(DiffusionPipeline): ...@@ -382,70 +383,62 @@ class AudioLDMPipeline(DiffusionPipeline):
output_type: Optional[str] = "np", output_type: Optional[str] = "np",
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the audio generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide audio generation. If not defined, you need to pass `prompt_embeds`.
instead.
audio_length_in_s (`int`, *optional*, defaults to 5.12): audio_length_in_s (`int`, *optional*, defaults to 5.12):
The length of the generated audio sample in seconds. The length of the generated audio sample in seconds.
num_inference_steps (`int`, *optional*, defaults to 10): num_inference_steps (`int`, *optional*, defaults to 10):
The number of denoising steps. More denoising steps usually lead to a higher quality audio at the The number of denoising steps. More denoising steps usually lead to a higher quality audio at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 2.5): guidance_scale (`float`, *optional*, defaults to 2.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate audio that is closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower sound quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate audios that are closely linked to the text `prompt`,
usually at the expense of lower sound quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the audio generation. If not defined, one has to pass The prompt or prompts to guide what to not include in audio generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale <= 1`).
less than `1`).
num_waveforms_per_prompt (`int`, *optional*, defaults to 1): num_waveforms_per_prompt (`int`, *optional*, defaults to 1):
The number of waveforms to generate per prompt. The number of waveforms to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for audio Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for audio
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttnProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
output_type (`str`, *optional*, defaults to `"np"`): output_type (`str`, *optional*, defaults to `"np"`):
The output format of the generate image. Choose between: The output format of the generated audio. Choose between `"np"` to return a NumPy `np.ndarray` or
- `"np"`: Return Numpy `np.ndarray` objects. `"pt"` to return a PyTorch `torch.Tensor` object.
- `"pt"`: Return PyTorch `torch.Tensor` objects.
Examples: Examples:
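A short sketch of negative prompting and batching several waveforms (checkpoint as in the example at the top of this file):
```py
>>> import torch
>>> from diffusers import AudioLDMPipeline

>>> pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2", torch_dtype=torch.float16).to("cuda")

>>> # num_waveforms_per_prompt returns several candidates for the same prompt.
>>> audios = pipe(
...     "A gentle acoustic guitar melody",
...     negative_prompt="low quality, distortion",
...     num_inference_steps=20,
...     audio_length_in_s=5.0,
...     num_waveforms_per_prompt=2,
... ).audios
```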
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated audios. otherwise a `tuple` is returned where the first element is a list with the generated audio.
""" """
# 0. Convert audio input length from seconds to spectrogram height # 0. Convert audio input length from seconds to spectrogram height
vocoder_upsample_factor = np.prod(self.vocoder.config.upsample_rates) / self.vocoder.config.sampling_rate vocoder_upsample_factor = np.prod(self.vocoder.config.upsample_rates) / self.vocoder.config.sampling_rate
......
...@@ -50,20 +50,17 @@ EXAMPLE_DOC_STRING = """ ...@@ -50,20 +50,17 @@ EXAMPLE_DOC_STRING = """
class ConsistencyModelPipeline(DiffusionPipeline): class ConsistencyModelPipeline(DiffusionPipeline):
r""" r"""
Pipeline for consistency models for unconditional or class-conditional image generation, as introduced in [1]. Pipeline for unconditional or class-conditional image generation.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
[1] Song, Yang and Dhariwal, Prafulla and Chen, Mark and Sutskever, Ilya. "Consistency Models"
https://arxiv.org/pdf/2303.01469
Args: Args:
unet ([`UNet2DModel`]): unet ([`UNet2DModel`]):
Unconditional or class-conditional U-Net architecture to denoise image latents. A `UNet2DModel` to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the image latents. Currently only compatible A scheduler to be used in combination with `unet` to denoise the encoded image latents. Currently only
with [`CMStochasticIterativeScheduler`]. compatible with [`CMStochasticIterativeScheduler`].
""" """
def __init__(self, unet: UNet2DModel, scheduler: CMStochasticIterativeScheduler) -> None: def __init__(self, unet: UNet2DModel, scheduler: CMStochasticIterativeScheduler) -> None:
...@@ -78,10 +75,10 @@ class ConsistencyModelPipeline(DiffusionPipeline):
def enable_model_cpu_offload(self, gpu_id=0):
r"""
Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
time to the GPU when its `forward` method is called, and the model remains on the GPU until the next model runs.
Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
iterative execution of the `unet`.
"""
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook
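A short sketch of how this method is typically used in place of moving the pipeline to the GPU yourself. It requires `accelerate` >= 0.17.0, and the checkpoint name below is an illustrative assumption:

```python
import torch
from diffusers import ConsistencyModelPipeline

pipe = ConsistencyModelPipeline.from_pretrained(
    "openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16
)
# Do not call pipe.to("cuda") here; offloading manages device placement itself.
pipe.enable_model_cpu_offload()

image = pipe(num_inference_steps=1).images[0]
```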
...@@ -201,8 +198,8 @@ class ConsistencyModelPipeline(DiffusionPipeline):
batch_size (`int`, *optional*, defaults to 1):
The number of images to generate.
class_labels (`torch.Tensor` or `List[int]` or `int`, *optional*):
Optional class labels for conditioning class-conditional consistency models. Not used if the model is
not class-conditional.
num_inference_steps (`int`, *optional*, defaults to 1):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
...@@ -210,29 +207,29 @@ class ConsistencyModelPipeline(DiffusionPipeline):
Custom timesteps to use for the denoising process. If not defined, equally spaced `num_inference_steps`
timesteps are used. Must be in descending order.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between `PIL.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
callback (`Callable`, *optional*):
A function called every `callback_steps` steps during inference with the following arguments:
`callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function is called. If not specified, the callback is called at
every step.
Examples:
Returns:
[`~pipelines.ImagePipelineOutput`] or `tuple`:
If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
returned where the first element is a list with the generated images.
"""
# 0. Prepare call parameters
img_size = self.unet.config.sample_size
...
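A sketch tying the arguments documented above together: class-conditional, deterministic generation with a progress callback. The checkpoint name and class label are assumptions for illustration:

```python
import torch
from diffusers import ConsistencyModelPipeline

# Assumed checkpoint for illustration
pipe = ConsistencyModelPipeline.from_pretrained(
    "openai/diffusers-cd_imagenet64_l2", torch_dtype=torch.float16
).to("cuda")

def log_step(step: int, timestep: int, latents: torch.FloatTensor):
    # Called every `callback_steps` steps with the current latents
    print(f"step {step}, timestep {timestep}, latents {tuple(latents.shape)}")

output = pipe(
    batch_size=2,
    class_labels=207,          # ignored when the model is not class-conditional
    num_inference_steps=2,     # or pass explicit `timesteps=[...]` in descending order
    generator=torch.Generator("cuda").manual_seed(0),
    callback=log_step,
    callback_steps=1,
    output_type="pil",
)
images = output.images        # return_dict=True (default) gives an ImagePipelineOutput
```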
...@@ -181,17 +181,15 @@ class StableDiffusionControlNetPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self):
r"""
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
"""
self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self):
r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_slicing()
...@@ -199,17 +197,16 @@ class StableDiffusionControlNetPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self):
r"""
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
processing larger images.
"""
self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self):
r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_tiling()
...
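The four toggles above come in matching enable/disable pairs, and the same methods are copied into the img2img, inpaint, and SDXL ControlNet pipelines below. A short sketch of how they are typically used (the model ids are assumptions for illustration; any Stable Diffusion + ControlNet pair with a VAE works):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_slicing()   # decode the batch slice by slice -> larger batch sizes fit in memory
pipe.enable_vae_tiling()    # decode/encode tile by tile -> higher resolutions fit in memory

# ... run inference here ...

pipe.disable_vae_slicing()  # restore single-step decoding
pipe.disable_vae_tiling()
```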
...@@ -207,17 +207,15 @@ class StableDiffusionControlNetImg2ImgPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self):
r"""
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
"""
self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self):
r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_slicing()
...@@ -225,17 +223,16 @@ class StableDiffusionControlNetImg2ImgPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self):
r"""
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
processing larger images.
"""
self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self):
r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_tiling()
...
...@@ -324,17 +324,15 @@ class StableDiffusionControlNetInpaintPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self):
r"""
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
"""
self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self):
r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_slicing()
...@@ -342,17 +340,16 @@ class StableDiffusionControlNetInpaintPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self):
r"""
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
processing larger images.
"""
self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self):
r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_tiling()
...
...@@ -136,17 +136,15 @@ class StableDiffusionXLControlNetPipeline(DiffusionPipeline, TextualInversionLoa
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self):
r"""
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
"""
self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self):
r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_slicing()
...@@ -154,17 +152,16 @@ class StableDiffusionXLControlNetPipeline(DiffusionPipeline, TextualInversionLoa
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self):
r"""
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
processing larger images.
"""
self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self):
r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step.
"""
self.vae.disable_tiling()
...