[`Docs`] Fix typos and update files at API's Pipelines page 2 (#5748)

* Fix typos, update, add Copyright info, and trim trailing whitespace
* Update docs/source/en/api/pipelines/text_to_video_zero.md

  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* 1 second is not a long video, but 6 seconds is
* Update text_to_video_zero.md
* Update text_to_video_zero.md
* Update text_to_video_zero.md
* Update wuerstchen.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

Commit ecbe27a0 (unverified), authored Nov 15, 2023 by M. Tolga Cangöz, committed by GitHub Nov 15, 2023
7 changed files
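
The commit message mentions lengthening the chunked example in `text_to_video_zero.md`, and the diff sets `video_length = 24` and `chunk_size = 8` with a comment equating 24 frames at 4 fps to 6 seconds. The sketch below checks that arithmetic and shows one way the frames could be split into chunks. The helper `plan_chunks` is hypothetical, and the chunking rule (frame 0 prepended to every chunk as an anchor, so each chunk adds at most `chunk_size - 1` new frames) is an assumption, because the loop body falls outside the hunks shown in this diff.

```python
def plan_chunks(video_length: int, chunk_size: int) -> list[list[int]]:
    """Split frame indices into chunks that each start with anchor frame 0.

    Assumption (not visible in this diff): every chunk carries frame 0 plus
    up to chunk_size - 1 new frames, so one pipeline call never sees more
    than chunk_size frames.
    """
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 (anchor + one new frame)")
    step = chunk_size - 1
    chunks = []
    for start in range(0, video_length, step):
        end = min(start + step, video_length)
        # Prepend the anchor frame; the first chunk therefore lists frame 0 twice.
        chunks.append([0] + list(range(start, end)))
    return chunks


video_length = 24  # frames, as in the updated example
chunk_size = 8     # frames per pipeline call, as in the updated example
fps = 4            # playback rate taken from the comment in the diff

chunks = plan_chunks(video_length, chunk_size)
# Count only the new frames of each chunk, dropping the prepended anchor.
new_frames = sum(len(chunk) - 1 for chunk in chunks)
duration_s = new_frames / fps

print(f"{len(chunks)} chunks, {new_frames} frames, {duration_s} s")
```

Under these assumptions the 24 frames are covered by four pipeline calls of at most 8 frames each, which keeps the per-call memory footprint fixed while the total clip grows.
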
--- a/docs/source/en/api/pipelines/text_to_video.md
+++ b/docs/source/en/api/pipelines/text_to_video.md
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 <Tip warning={true}>

-🧪 This pipeline is for research purposes only. 
+🧪 This pipeline is for research purposes only.

 </Tip>

@@ -26,13 +26,13 @@ The abstract from the paper is:

 You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense).

-## Usage example 
+## Usage example

 ### `text-to-video-ms-1.7b`

 Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

-```python 
+```python
 import torch
 from diffusers import DiffusionPipeline
 from diffusers.utils import export_to_video
@@ -88,7 +88,7 @@ video_path = export_to_video(video_frames)
 video_path
 ```

-Here are some sample outputs: 
+Here are some sample outputs:

 <table>
    <tr>
@@ -118,8 +118,9 @@ which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zero

 ```py
 import torch
-from diffusers import DiffusionPipeline
+from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
 from diffusers.utils import export_to_video
+from PIL import Image

 pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
 pipe.enable_model_cpu_offload()
@@ -152,7 +153,7 @@ video_path = export_to_video(video_frames)
 video_path
 ```

-Here are some sample outputs: 
+Here are some sample outputs:

 <table>
    <tr>
@@ -166,6 +167,12 @@ Here are some sample outputs:
    </tr>
 </table>

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## TextToVideoSDPipeline
 [[autodoc]] TextToVideoSDPipeline
 	- all

--- a/docs/source/en/api/pipelines/text_to_video_zero.md
+++ b/docs/source/en/api/pipelines/text_to_video_zero.md
@@ -12,12 +12,7 @@ specific language governing permissions and limitations under the License.

 # Text2Video-Zero

-[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by
-Levon Khachatryan,
-Andranik Movsisyan,
-Vahram Tadevosyan,
-Roberto Henschel,
-[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).
+[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).

 Text2Video-Zero enables zero-shot video generation using either:
 1. A textual prompt
@@ -35,16 +30,15 @@ Our key modifications include (i) enriching the latent codes of the generated fr
 Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.
 As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.*

-You can find additional information about Text-to-Video Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero).
+You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero).

 ## Usage example

 ### Text-To-Video

-To generate a video from prompt, run the following python command
+To generate a video from prompt, run the following Python code:
 ```python
 import torch
-import imageio
 from diffusers import TextToVideoZeroPipeline

 model_id = "runwayml/stable-diffusion-v1-5"
@@ -63,18 +57,17 @@ You can change these parameters in the pipeline call:
 * Video length:
    * `video_length`, the number of frames video_length to be generated. Default: `video_length=8`

-We an also generate longer videos by doing the processing in a chunk-by-chunk manner:
+We can also generate longer videos by doing the processing in a chunk-by-chunk manner:
 ```python
 import torch
-import imageio
 from diffusers import TextToVideoZeroPipeline
 import numpy as np

 model_id = "runwayml/stable-diffusion-v1-5"
 pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
 seed = 0
-video_length = 8
-chunk_size = 4
+video_length = 24  #24 ÷ 4fps = 6 seconds
+chunk_size = 8
 prompt = "A panda is playing guitar on times square"

 # Generate the video chunk-by-chunk
@@ -122,7 +115,7 @@ To generate a video from prompt with additional pose control
    frame_count = 8
    pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
    ```
-    To extract pose from actual video, read [ControlNet documentation](./stable_diffusion/controlnet).
+    To extract pose from actual video, read [ControlNet documentation](controlnet).

 3. Run `StableDiffusionControlNetPipeline` with our custom attention processor

@@ -152,13 +145,12 @@ To generate a video from prompt with additional pose control

 ### Text-To-Video with Edge Control

-To generate a video from prompt with additional pose control,
-follow the steps described above for pose-guided generation using [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).
+To generate a video from prompt with additional Canny edge control, follow the same steps described above for pose-guided generation using [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).


 ### Video Instruct-Pix2Pix

-To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/pix2pix)):
+To perform text-guided video editing (with [InstructPix2Pix](pix2pix)):

 1. Download a demo video

@@ -196,12 +188,12 @@ To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/
    ```


-### DreamBooth specialization 
+### DreamBooth specialization

 Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control**
-can run with custom [DreamBooth](../training/dreambooth) models, as shown below for
+can run with custom [DreamBooth](../../training/dreambooth) models, as shown below for
 [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and
-[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model
+[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model:

 1. Download a demo video

@@ -250,6 +242,11 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below

 You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth).

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>

 ## TextToVideoZeroPipeline
 [[autodoc]] TextToVideoZeroPipeline
@@ -257,4 +254,4 @@ You can filter out some available DreamBooth-trained models with [this link](htt
 	- __call__

 ## TextToVideoPipelineOutput
-[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
\ No newline at end of file
+[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
--- a/docs/source/en/api/pipelines/unclip.md
+++ b/docs/source/en/api/pipelines/unclip.md
@@ -9,13 +9,13 @@ specific language governing permissions and limitations under the License.

 # unCLIP

-[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo]((https://github.com/kakaobrain/karlo)).
+[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo).

 The abstract from the paper is following:

 *Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*

-You can find lucidrains DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch).
+You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch).

 <Tip>


--- a/docs/source/en/api/pipelines/unidiffuser.md
+++ b/docs/source/en/api/pipelines/unidiffuser.md
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.

 The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.

-The abstract from the [paper](https://arxiv.org/abs/2303.06555) is:
+The abstract from the paper is:

 *This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).*

@@ -54,7 +54,7 @@ image.save("unidiffuser_joint_sample_image.png")
 print(text)
 ```

-This is also called "joint" generation in the UniDiffusers paper, since we are sampling from the joint image-text distribution.
+This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.

 Note that the generation task is inferred from the inputs used when calling the pipeline.
 It is also possible to manually specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]:
@@ -65,7 +65,7 @@ pipe.set_joint_mode()
 sample = pipe(num_inference_steps=20, guidance_scale=8.0)
 ```

-When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting the infer the mode.
+When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode.
 You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode.

 You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation since we sample from the marginal distribution of images and text, respectively):
@@ -100,7 +100,7 @@ prompt = "an elephant under the sea"

 sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
 t2i_image = sample.images[0]
-t2i_image.save("unidiffuser_text2img_sample_image.png")
+t2i_image
 ```

 The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`].
@@ -133,7 +133,7 @@ The `img2text` mode requires that an input `image` be supplied. You can set the

 ### Image Variation

-The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and the perform a text-to-image generation on the outputs of the first generation.
+The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the outputs of the first generation.
 This produces a new image which is semantically similar to the input image:

 ```python
@@ -147,7 +147,7 @@ model_id_or_path = "thu-ml/unidiffuser-v1"
 pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
 pipe.to(device)

-# Image variation can be performed with a image-to-text generation followed by a text-to-image generation:
+# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
 # 1. Image-to-text generation
 image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
 init_image = load_image(image_url).resize((512, 512))
@@ -164,7 +164,6 @@ final_image.save("unidiffuser_image_variation_sample.png")

 ### Text Variation

-
 Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by a image-to-text generation:

 ```python
@@ -191,10 +190,16 @@ final_prompt = sample.text[0]
 print(final_prompt)
 ```

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## UniDiffuserPipeline
 [[autodoc]] UniDiffuserPipeline
 	- all
 	- __call__

 ## ImageTextPipelineOutput
-[[autodoc]] pipelines.ImageTextPipelineOutput
\ No newline at end of file
+[[autodoc]] pipelines.ImageTextPipelineOutput
--- a/docs/source/en/api/pipelines/value_guided_sampling.md
+++ b/docs/source/en/api/pipelines/value_guided_sampling.md
@@ -22,11 +22,17 @@ This pipeline is based on the [Planning with Diffusion for Flexible Behavior Syn

 The abstract from the paper is:

-*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility*.
+*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.*

-You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb). 
+You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb).

 The script to run the model is available [here](https://github.com/huggingface/diffusers/tree/main/examples/reinforcement_learning).

+<Tip>
+
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
+
 ## ValueGuidedRLPipeline
-[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline
\ No newline at end of file
+[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline
--- a/docs/source/en/api/pipelines/versatile_diffusion.md
+++ b/docs/source/en/api/pipelines/versatile_diffusion.md
@@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License.

 # Versatile Diffusion

-Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi .
+Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi.

 The abstract from the paper is:

-*The recent advances in diffusion models have set an impressive milestone in many generation tasks. Trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest in academia and industry. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-flow network, dubbed Versatile Diffusion (VD), that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model. Moreover, we generalize VD to a unified multi-flow multimodal diffusion framework with grouped layers, swappable streams, and other propositions that can process modalities beyond images and text. Through our experiments, we demonstrate that VD and its underlying framework have the following merits: a) VD handles all subtasks with competitive quality; b) VD initiates novel extensions and applications such as disentanglement of style and semantic, image-text dual-guided generation, etc.; c) Through these experiments and applications, VD provides more semantic insights of the generated outputs.*
+*Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable the crossmodal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.; c) The success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research.*

 ## Tips


--- a/docs/source/en/api/pipelines/wuerstchen.md
+++ b/docs/source/en/api/pipelines/wuerstchen.md
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
 # Würstchen

 <img src="https://github.com/dome272/Wuerstchen/assets/61938694/0617c863-165a-43ee-9303-2a17299a0cf9">

-[Würstchen: Efficient Pretraining of Text-to-Image Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville.
+[Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville.

 The abstract from the paper is:

-*We introduce Würstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach, which utilizes latent diffusion strategies at strong latent image compression rates, significantly reduces the computational burden, typically associated with state-of-the-art models, while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, thereby rendering real-time applications more viable. One of the key advantages of our method lies in its modest training requirements of only 9,200 GPU hours, slashing the usual costs significantly without compromising the end performance. In a comparison against the state-of-the-art, we found the approach to yield strong competitiveness. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, hence democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in the realm of text-to-image synthesis, offering an innovative path to explore in future research.*
+*We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.*

 ## Würstchen Overview
-Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by magnitudes. Training on 1024x1024 images is way more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637) ). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference.
+Würstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by magnitudes. Training on 1024x1024 images is way more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference.

 ## Würstchen v2 comes to Diffusers

@@ -21,7 +33,7 @@ After the initial paper release, we have improved numerous things in the archite
 - Better quality


-We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are: 
+We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:

 - v2-base
 - v2-aesthetic
@@ -45,7 +57,7 @@ pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dty

 caption = "Anthropomorphic cat dressed as a fire fighter"
 images = pipe(
-    caption, 
+    caption,
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
@@ -90,7 +102,8 @@ decoder_output = decoder_pipeline(
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
-).images
+).images[0]
+decoder_output
 ```

 ## Speed-Up Inference
@@ -113,6 +126,7 @@ after 1024x1024 is 1152x1152

 The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen).

+
 ## WuerstchenCombinedPipeline

 [[autodoc]] WuerstchenCombinedPipeline
@@ -139,8 +153,8 @@ The original codebase, as well as experimental ideas, can be found at [dome272/W

 ```bibtex
      @misc{pernias2023wuerstchen,
-            title={Wuerstchen: Efficient Pretraining of Text-to-Image Models}, 
-            author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville},
+            title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models},
+            author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville},
            year={2023},
            eprint={2306.00637},
            archivePrefix={arXiv},