[docs] General updates (#5378)

* first draft * feedback * feedback

Commit 7c3a75a1 authored Oct 24, 2023 by Steven Liu; committed by GitHub Oct 24, 2023
9 changed files
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -81,8 +81,8 @@
    - local: using-diffusers/custom_pipeline_examples
      title: Community pipelines
    - local: using-diffusers/contribute_pipeline
-      title: How to contribute a community pipeline
+      title: Contribute a community pipeline
-    title: Pipelines for Inference
+    title: Specific pipeline examples
  - sections:
    - local: training/overview
      title: Overview
@@ -168,8 +168,6 @@
      title: Custom normalization layers
    - local: api/attnprocessor
      title: Attention Processor
-    - local: api/diffusion_pipeline
-      title: Diffusion Pipeline
    - local: api/logging
      title: Logging
    - local: api/configuration

--- a/docs/source/en/api/diffusion_pipeline.md
+++ b/docs/source/en/api/diffusion_pipeline.md
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# Pipelines
-The [`DiffusionPipeline`] is the quickest way to load any pretrained diffusion pipeline from the [Hub](https://huggingface.co/models?library=diffusers) for inference.
-<Tip>
-You shouldn't use the [`DiffusionPipeline`] class for training or finetuning a diffusion model. Individual 
-components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead.
-</Tip>
-The pipeline type (for example [`StableDiffusionPipeline`]) of any diffusion pipeline loaded with [`~DiffusionPipeline.from_pretrained`] is automatically 
-detected and pipeline components are loaded and passed to the `__init__` function of the pipeline.
-Any pipeline object can be saved locally with [`~DiffusionPipeline.save_pretrained`].
-## DiffusionPipeline
-[[autodoc]] DiffusionPipeline
-	- all
-	- __call__
-	- device
-	- to
-	- components
--- a/docs/source/en/api/pipelines/overview.md
+++ b/docs/source/en/api/pipelines/overview.md
@@ -12,16 +12,74 @@ specific language governing permissions and limitations under the License.
 # Pipelines
-Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different scheduler or even model components.
+Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different schedulers or even model components.
-All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components.
+All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. Specific pipeline types (for example [`StableDiffusionPipeline`]) loaded with [`~DiffusionPipeline.from_pretrained`] are automatically detected and the pipeline components are loaded and passed to the `__init__` function of the pipeline.
 <Tip warning={true}>
-Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../traininig/overview) guides instead!
+You shouldn't use the [`DiffusionPipeline`] class for training. Individual components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead.
+<br>
+Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../../training/overview) guides instead!
 </Tip>
+The table below lists all the pipelines currently available in 🤗 Diffusers and the tasks they support. Click on a pipeline to view its abstract and published paper.
+| Pipeline | Tasks |
+|---|---|
+| [AltDiffusion](alt_diffusion) | image2image |
+| [Attend-and-Excite](attend_and_excite) | text2image |
+| [Audio Diffusion](audio_diffusion) | image2audio |
+| [AudioLDM](audioldm) | text2audio |
+| [AudioLDM2](audioldm2) | text2audio |
+| [BLIP Diffusion](blip_diffusion) | text2image |
+| [Consistency Models](consistency_models) | unconditional image generation |
+| [ControlNet](controlnet) | text2image, image2image, inpainting |
+| [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image |
+| [Cycle Diffusion](cycle_diffusion) | image2image |
+| [Dance Diffusion](dance_diffusion) | unconditional audio generation |
+| [DDIM](ddim) | unconditional image generation |
+| [DDPM](ddpm) | unconditional image generation |
+| [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution |
+| [DiffEdit](diffedit) | inpainting |
+| [DiT](dit) | text2image |
+| [GLIGEN](gligen) | text2image |
+| [InstructPix2Pix](pix2pix) | image editing |
+| [Kandinsky](kandinsky) | text2image, image2image, inpainting, interpolation |
+| [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting |
+| [Latent Diffusion](latent_diffusion) | text2image, super-resolution |
+| [LDM3D](ldm3d_diffusion) | text2image, text-to-3D |
+| [MultiDiffusion](panorama) | text2image |
+| [MusicLDM](musicldm) | text2audio |
+| [PaintByExample](paint_by_example) | inpainting |
+| [ParaDiGMS](paradigms) | text2image |
+| [Pix2Pix Zero](pix2pix_zero) | image editing |
+| [PNDM](pndm) | unconditional image generation |
+| [RePaint](repaint) | inpainting |
+| [ScoreSdeVe](score_sde_ve) | unconditional image generation |
+| [Self-Attention Guidance](self_attention_guidance) | text2image |
+| [Semantic Guidance](semantic_stable_diffusion) | text2image |
+| [Shap-E](shap_e) | text-to-3D, image-to-3D |
+| [Spectrogram Diffusion](spectrogram_diffusion) |  |
+| [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution |
+| [Stable Diffusion Model Editing](model_editing) | model editing |
+| [Stable Diffusion XL](stable_diffusion_xl) | text2image, image2image, inpainting |
+| [Stable unCLIP](stable_unclip) | text2image, image variation |
+| [KarrasVe](karras_ve) | unconditional image generation |
+| [T2I Adapter](adapter) | text2image |
+| [Text2Video](text_to_video) | text2video, video2video |
+| [Text2Video Zero](text_to_video_zero) | text2video |
+| [UnCLIP](unclip) | text2image, image variation |
+| [Unconditional Latent Diffusion](latent_diffusion_uncond) | unconditional image generation |
+| [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional audio generation |
+| [Value-guided planning](value_guided_sampling) | value guided sampling |
+| [Versatile Diffusion](versatile_diffusion) | text2image, image variation |
+| [VQ Diffusion](vq_diffusion) | text2image |
+| [Wuerstchen](wuerstchen) | text2image |
 ## DiffusionPipeline
 [[autodoc]] DiffusionPipeline
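
Note on the sentence added to `overview.md` above: it says the specific pipeline type is detected by `from_pretrained` and the loaded components are passed to the pipeline's `__init__`. The sketch below illustrates that dispatch idea using only the Python standard library. It assumes a pipeline repository's `model_index.json` stores the pipeline class name under `_class_name` and lists each component as a `[library, class]` pair. The `Toy*` classes, `REGISTRY`, `load_component`, and `load_from_index` are hypothetical names made up for this illustration; they are not the 🤗 Diffusers implementation.

```python
import json


class ToyPipeline:
    """Hypothetical stand-in for a pipeline base class (not the diffusers API)."""

    def __init__(self, **components):
        # Keep the components that were passed in, keyed by name.
        self.components = components


class ToyStableDiffusionPipeline(ToyPipeline):
    pass


class ToyDDPMPipeline(ToyPipeline):
    pass


# Hypothetical registry mapping the class name stored in the index to a class.
REGISTRY = {
    "StableDiffusionPipeline": ToyStableDiffusionPipeline,
    "DDPMPipeline": ToyDDPMPipeline,
}


def load_component(name, spec):
    """Pretend to load one component; a real loader would build a model or scheduler."""
    library, class_name = spec
    return f"{library}.{class_name}"


def load_from_index(index_json):
    """Pick the pipeline class named in the index and build it from its components."""
    index = json.loads(index_json)
    pipeline_cls = REGISTRY[index["_class_name"]]
    # Keys starting with "_" are treated as metadata; every other key is a component.
    components = {
        name: load_component(name, spec)
        for name, spec in index.items()
        if not name.startswith("_")
    }
    return pipeline_cls(**components)


# Assumed shape of an index file, reduced to two components for brevity.
index_json = json.dumps(
    {
        "_class_name": "StableDiffusionPipeline",
        "unet": ["diffusers", "UNet2DConditionModel"],
        "scheduler": ["diffusers", "PNDMScheduler"],
    }
)

pipe = load_from_index(index_json)
print(type(pipe).__name__)  # ToyStableDiffusionPipeline
```

The caller never names the concrete class: the index decides which class is built, which is why a single `from_pretrained` entry point can serve every pipeline type.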

--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -22,7 +22,7 @@ specific language governing permissions and limitations under the License.
 The library has three main components:
-- State-of-the-art [diffusion pipelines](api/pipelines/overview) for inference with just a few lines of code.
+- State-of-the-art diffusion pipelines for inference with just a few lines of code. There are many pipelines in 🤗 Diffusers; check out the table in the pipeline [overview](api/pipelines/overview) for a complete list of available pipelines and the tasks they solve.
 - Interchangeable [noise schedulers](api/schedulers/overview) for balancing trade-offs between generation speed and quality.
 - Pretrained [models](api/models) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.
@@ -45,54 +45,4 @@ The library has three main components:
      <p class="text-gray-700">Technical descriptions of how 🤗 Diffusers classes and methods work.</p>
    </a>
  </div>
 </div>
\ No newline at end of file
-## Supported pipelines
-| Pipeline | Paper/Repository | Tasks |
-|---|---|:---:|
-| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation |
-| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation |
-| [controlnet](./api/pipelines/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation |
-| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation |
-| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation |
-| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation |
-| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation |
-| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation |
-| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation |
-| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation |
-| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation |
-| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image |
-| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation |
-| [paint_by_example](./api/pipelines/paint_by_example) | [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting |
-| [pndm](./api/pipelines/pndm) | [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation |
-| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
-| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation |
-| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation |
-| [stable_diffusion_adapter](./api/pipelines/stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation | -
-| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation |
-| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation |
-| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting |
-| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [MultiDiffusion](https://multidiffusion.github.io/) | Text-to-Panorama Generation |
-| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800)  | Text-Guided Image Editing|
-| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing |
-| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation |
-| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation |
-| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation |
-| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image |
-| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing |
-| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation |
-| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting |
-| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Depth-Conditional Stable Diffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation |
-| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image |
-| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [Safe Stable Diffusion](https://arxiv.org/abs/2211.05105) | Text-Guided Generation |
-| [stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation |
-| [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation |
-| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
-| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
-| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation |
-| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
-| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
-| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
-| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
-| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation |
--- a/docs/source/en/installation.md
+++ b/docs/source/en/installation.md
@@ -12,12 +12,10 @@ specific language governing permissions and limitations under the License.
 # Installation
-Install 🤗 Diffusers for whichever deep learning library you're working with.
+🤗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+, and Flax. Follow the installation instructions below for the deep learning library you are using:
-🤗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+ and Flax. Follow the installation instructions below for the deep learning library you are using:
+- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions
+- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions
-- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions.
-- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions.
 ## Install with pip
@@ -37,7 +35,7 @@ Activate the virtual environment:
 source .env/bin/activate
 ```
-🤗 Diffusers also relies on the 🤗 Transformers library, and you can install both with the following command:
+You should also install 🤗 Transformers because 🤗 Diffusers relies on its models:
 <frameworkcontent>
 <pt>
@@ -54,9 +52,7 @@ pip install diffusers["flax"] transformers
 ## Install from source
-Before installing 🤗 Diffusers from source, make sure you have `torch` and 🤗 Accelerate installed.
+Before installing 🤗 Diffusers from source, make sure you have PyTorch and 🤗 Accelerate installed.
-For `torch` installation, refer to the `torch` [installation](https://pytorch.org/get-started/locally/#start-locally) guide.
 To install 🤗 Accelerate:
@@ -64,7 +60,7 @@ To install 🤗 Accelerate:
 pip install accelerate
 ```
-Install 🤗 Diffusers from source with the following command:
+Then install 🤗 Diffusers from source:
 ```bash
 pip install git+https://github.com/huggingface/diffusers
@@ -75,7 +71,7 @@ The `main` version is useful for staying up-to-date with the latest developments
 For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet.
 However, this means the `main` version may not always be stable.
 We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day.
-If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose), so we can fix it even sooner!
+If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can fix it even sooner!
 ## Editable install
@@ -123,17 +119,29 @@ git pull
 Your Python environment will find the `main` version of 🤗 Diffusers on the next run.
-## Notice on telemetry logging
+## Cache
+Model weights and files are downloaded from the Hub to a cache, which is usually located in your home directory. You can change the cache location by setting the `HF_HOME` or `HUGGINGFACE_HUB_CACHE` environment variable, or by passing the `cache_dir` parameter to methods like [`~DiffusionPipeline.from_pretrained`].
+Cached files allow you to run 🤗 Diffusers offline. To prevent 🤗 Diffusers from connecting to the internet, set the `HF_HUB_OFFLINE` environment variable to `True` and 🤗 Diffusers will only load previously downloaded files in the cache.
+```shell
+export HF_HUB_OFFLINE=True
+```
+For more details about managing and cleaning the cache, take a look at the [caching](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) guide.
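
Note on the Cache section added above: the same variables can also be set from Python rather than the shell. The helper below is a minimal sketch using only the standard library; `configure_hub_environment` and the `/tmp/hf-cache` path are hypothetical names chosen for illustration, and the variable names and the `True` value are taken from the text above. It is an assumption here that the variables should be set before the library is imported, since libraries commonly read such settings at import time.

```python
import os


def configure_hub_environment(cache_dir=None, offline=False, env=None):
    """Set the cache and offline variables named in the Cache section.

    `env` defaults to the real process environment; passing a plain dict
    makes the helper easy to try out without side effects.
    """
    env = os.environ if env is None else env
    if cache_dir is not None:
        # Expand "~" so the stored value is an absolute path.
        env["HF_HOME"] = os.path.expanduser(cache_dir)
    if offline:
        # Same value as the `export HF_HUB_OFFLINE=True` shell example.
        env["HF_HUB_OFFLINE"] = "True"
    return env


# Try it on a scratch dict first; drop `env=` to modify the real environment.
settings = configure_hub_environment(cache_dir="/tmp/hf-cache", offline=True, env={})
print(settings)
```

Passing `cache_dir` to `from_pretrained` remains the narrower option when only a single call should use a different location.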
+## Telemetry logging
-Our library gathers telemetry information during `from_pretrained()` requests.
+Our library gathers telemetry information during [`~DiffusionPipeline.from_pretrained`] requests.
-This data includes the version of Diffusers and PyTorch/Flax, the requested model or pipeline class,
+The data gathered includes the version of 🤗 Diffusers and PyTorch/Flax, the requested model or pipeline class,
-and the path to a pre-trained checkpoint if it is hosted on the Hub.
+and the path to a pretrained checkpoint if it is hosted on the Hugging Face Hub.
 This usage data helps us debug issues and prioritize new features.
-Telemetry is only sent when loading models and pipelines from the HuggingFace Hub,
+Telemetry is only sent when loading models and pipelines from the Hub,
-and is not collected during local usage.
+and it is not collected if you're loading local files.
-We understand that not everyone wants to share additional information, and we respect your privacy,
+We understand that not everyone wants to share additional information, and we respect your privacy.
-so you can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal:
+You can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal:
 On Linux/MacOS:
 ```bash

--- a/docs/source/en/using-diffusers/contribute_pipeline.md
+++ b/docs/source/en/using-diffusers/contribute_pipeline.md
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
-# How to contribute a community pipeline
+# Contribute a community pipeline
 <Tip>

--- a/docs/source/en/using-diffusers/custom_pipeline_examples.md
+++ b/docs/source/en/using-diffusers/custom_pipeline_examples.md
@@ -14,273 +14,106 @@ specific language governing permissions and limitations under the License.
 [[open-in-colab]]
-> **For more information about community pipelines, please have a look at [this issue](https://github.com/huggingface/diffusers/issues/841).**
+<Tip>
-**Community** examples consist of both inference and training examples that have been added by the community.
+For more context about the design choices behind community pipelines, please have a look at [this issue](https://github.com/huggingface/diffusers/issues/841).
-Please have a look at the following table to get an overview of all community examples. Click on the **Code Example** to get a copy-and-paste ready code example that you can try out.
-If a community doesn't work as expected, please open an issue and ping the author on it.
-| Example                                | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | Code Example                                                      | Colab                                                                                                                                                                                                              |                                                     Author |
+</Tip>
-|:---------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------:|
-| CLIP Guided Stable Diffusion           | Doing CLIP guidance for text to image generation with Stable Diffusion                                                                                                                                                                                                                                                                                                                                                                                                                                   | [CLIP Guided Stable Diffusion](#clip-guided-stable-diffusion)     | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/CLIP_Guided_Stable_diffusion_with_diffusers.ipynb) |             [Suraj Patil](https://github.com/patil-suraj/) |
+Community pipelines allow you to get creative and build your own unique pipelines to share with the community. You can find all community pipelines in the [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) folder along with inference and training examples for how to use them. This guide showcases some of the community pipelines and hopefully it'll inspire you to create your own (feel free to open a PR with your own pipeline and we will merge it!).
-| One Step U-Net (Dummy)                 | Example showcasing of how to use Community Pipelines (see https://github.com/huggingface/diffusers/issues/841)                                                                                                                                                                                                                                                                                                                                                                                           | [One Step U-Net](#one-step-unet)                                  | -                                                                                                                                                                                                                  | [Patrick von Platen](https://github.com/patrickvonplaten/) |
-| Stable Diffusion Interpolation         | Interpolate the latent space of Stable Diffusion between different prompts/seeds                                                                                                                                                                                                                                                                                                                                                                                                                         | [Stable Diffusion Interpolation](#stable-diffusion-interpolation) | -                                                                                                                                                                                                                  |                    [Nate Raw](https://github.com/nateraw/) |
+To load a community pipeline, use the `custom_pipeline` argument in [`DiffusionPipeline`] to specify one of the files in [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community):
-| Stable Diffusion Mega                  | **One** Stable Diffusion Pipeline with all functionalities of [Text2Image](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py), [Image2Image](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py) and [Inpainting](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py) | [Stable Diffusion Mega](#stable-diffusion-mega)                   | -                                                                                                                                                                                                                  | [Patrick von Platen](https://github.com/patrickvonplaten/) |
-| Long Prompt Weighting Stable Diffusion | **One** Stable Diffusion Pipeline without tokens length limit, and support parsing weighting in prompt.                                                                                                                                                                                                                                                                                                                                                                                                  | [Long Prompt Weighting Stable Diffusion](#long-prompt-weighting-stable-diffusion)                                                                 | -                                                                                                                                                                                                                  |                        [SkyTNT](https://github.com/SkyTNT) |
-| Speech to Image                        | Using automatic-speech-recognition to transcribe text and Stable Diffusion to generate images                                                                                                                                                                                                                                                                                                                                                                                                            | [Speech to Image](#speech-to-image)                               | -                                                                                                                                                                                                                  | [Mikail Duzenli](https://github.com/MikailINTech)
-To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines, we will merge them quickly.
 ```py
 pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder", use_safetensors=True
 )
 ```
-## Example usages
+If a community pipeline doesn't work as expected, please open a GitHub issue and mention the author.
-### CLIP Guided Stable Diffusion
+You can learn more about community pipelines in the guides on how to [load community pipelines](custom_pipeline_overview) and how to [contribute a community pipeline](contribute_pipeline).
-CLIP guided stable diffusion can help to generate more realistic images
+## Multilingual Stable Diffusion
-by guiding stable diffusion at every denoising step with an additional CLIP model.
-The following code requires roughly 12GB of GPU RAM.
+The multilingual Stable Diffusion pipeline uses a pretrained [XLM-RoBERTa](https://huggingface.co/papluca/xlm-roberta-base-language-detection) model to identify the language of a prompt and the [mBART-large-50](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt) model to handle the translation. This allows you to generate images from text in 20 languages.
-```python
+```py
-from diffusers import DiffusionPipeline
+from PIL import Image
-from transformers import CLIPImageProcessor, CLIPModel
 import torch
-feature_extractor = CLIPImageProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
-clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16)
-guided_pipeline = DiffusionPipeline.from_pretrained(
-    "CompVis/stable-diffusion-v1-4",
-    custom_pipeline="clip_guided_stable_diffusion",
-    clip_model=clip_model,
-    feature_extractor=feature_extractor,
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-)
-guided_pipeline.enable_attention_slicing()
-guided_pipeline = guided_pipeline.to("cuda")
-prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece"
-generator = torch.Generator(device="cuda").manual_seed(0)
-images = []
-for i in range(4):
-    image = guided_pipeline(
-        prompt,
-        num_inference_steps=50,
-        guidance_scale=7.5,
-        clip_guidance_scale=100,
-        num_cutouts=4,
-        use_cutouts=False,
-        generator=generator,
-    ).images[0]
-    images.append(image)
-# save images locally
-for i, img in enumerate(images):
-    img.save(f"./clip_guided_sd/image_{i}.png")
-```
-The `images` list contains a list of PIL images that can be saved locally or displayed directly in a google colab.
-Generated images tend to be of higher qualtiy than natively using stable diffusion. E.g. the above script generates the following images:
-![clip_guidance](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/clip_guidance/merged_clip_guidance.jpg).
-### One Step Unet
-The dummy "one-step-unet" can be run as follows:
-```python
 from diffusers import DiffusionPipeline
+from diffusers.utils import make_image_grid
-pipe = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="one_step_unet")
+from transformers import (
-pipe()
+    pipeline,
-```
+    MBart50TokenizerFast,
+    MBartForConditionalGeneration,
-**Note**: This community pipeline is not useful as a feature, but rather just serves as an example of how community pipelines can be added (see https://github.com/huggingface/diffusers/issues/841).
-### Stable Diffusion Interpolation
-The following code can be run on a GPU of at least 8GB VRAM and should take approximately 5 minutes.
-```python
-from diffusers import DiffusionPipeline
-import torch
-pipe = DiffusionPipeline.from_pretrained(
-    "CompVis/stable-diffusion-v1-4",
-    torch_dtype=torch.float16,
-    safety_checker=None,  # Very important for videos...lots of false positives while interpolating
-    custom_pipeline="interpolate_stable_diffusion",
-    use_safetensors=True,
-).to("cuda")
-pipe.enable_attention_slicing()
-frame_filepaths = pipe.walk(
-    prompts=["a dog", "a cat", "a horse"],
-    seeds=[42, 1337, 1234],
-    num_interpolation_steps=16,
-    output_dir="./dreams",
-    batch_size=4,
-    height=512,
-    width=512,
-    guidance_scale=8.5,
-    num_inference_steps=50,
 )
-```
-The output of the `walk(...)` function returns a list of images saved under the folder as defined in `output_dir`. You can use these images to create videos of stable diffusion.
-> **Please have a look at https://github.com/nateraw/stable-diffusion-videos for more in-detail information on how to create videos using stable diffusion as well as more feature-complete functionality.**
-### Stable Diffusion Mega
-The Stable Diffusion Mega Pipeline lets you use the main use cases of the stable diffusion pipeline in a single class.
-```python
-#!/usr/bin/env python3
-from diffusers import DiffusionPipeline
-import PIL
-import requests
-from io import BytesIO
-import torch
+device = "cuda" if torch.cuda.is_available() else "cpu"
+device_dict = {"cuda": 0, "cpu": -1}
-def download_image(url):
+# add language detection pipeline
-    response = requests.get(url)
+language_detection_model_ckpt = "papluca/xlm-roberta-base-language-detection"
-    return PIL.Image.open(BytesIO(response.content)).convert("RGB")
+language_detection_pipeline = pipeline("text-classification",
+                                       model=language_detection_model_ckpt,
+                                       device=device_dict[device])
+# add model for language translation
+trans_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
+trans_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device)
-pipe = DiffusionPipeline.from_pretrained(
+diffuser_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
-    custom_pipeline="stable_diffusion_mega",
+    custom_pipeline="multilingual_stable_diffusion",
+    detection_pipeline=language_detection_pipeline,
+    translation_model=trans_model,
+    translation_tokenizer=trans_tokenizer,
    torch_dtype=torch.float16,
-    use_safetensors=True,
-)
-pipe.to("cuda")
-pipe.enable_attention_slicing()
-### Text-to-Image
-images = pipe.text2img("An astronaut riding a horse").images
-### Image-to-Image
-init_image = download_image(
-    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
 )
-prompt = "A fantasy landscape, trending on artstation"
+diffuser_pipeline.enable_attention_slicing()
+diffuser_pipeline = diffuser_pipeline.to(device)
-images = pipe.img2img(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images
-### Inpainting
-img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
-mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
-init_image = download_image(img_url).resize((512, 512))
-mask_image = download_image(mask_url).resize((512, 512))
-prompt = "a cat sitting on a bench"
-images = pipe.inpaint(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.75).images
-```
-As shown above this one pipeline can run all both "text-to-image", "image-to-image", and "inpainting" in one pipeline.
-### Long Prompt Weighting Stable Diffusion
-The Pipeline lets you input prompt without 77 token length limit. And you can increase words weighting by using "()" or decrease words weighting by using "[]"
-The Pipeline also lets you use the main use cases of the stable diffusion pipeline in a single class.
-#### pytorch
-```python
-from diffusers import DiffusionPipeline
-import torch
-pipe = DiffusionPipeline.from_pretrained(
-    "hakurei/waifu-diffusion", custom_pipeline="lpw_stable_diffusion", torch_dtype=torch.float16, use_safetensors=True
-)
-pipe = pipe.to("cuda")
-prompt = "best_quality (1girl:1.3) bow bride brown_hair closed_mouth frilled_bow frilled_hair_tubes frills (full_body:1.3) fox_ear hair_bow hair_tubes happy hood japanese_clothes kimono long_sleeves red_bow smile solo tabi uchikake white_kimono wide_sleeves cherry_blossoms"
-neg_prompt = "lowres, bad_anatomy, error_body, error_hair, error_arm, error_hands, bad_hands, error_fingers, bad_fingers, missing_fingers, error_legs, bad_legs, multiple_legs, missing_legs, error_lighting, error_shadow, error_reflection, text, error, extra_digit, fewer_digits, cropped, worst_quality, low_quality, normal_quality, jpeg_artifacts, signature, watermark, username, blurry"
-pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0]
-```
-#### onnxruntime
-```python
-from diffusers import DiffusionPipeline
-import torch
-pipe = DiffusionPipeline.from_pretrained(
-    "CompVis/stable-diffusion-v1-4",
-    custom_pipeline="lpw_stable_diffusion_onnx",
-    revision="onnx",
-    provider="CUDAExecutionProvider",
-    use_safetensors=True,
-)
-prompt = "a photo of an astronaut riding a horse on mars, best quality"
+prompt = ["a photograph of an astronaut riding a horse",
-neg_prompt = "lowres, bad anatomy, error body, error hair, error arm, error hands, bad hands, error fingers, bad fingers, missing fingers, error legs, bad legs, multiple legs, missing legs, error lighting, error shadow, error reflection, text, error, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry"
+          "Una casa en la playa",
+          "Ein Hund, der Orange isst",
+          "Un restaurant parisien"]
-pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0]
+images = diffuser_pipeline(prompt).images
+grid = make_image_grid(images, rows=2, cols=2)
+grid
 ```
-if you see `Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ) . Running this sequence through the model will result in indexing errors`. Do not worry, it is normal.
+<div class="flex justify-center">
+    <img src="https://user-images.githubusercontent.com/4313860/198328706-295824a4-9856-4ce5-8e66-278ceb42fd29.png"/>
-### Speech to Image
+</div>
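The detect-then-translate flow described for the multilingual pipeline can be sketched without downloading any models. This is a minimal illustration, assuming non-English prompts are translated to English before they reach Stable Diffusion. `to_english`, `fake_detect`, and `fake_translate` are hypothetical names, and the lookup tables are hand-written stand-ins for the XLM-RoBERTa classifier and the mBART model. It is not the community pipeline's actual code.

```python
from typing import Callable, Dict, List

def to_english(
    prompts: List[str],
    detect: Callable[[str], str],
    translate: Callable[[str, str], str],
) -> List[str]:
    """Return prompts with every non-English entry replaced by its translation."""
    english = []
    for prompt in prompts:
        lang = detect(prompt)
        # English prompts pass through untouched; others are translated first.
        english.append(prompt if lang == "en" else translate(prompt, lang))
    return english

# Hypothetical stand-ins for the language-detection and translation models.
FAKE_LANGUAGES: Dict[str, str] = {
    "Una casa en la playa": "es",
    "Ein Hund, der Orange isst": "de",
    "Un restaurant parisien": "fr",
}
FAKE_TRANSLATIONS: Dict[str, str] = {
    "Una casa en la playa": "A house on the beach",
    "Ein Hund, der Orange isst": "A dog eating an orange",
    "Un restaurant parisien": "A Parisian restaurant",
}

def fake_detect(prompt: str) -> str:
    # Anything not in the table is treated as English (assumption for the demo).
    return FAKE_LANGUAGES.get(prompt, "en")

def fake_translate(prompt: str, lang: str) -> str:
    return FAKE_TRANSLATIONS[prompt]

prompts = [
    "a photograph of an astronaut riding a horse",
    "Una casa en la playa",
    "Ein Hund, der Orange isst",
    "Un restaurant parisien",
]
english_prompts = to_english(prompts, fake_detect, fake_translate)
print(english_prompts)
```

In a real run, `detect` and `translate` would wrap the `language_detection_pipeline` and the `trans_model`/`trans_tokenizer` pair loaded in the example above.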
-The following code can generate an image from an audio sample using pre-trained OpenAI whisper-small and Stable Diffusion.
-```Python
-import torch
-import matplotlib.pyplot as plt
-from datasets import load_dataset
-from diffusers import DiffusionPipeline
-from transformers import (
-    WhisperForConditionalGeneration,
-    WhisperProcessor,
-)
-device = "cuda" if torch.cuda.is_available() else "cpu"
-ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
-audio_sample = ds[3]
+## MagicMix
-text = audio_sample["text"].lower()
+[MagicMix](https://huggingface.co/papers/2210.16056) is a pipeline that can mix an image and text prompt to generate a new image that preserves the image structure. The `mix_factor` determines how much influence the prompt has on the layout generation, `kmin` controls the number of steps during the content generation process, and `kmax` determines how much information is kept in the layout of the original image.
-speech_data = audio_sample["audio"]["array"]
-model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
+```py
-processor = WhisperProcessor.from_pretrained("openai/whisper-small")
+from diffusers import DiffusionPipeline, DDIMScheduler
+from diffusers.utils import load_image
-diffuser_pipeline = DiffusionPipeline.from_pretrained(
+pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
-    custom_pipeline="speech_to_image_diffusion",
+    custom_pipeline="magic_mix",
-    speech_model=model,
+    scheduler=DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"),
-    speech_processor=processor,
+).to('cuda')
-    torch_dtype=torch.float16,
-    use_safetensors=True,
-)
-diffuser_pipeline.enable_attention_slicing()
-diffuser_pipeline = diffuser_pipeline.to(device)
-output = diffuser_pipeline(speech_data)
+img = load_image("https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg")
-plt.imshow(output.images[0])
+mix_img = pipeline(img, prompt="bed", kmin=0.3, kmax=0.5, mix_factor=0.5)
+mix_img
 ```
-This example produces the following image:
-![image](https://user-images.githubusercontent.com/45072645/196901736-77d9c6fc-63ee-4072-90b0-dc8b903d63e3.png)
+<div class="flex gap-4">
\ No newline at end of file
+  <div>
+    <img class="rounded-xl" src="https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">image prompt</figcaption>
+  </div>
+  <div>
+    <img class="rounded-xl" src="https://user-images.githubusercontent.com/59410571/209578602-70f323fa-05b7-4dd6-b055-e40683e37914.jpg" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">image and text prompt mix</figcaption>
+  </div>
+</div>
\ No newline at end of file
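To make the roles of `kmin`, `kmax`, and `mix_factor` in MagicMix more concrete, the sketch below shows one plausible way the three values could partition a denoising schedule and blend two latents. It is an assumption-laden illustration based on the description above: the index arithmetic, the use of `round`, and the names `split_schedule` and `mix` are hypothetical and are not taken from the `magic_mix` pipeline source.

```python
from typing import List, Tuple

def split_schedule(steps: int, kmin: float, kmax: float) -> Tuple[range, range]:
    """Split `steps` denoising steps into a layout phase and a content phase.

    Assumption: larger k means noisier, so the layout phase starts at the
    step index corresponding to kmax and ends where kmin begins.
    """
    tmax = steps - round(kmax * steps)  # first step of the layout phase
    tmin = steps - round(kmin * steps)  # first step of the content phase
    return range(tmax, tmin), range(tmin, steps)

def mix(generated: List[float], noised_original: List[float], mix_factor: float) -> List[float]:
    """Linear blend: mix_factor weights the prompt-driven latent,
    the remainder weights the noised original image latent."""
    return [
        mix_factor * g + (1.0 - mix_factor) * o
        for g, o in zip(generated, noised_original)
    ]

layout_steps, content_steps = split_schedule(steps=50, kmin=0.3, kmax=0.5)
blended = mix([1.0, 0.0], [0.0, 1.0], mix_factor=0.5)
print(len(layout_steps), len(content_steps), blended)
```

Under these assumptions, a larger `kmax` starts from a noisier version of the input image (less of its layout survives), while a larger `kmin` leaves more steps for prompt-driven content generation.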
--- a/docs/source/en/using-diffusers/pipeline_overview.md
+++ b/docs/source/en/using-diffusers/pipeline_overview.md
@@ -14,4 +14,4 @@ specific language governing permissions and limitations under the License.
 A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components.
-This section introduces you to some of the more complex pipelines like Stable Diffusion XL, ControlNet, and DiffEdit, which require additional inputs. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to control randomness on your hardware when generating images, and how to create a community pipeline for a custom task like generating images from speech.
+This section demonstrates how to use specific pipelines such as Stable Diffusion XL, ControlNet, and DiffEdit. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to create reproducible pipelines, and how to use and contribute community pipelines.
\ No newline at end of file
--- a/docs/source/en/using-diffusers/textual_inversion_inference.md
+++ b/docs/source/en/using-diffusers/textual_inversion_inference.md
@@ -4,7 +4,7 @@
 The [`StableDiffusionPipeline`] supports textual inversion, a technique that enables a model like Stable Diffusion to learn a new concept from just a few sample images. This gives you more control over the generated images and allows you to tailor the model towards specific concepts. You can get started quickly with a collection of community created concepts in the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer).
-This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](./training/text_inversion) training guide.
+This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](../training/text_inversion) training guide.
 Log in to your Hugging Face account: