Ldm3d first PR (#3668)

* added ldm3d pipeline and updated image processor to support depth * added description * added paper reference * added docs * fixed bug * added test * Update tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * added reference in indexmdx * reverted changes tto image processor' * added LDM3DOutput * Fixes with make style * fix failing tests for make fix-copies * aligned with our version * Update pipeline_stable_diffusion_ldm3d.py updated the guidance scale * Fix for failing check_code_quality test * Code review feedback * Fix typo in ldm3d_diffusion.mdx * updated the doc accordnlgy * copyrights * fixed test failure * make style * added image processor of LDM3D in the documentation: * added ldm3d doc to toctree * run make style && make quality * run make fix-copies * Update docs/source/en/api/image_processor.mdx Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * Update docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * Update docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * updated the safety checker to accept tuple * make style and make quality * Update src/diffusers/pipelines/stable_diffusion/__init__.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * LDM3D output * up --------- Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Aflalo <estellea@isl-gpu27.rr.intel.com> Co-authored-by: Anahita Bhiwandiwalla <anahita.bhiwandiwalla@intel.com> Co-authored-by: Aflalo <estellea@isl-gpu26.rr.intel.com> Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> Co-authored-by: Aflalo <estellea@isl-gpu42.rr.intel.com> Co-authored-by: Aflalo <estellea@isl-gpu43.rr.intel.com>

Ldm3d first PR (#3668)
* added ldm3d pipeline and updated image processor to support depth * added description * added paper reference * added docs * fixed bug * added test * Update tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * added reference in indexmdx * reverted changes tto image processor' * added LDM3DOutput * Fixes with make style * fix failing tests for make fix-copies * aligned with our version * Update pipeline_stable_diffusion_ldm3d.py updated the guidance scale * Fix for failing check_code_quality test * Code review feedback * Fix typo in ldm3d_diffusion.mdx * updated the doc accordnlgy * copyrights * fixed test failure * make style * added image processor of LDM3D in the documentation: * added ldm3d doc to toctree * run make style && make quality * run make fix-copies * Update docs/source/en/api/image_processor.mdx Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * Update docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * Update docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * updated the safety checker to accept tuple * make style and make quality * Update src/diffusers/pipelines/stable_diffusion/__init__.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * LDM3D output * up --------- Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Aflalo <estellea@isl-gpu27.rr.intel.com> Co-authored-by: Anahita Bhiwandiwalla <anahita.bhiwandiwalla@intel.com> Co-authored-by: Aflalo <estellea@isl-gpu26.rr.intel.com> Co-authored-by: Aflalo <estellea@isl-iam1.rr.intel.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> Co-authored-by: Aflalo <estellea@isl-gpu42.rr.intel.com> Co-authored-by: Aflalo <estellea@isl-gpu43.rr.intel.com>
958d9ec7 · estelleafl · GitHub · 77f9137f · 958d9ec7 · 958d9ec7
Unverified Commit 958d9ec7 authored Jun 15, 2023 by estelleafl Committed by GitHub Jun 15, 2023
11 changed files
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -221,6 +221,8 @@
        title: Stable-Diffusion-Latent-Upscaler
      - local: api/pipelines/stable_diffusion/upscale
        title: Super-Resolution
+      - local: api/pipelines/stable_diffusion/ldm3d_diffusion
+        title: LDM3D Text-to-(RGB, Depth)
      title: Stable Diffusion
    - local: api/pipelines/stable_unclip
      title: Stable unCLIP

--- a/docs/source/en/api/image_processor.mdx
+++ b/docs/source/en/api/image_processor.mdx
@@ -17,6 +17,17 @@ Image processor provides a unified API for Stable Diffusion pipelines to prepare
 All pipelines with VAE image processor will accept image inputs in the format of PIL Image, PyTorch tensor, or Numpy array, and will able to return outputs in the format of PIL Image, Pytorch tensor, and Numpy array based on the `output_type` argument from the user. Additionally, the User can pass encoded image latents directly to the pipeline, or ask the pipeline to return latents as output with `output_type = 'pt'` argument. This allows you to take the generated latents from one pipeline and pass it to another pipeline as input, without ever having to leave the latent space. It also makes it much easier to use multiple pipelines together, by passing PyTorch tensors directly between different pipelines. 


+# Image Processor for VAE adapted to LDM3D
+
+LDM3D Image processor does the same as the Image processor for VAE but accepts both RGB and depth inputs and will return RGB and depth outputs. 
+
+
+
 ## VaeImageProcessor

-[[autodoc]] image_processor.VaeImageProcessor
\ No newline at end of file
+[[autodoc]] image_processor.VaeImageProcessor
+
+
+## VaeImageProcessorLDM3D
+
+[[autodoc]] image_processor.VaeImageProcessorLDM3D
\ No newline at end of file
--- a/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx
+++ b/docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.mdx
+<!--Copyright 2023 The Intel Labs Team Authors and HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# LDM3D
+
+LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal
+The abstract of the paper is the following:
+
+*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).*
+
+
+*Overview*:
+
+| Pipeline | Tasks | Colab | Demo
+|---|---|:---:|:---:|
+| [pipeline_stable_diffusion_ldm3d.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py) | *Text-to-Image Generation* | - | -
+
+## Tips
+
+- LDM3D generates both an image and a depth map from a given text prompt, compared to the existing txt-to-img diffusion models such as [Stable Diffusion](./stable_diffusion/overview) that generates only an image.
+- With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps. 
+
+
+Running LDM3D is straighforward with the [`StableDiffusionLDM3DPipeline`]:
+
+```python
+>>> from diffusers import StableDiffusionLDM3DPipeline
+
+>>> pipe_ldm3d = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d")
+prompt ="A picture of some lemons on a table"
+output = pipe_ldm3d(prompt)
+rgb_image, depth_image = output.rgb, output.depth
+rgb_image[0].save("lemons_ldm3d_rgb.jpg")
+depth_image[0].save("lemons_ldm3d_depth.png")
+```
+
+
+## StableDiffusionPipelineOutput
+[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput
+	- all
+	- __call__
+
+## StableDiffusionLDM3DPipeline
+[[autodoc]] StableDiffusionLDM3DPipeline
+	- all
+	- __call__
--- a/docs/source/en/index.mdx
+++ b/docs/source/en/index.mdx
@@ -94,3 +94,4 @@ The library has three main components:
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
 | [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
+| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation |
--- a/src/diffusers/__init__.py
+++ b/src/diffusers/__init__.py
@@ -149,6 +149,7 @@ else:
        StableDiffusionInpaintPipelineLegacy,
        StableDiffusionInstructPix2PixPipeline,
        StableDiffusionLatentUpscalePipeline,
+        StableDiffusionLDM3DPipeline,
        StableDiffusionModelEditingPipeline,
        StableDiffusionPanoramaPipeline,
        StableDiffusionPipeline,

--- a/src/diffusers/image_processor.py
+++ b/src/diffusers/image_processor.py
@@ -251,3 +251,109 @@ class VaeImageProcessor(ConfigMixin):

        if output_type == "pil":
            return self.numpy_to_pil(image)
+
+
+class VaeImageProcessorLDM3D(VaeImageProcessor):
+    """
+    Image Processor for VAE LDM3D.
+
+    Args:
+        do_resize (`bool`, *optional*, defaults to `True`):
+            Whether to downscale the image's (height, width) dimensions to multiples of `vae_scale_factor`.
+        vae_scale_factor (`int`, *optional*, defaults to `8`):
+            VAE scale factor. If `do_resize` is True, the image will be automatically resized to multiples of this
+            factor.
+        resample (`str`, *optional*, defaults to `lanczos`):
+            Resampling filter to use when resizing the image.
+        do_normalize (`bool`, *optional*, defaults to `True`):
+            Whether to normalize the image to [-1,1]
+    """
+
+    config_name = CONFIG_NAME
+
+    @register_to_config
+    def __init__(
+        self,
+        do_resize: bool = True,
+        vae_scale_factor: int = 8,
+        resample: str = "lanczos",
+        do_normalize: bool = True,
+    ):
+        super().__init__()
+
+    @staticmethod
+    def numpy_to_pil(images):
+        """
+        Convert a numpy image or a batch of images to a PIL image.
+        """
+        if images.ndim == 3:
+            images = images[None, ...]
+        images = (images * 255).round().astype("uint8")
+        if images.shape[-1] == 1:
+            # special case for grayscale (single channel) images
+            pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
+        else:
+            pil_images = [Image.fromarray(image[:, :, :3]) for image in images]
+
+        return pil_images
+
+    @staticmethod
+    def rgblike_to_depthmap(image):
+        """
+        Args:
+            image: RGB-like depth image
+
+        Returns: depth map
+
+        """
+        return image[:, :, 1] * 2**8 + image[:, :, 2]
+
+    def numpy_to_depth(self, images):
+        """
+        Convert a numpy depth image or a batch of images to a PIL image.
+        """
+        if images.ndim == 3:
+            images = images[None, ...]
+        images = (images * 255).round().astype("uint8")
+        if images.shape[-1] == 1:
+            # special case for grayscale (single channel) images
+            raise Exception("Not supported")
+        else:
+            pil_images = [Image.fromarray(self.rgblike_to_depthmap(image[:, :, 3:]), mode="I;16") for image in images]
+
+        return pil_images
+
+    def postprocess(
+        self,
+        image: torch.FloatTensor,
+        output_type: str = "pil",
+        do_denormalize: Optional[List[bool]] = None,
+    ):
+        if not isinstance(image, torch.Tensor):
+            raise ValueError(
+                f"Input for postprocessing is in incorrect format: {type(image)}. We only support pytorch tensor"
+            )
+        if output_type not in ["latent", "pt", "np", "pil"]:
+            deprecation_message = (
+                f"the output_type {output_type} is outdated and has been set to `np`. Please make sure to set it to one of these instead: "
+                "`pil`, `np`, `pt`, `latent`"
+            )
+            deprecate("Unsupported output_type", "1.0.0", deprecation_message, standard_warn=False)
+            output_type = "np"
+
+        if do_denormalize is None:
+            do_denormalize = [self.config.do_normalize] * image.shape[0]
+
+        image = torch.stack(
+            [self.denormalize(image[i]) if do_denormalize[i] else image[i] for i in range(image.shape[0])]
+        )
+
+        image = self.pt_to_numpy(image)
+
+        if output_type == "np":
+            return image[:, :, :, :3], np.stack([self.rgblike_to_depthmap(im[:, :, 3:]) for im in image], axis=0)
+
+        if output_type == "pil":
+            return self.numpy_to_pil(image), self.numpy_to_depth(image)
+        else:
+            raise Exception(f"This type {output_type} is not supported")
--- a/src/diffusers/pipelines/__init__.py
+++ b/src/diffusers/pipelines/__init__.py
@@ -77,6 +77,7 @@ else:
        StableDiffusionInpaintPipelineLegacy,
        StableDiffusionInstructPix2PixPipeline,
        StableDiffusionLatentUpscalePipeline,
+        StableDiffusionLDM3DPipeline,
        StableDiffusionModelEditingPipeline,
        StableDiffusionPanoramaPipeline,
        StableDiffusionPipeline,

--- a/src/diffusers/pipelines/stable_diffusion/__init__.py
+++ b/src/diffusers/pipelines/stable_diffusion/__init__.py
@@ -50,6 +50,7 @@ else:
    from .pipeline_stable_diffusion_inpaint_legacy import StableDiffusionInpaintPipelineLegacy
    from .pipeline_stable_diffusion_instruct_pix2pix import StableDiffusionInstructPix2PixPipeline
    from .pipeline_stable_diffusion_latent_upscale import StableDiffusionLatentUpscalePipeline
+    from .pipeline_stable_diffusion_ldm3d import StableDiffusionLDM3DPipeline
    from .pipeline_stable_diffusion_model_editing import StableDiffusionModelEditingPipeline
    from .pipeline_stable_diffusion_panorama import StableDiffusionPanoramaPipeline
    from .pipeline_stable_diffusion_sag import StableDiffusionSAGPipeline

--- a/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py
+++ b/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_ldm3d.py
--- a/src/diffusers/utils/dummy_torch_and_transformers_objects.py
+++ b/src/diffusers/utils/dummy_torch_and_transformers_objects.py
@@ -452,6 +452,21 @@ class StableDiffusionLatentUpscalePipeline(metaclass=DummyObject):
        requires_backends(cls, ["torch", "transformers"])


+class StableDiffusionLDM3DPipeline(metaclass=DummyObject):
+    _backends = ["torch", "transformers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch", "transformers"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+
 class StableDiffusionModelEditingPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]


--- a/tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py
+++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_ldm3d.py
+# coding=utf-8
+# Copyright 2023 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import gc
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+from diffusers import (
+    AutoencoderKL,
+    DDIMScheduler,
+    PNDMScheduler,
+    StableDiffusionLDM3DPipeline,
+    UNet2DConditionModel,
+)
+from diffusers.utils import nightly, slow, torch_device
+from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu
+
+from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
+
+
+enable_full_determinism()
+
+
+class StableDiffusionLDM3DPipelineFastTests(unittest.TestCase):
+    pipeline_class = StableDiffusionLDM3DPipeline
+    params = TEXT_TO_IMAGE_PARAMS
+    batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+    image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+
+    def get_dummy_components(self):
+        torch.manual_seed(0)
+        unet = UNet2DConditionModel(
+            block_out_channels=(32, 64),
+            layers_per_block=2,
+            sample_size=32,
+            in_channels=4,
+            out_channels=4,
+            down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+            up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+            cross_attention_dim=32,
+        )
+        scheduler = DDIMScheduler(
+            beta_start=0.00085,
+            beta_end=0.012,
+            beta_schedule="scaled_linear",
+            clip_sample=False,
+            set_alpha_to_one=False,
+        )
+        torch.manual_seed(0)
+        vae = AutoencoderKL(
+            block_out_channels=[32, 64],
+            in_channels=6,
+            out_channels=6,
+            down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+            up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+            latent_channels=4,
+        )
+        torch.manual_seed(0)
+        text_encoder_config = CLIPTextConfig(
+            bos_token_id=0,
+            eos_token_id=2,
+            hidden_size=32,
+            intermediate_size=37,
+            layer_norm_eps=1e-05,
+            num_attention_heads=4,
+            num_hidden_layers=5,
+            pad_token_id=1,
+            vocab_size=1000,
+        )
+        text_encoder = CLIPTextModel(text_encoder_config)
+        tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+        components = {
+            "unet": unet,
+            "scheduler": scheduler,
+            "vae": vae,
+            "text_encoder": text_encoder,
+            "tokenizer": tokenizer,
+            "safety_checker": None,
+            "feature_extractor": None,
+        }
+        return components
+
+    def get_dummy_inputs(self, device, seed=0):
+        if str(device).startswith("mps"):
+            generator = torch.manual_seed(seed)
+        else:
+            generator = torch.Generator(device=device).manual_seed(seed)
+        inputs = {
+            "prompt": "A painting of a squirrel eating a burger",
+            "generator": generator,
+            "num_inference_steps": 2,
+            "guidance_scale": 6.0,
+            "output_type": "numpy",
+        }
+        return inputs
+
+    def test_stable_diffusion_ddim(self):
+        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
+
+        components = self.get_dummy_components()
+        ldm3d_pipe = StableDiffusionLDM3DPipeline(**components)
+        ldm3d_pipe = ldm3d_pipe.to(torch_device)
+        ldm3d_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_dummy_inputs(device)
+        output = ldm3d_pipe(**inputs)
+        rgb, depth = output.rgb, output.depth
+
+        image_slice_rgb = rgb[0, -3:, -3:, -1]
+        image_slice_depth = depth[0, -3:, -1]
+
+        assert rgb.shape == (1, 64, 64, 3)
+        assert depth.shape == (1, 64, 64)
+
+        expected_slice_rgb = np.array(
+            [0.37301102, 0.7023895, 0.7418312, 0.5163375, 0.5825485, 0.60929704, 0.4188174, 0.48407027, 0.46555096]
+        )
+        expected_slice_depth = np.array([103.4673, 85.81202, 87.84926])
+
+        assert np.abs(image_slice_rgb.flatten() - expected_slice_rgb).max() < 1e-2
+        assert np.abs(image_slice_depth.flatten() - expected_slice_depth).max() < 1e-2
+
+    def test_stable_diffusion_prompt_embeds(self):
+        components = self.get_dummy_components()
+        ldm3d_pipe = StableDiffusionLDM3DPipeline(**components)
+        ldm3d_pipe = ldm3d_pipe.to(torch_device)
+        ldm3d_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_dummy_inputs(torch_device)
+        inputs["prompt"] = 3 * [inputs["prompt"]]
+
+        # forward
+        output = ldm3d_pipe(**inputs)
+        rgb_slice_1, depth_slice_1 = output.rgb, output.depth
+        rgb_slice_1 = rgb_slice_1[0, -3:, -3:, -1]
+        depth_slice_1 = depth_slice_1[0, -3:, -1]
+
+        inputs = self.get_dummy_inputs(torch_device)
+        prompt = 3 * [inputs.pop("prompt")]
+
+        text_inputs = ldm3d_pipe.tokenizer(
+            prompt,
+            padding="max_length",
+            max_length=ldm3d_pipe.tokenizer.model_max_length,
+            truncation=True,
+            return_tensors="pt",
+        )
+        text_inputs = text_inputs["input_ids"].to(torch_device)
+
+        prompt_embeds = ldm3d_pipe.text_encoder(text_inputs)[0]
+
+        inputs["prompt_embeds"] = prompt_embeds
+
+        # forward
+        output = ldm3d_pipe(**inputs)
+        rgb_slice_2, depth_slice_2 = output.rgb, output.depth
+        rgb_slice_2 = rgb_slice_2[0, -3:, -3:, -1]
+        depth_slice_2 = depth_slice_2[0, -3:, -1]
+
+        assert np.abs(rgb_slice_1.flatten() - rgb_slice_2.flatten()).max() < 1e-4
+        assert np.abs(depth_slice_1.flatten() - depth_slice_2.flatten()).max() < 1e-4
+
+    def test_stable_diffusion_negative_prompt(self):
+        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
+        components = self.get_dummy_components()
+        components["scheduler"] = PNDMScheduler(skip_prk_steps=True)
+        ldm3d_pipe = StableDiffusionLDM3DPipeline(**components)
+        ldm3d_pipe = ldm3d_pipe.to(device)
+        ldm3d_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_dummy_inputs(device)
+        negative_prompt = "french fries"
+        output = ldm3d_pipe(**inputs, negative_prompt=negative_prompt)
+
+        rgb, depth = output.rgb, output.depth
+        rgb_slice = rgb[0, -3:, -3:, -1]
+        depth_slice = depth[0, -3:, -1]
+
+        assert rgb.shape == (1, 64, 64, 3)
+        assert depth.shape == (1, 64, 64)
+
+        expected_slice_rgb = np.array(
+            [0.37044, 0.71811503, 0.7223251, 0.48603675, 0.5638391, 0.6364948, 0.42833704, 0.4901315, 0.47926217]
+        )
+        expected_slice_depth = np.array([107.84738, 84.62802, 89.962135])
+        assert np.abs(rgb_slice.flatten() - expected_slice_rgb).max() < 1e-2
+        assert np.abs(depth_slice.flatten() - expected_slice_depth).max() < 1e-2
+
+
+@slow
+@require_torch_gpu
+class StableDiffusionLDM3DPipelineSlowTests(unittest.TestCase):
+    def tearDown(self):
+        super().tearDown()
+        gc.collect()
+        torch.cuda.empty_cache()
+
+    def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+        generator = torch.Generator(device=generator_device).manual_seed(seed)
+        latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+        latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+        inputs = {
+            "prompt": "a photograph of an astronaut riding a horse",
+            "latents": latents,
+            "generator": generator,
+            "num_inference_steps": 3,
+            "guidance_scale": 7.5,
+            "output_type": "numpy",
+        }
+        return inputs
+
+    def test_ldm3d_stable_diffusion(self):
+        ldm3d_pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d")
+        ldm3d_pipe = ldm3d_pipe.to(torch_device)
+        ldm3d_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_inputs(torch_device)
+        output = ldm3d_pipe(**inputs)
+        rgb, depth = output.rgb, output.depth
+        rgb_slice = rgb[0, -3:, -3:, -1].flatten()
+        depth_slice = rgb[0, -3:, -1].flatten()
+
+        assert rgb.shape == (1, 512, 512, 3)
+        assert depth.shape == (1, 512, 512)
+
+        expected_slice_rgb = np.array(
+            [0.53805465, 0.56707305, 0.5486515, 0.57012236, 0.5814511, 0.56253487, 0.54843014, 0.55092263, 0.6459706]
+        )
+        expected_slice_depth = np.array(
+            [0.9263781, 0.6678672, 0.5486515, 0.92202145, 0.67831135, 0.56253487, 0.9241694, 0.7551478, 0.6459706]
+        )
+        assert np.abs(rgb_slice - expected_slice_rgb).max() < 3e-3
+        assert np.abs(depth_slice - expected_slice_depth).max() < 3e-3
+
+
+@nightly
+@require_torch_gpu
+class StableDiffusionPipelineNightlyTests(unittest.TestCase):
+    def tearDown(self):
+        super().tearDown()
+        gc.collect()
+        torch.cuda.empty_cache()
+
+    def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0):
+        generator = torch.Generator(device=generator_device).manual_seed(seed)
+        latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64))
+        latents = torch.from_numpy(latents).to(device=device, dtype=dtype)
+        inputs = {
+            "prompt": "a photograph of an astronaut riding a horse",
+            "latents": latents,
+            "generator": generator,
+            "num_inference_steps": 50,
+            "guidance_scale": 7.5,
+            "output_type": "numpy",
+        }
+        return inputs
+
+    def test_ldm3d(self):
+        ldm3d_pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d").to(torch_device)
+        ldm3d_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_inputs(torch_device)
+        output = ldm3d_pipe(**inputs)
+        rgb, depth = output.rgb, output.depth
+
+        expected_rgb_mean = 0.54461557
+        expected_rgb_std = 0.2806707
+        expected_depth_mean = 143.64595
+        expected_depth_std = 83.491776
+        assert np.abs(expected_rgb_mean - rgb.mean()) < 1e-3
+        assert np.abs(expected_rgb_std - rgb.std()) < 1e-3
+        assert np.abs(expected_depth_mean - depth.mean()) < 1e-3
+        assert np.abs(expected_depth_std - depth.std()) < 1e-3