[SDXL and IP2P]: instruction pix2pix XL training and pipeline (#4079)

* Support instruction pix2pix sdxl
* [Community] Implementation of the IADB community pipeline (#3996)
* community pipeline: implementation of iadb
* iadb.py: reformat using black
* iadb.py: linting update
* add kandinsky to readme table (#4081)
  Co-authored-by: yiyixuxu <yixu310@gmail,com>
* [From Single File] Force accelerate to be installed (#4078)
  force accelerate to be installed
* Support instruction pix2pix sdxl
* Clean up IP2P SDXL code
* [IP2P and SDXL] clean up code
* [IP2P SDXL] Address code reviews
* [IP2P SDXL] Address code reviews, add docs, tests
* [IP2P SDXL] Address code reviews
* [IP2P SDXL] Add README_SDXL
* [IP2P SDXL] Address code reviews
* [IP2P SDXL] Fix the copy problems
* [IP2P SDXL] Add license
* [IP2P SDXL] Address code review for selecting VAE and others
* [IP2P SDXL] Update README_sdxl
* [IP2P SDXL] Update __init__
* [IP2P SDXL] Update dummy_torch_and_transformers_and_invisible_watermark_objects
* address patrick's comments and some additions to readmes.

---------

Co-authored-by: Harutatsu Akiyama <kf.zy.qin@gmail.com>
Co-authored-by: Thomas Chambon <36728882+tchambon@users.noreply.github.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: yiyixuxu <yixu310@gmail,com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Unverified Commit 428dbfec authored Jul 25, 2023 by Harutatsu Akiyama Committed by GitHub Jul 25, 2023
10 changed files
--- a/docs/source/en/training/instructpix2pix.mdx
+++ b/docs/source/en/training/instructpix2pix.mdx
@@ -208,4 +208,8 @@ speed and quality during performance:
 Particularly, `image_guidance_scale` and `guidance_scale` can have a profound impact
 on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).

-If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd). 
\ No newline at end of file
+If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd). 
+
+## Stable Diffusion XL
+
+We support fine-tuning of the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with the InstructPix2Pix training procedure via the `train_instruct_pix2pix_xl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/README_sdxl.md).
\ No newline at end of file
--- a/examples/instruct_pix2pix/README.md
+++ b/examples/instruct_pix2pix/README.md
@@ -186,4 +186,8 @@ speed and quality during performance:
 Particularly, `image_guidance_scale` and `guidance_scale` can have a profound impact
 on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).

-If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd). 
\ No newline at end of file
+If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd). 
+
+## Stable Diffusion XL
+
+We support fine-tuning of the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with the InstructPix2Pix training procedure via the `train_instruct_pix2pix_xl.py` script. Please refer to the docs [here](./README_sdxl.md).
\ No newline at end of file
--- a/examples/instruct_pix2pix/README_sdxl.md
+++ b/examples/instruct_pix2pix/README_sdxl.md
+# InstructPix2Pix SDXL training example
+
+***This is based on the original InstructPix2Pix training example.***
+
+[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (or SDXL) is the latest image generation model that is tailored towards more photorealistic outputs with more detailed imagery and composition compared to previous SD models. It leverages a three times larger UNet backbone. The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. 
+
+The `train_instruct_pix2pix_xl.py` script shows how to implement the training procedure and adapt it for Stable Diffusion XL.
+
+***Disclaimer: Even though `train_instruct_pix2pix_xl.py` implements the InstructPix2Pix
+training procedure while being faithful to the [original implementation](https://github.com/timothybrooks/instruct-pix2pix), we have only tested it on a [small-scale dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples). This can impact the end results. For better results, we recommend longer training runs with a larger dataset. [Here](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) you can find a large dataset for InstructPix2Pix training.***
+
+## Running locally with PyTorch
+
+### Installing the dependencies
+
+Refer to the original InstructPix2Pix training example for installing the dependencies.
+
+You will also need to request access to SDXL by filling out the [form](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9).
+
+### Toy example
+
+As mentioned before, we'll use a [small toy dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) for training. The dataset 
+is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper.
+
+Configure environment variables such as the dataset identifier and the Stable Diffusion
+checkpoint:
+
+```bash
+export MODEL_NAME="stabilityai/stable-diffusion-xl-base-0.9"
+export DATASET_ID="fusing/instructpix2pix-1000-samples"
+```
+
+Now, we can launch training:
+
+```bash
+python train_instruct_pix2pix_xl.py \
+    --pretrained_model_name_or_path=$MODEL_NAME \
+    --dataset_name=$DATASET_ID \
+    --enable_xformers_memory_efficient_attention \
+    --resolution=256 --random_flip \
+    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
+    --max_train_steps=15000 \
+    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
+    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
+    --conditioning_dropout_prob=0.05 \
+    --seed=42 
+```
+
+Additionally, we support performing validation inference to monitor training progress
+with Weights and Biases. You can enable this feature by passing `--report_to=wandb`:
+
+```bash
+python train_instruct_pix2pix_xl.py \
+    --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-0.9 \
+    --dataset_name=$DATASET_ID \
+    --use_ema \
+    --enable_xformers_memory_efficient_attention \
+    --resolution=512 --random_flip \
+    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
+    --max_train_steps=15000 \
+    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
+    --learning_rate=5e-05 --lr_warmup_steps=0 \
+    --conditioning_dropout_prob=0.05 \
+    --seed=42 \
+    --val_image_url_or_path="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
+    --validation_prompt="make it in japan" \
+    --report_to=wandb
+```
+
+We recommend this type of validation as it can be useful for model debugging. Note that you need `wandb` installed to use this. You can install `wandb` by running `pip install wandb`.
+
+[Here](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq), you can find an example training run that includes some validation samples and the training hyperparameters.
+
+***Note: In the original paper, the authors observed that even when the model is trained with an image resolution of 256x256, it generalizes well to bigger resolutions such as 512x512. This is likely because of the larger dataset they used during training.***
+
+## Training with multiple GPUs
+
+`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
+for running distributed training with `accelerate`. Here is an example command:
+
+```bash 
+accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix_xl.py \
+    --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-0.9 \
+    --dataset_name=$DATASET_ID \
+    --use_ema \
+    --enable_xformers_memory_efficient_attention \
+    --resolution=512 --random_flip \
+    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
+    --max_train_steps=15000 \
+    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
+    --learning_rate=5e-05 --lr_warmup_steps=0 \
+    --conditioning_dropout_prob=0.05 \
+    --seed=42 \
+    --val_image_url_or_path="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
+    --validation_prompt="make it in japan" \
+    --report_to=wandb
+```
+
+## Inference
+
+Once training is complete, we can perform inference:
+
+```python
+import PIL.Image
+import PIL.ImageOps
+import requests
+import torch
+from diffusers import StableDiffusionXLInstructPix2PixPipeline
+
+model_id = "your_model_id" # <- replace this 
+pipe = StableDiffusionXLInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+generator = torch.Generator("cuda").manual_seed(0)
+
+url = "https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg"
+
+
+def download_image(url):
+    image = PIL.Image.open(requests.get(url, stream=True).raw)
+    image = PIL.ImageOps.exif_transpose(image)
+    image = image.convert("RGB")
+    return image
+
+image = download_image(url)
+prompt = "make it Japan"
+num_inference_steps = 20
+image_guidance_scale = 1.5
+guidance_scale = 10
+
+edited_image = pipe(prompt, 
+    image=image, 
+    num_inference_steps=num_inference_steps, 
+    image_guidance_scale=image_guidance_scale, 
+    guidance_scale=guidance_scale,
+    generator=generator,
+).images[0]
+edited_image.save("edited_image.png")
+```
+
+We encourage you to play with the following three parameters to control
+speed and quality during inference:
+
+* `num_inference_steps`
+* `image_guidance_scale`
+* `guidance_scale`
+
+Particularly, `image_guidance_scale` and `guidance_scale` can have a profound impact
+on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).
+
+If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd). 
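
The interaction between `guidance_scale` and `image_guidance_scale` mentioned in the README above can be made concrete with the two-scale guidance rule described in the InstructPix2Pix paper. The following is a minimal, framework-free sketch that treats each noise prediction as a plain list of floats. `combine_guidance` is a hypothetical helper name used for illustration only; it is not a `diffusers` function, and the real pipeline works on tensors.

```python
def combine_guidance(eps_uncond, eps_image, eps_full, guidance_scale, image_guidance_scale):
    """Combine three noise predictions with separate text and image scales.

    eps_uncond: prediction with neither image nor text conditioning
    eps_image:  prediction with image conditioning only
    eps_full:   prediction with both image and text conditioning
    """
    return [
        u + image_guidance_scale * (i - u) + guidance_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Toy one-element "predictions" chosen to make the arithmetic easy to follow.
eps_uncond, eps_image, eps_full = [0.0], [1.0], [3.0]

# With both scales set to 1, the result is just the fully conditioned prediction.
neutral = combine_guidance(eps_uncond, eps_image, eps_full, 1.0, 1.0)

# Scales from the inference example: the text direction (f - i) is weighted 10x,
# the image direction (i - u) is weighted 1.5x.
edited = combine_guidance(eps_uncond, eps_image, eps_full, 10, 1.5)
print(neutral, edited)
```

Raising `guidance_scale` pushes the result along the text-instruction direction, while raising `image_guidance_scale` pulls it toward the input image, which is why the two values are usually tuned together.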
--- a/examples/instruct_pix2pix/train_instruct_pix2pix_xl.py
+++ b/examples/instruct_pix2pix/train_instruct_pix2pix_xl.py
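
The body of the training script is not shown in this view. As a rough illustration of what `--conditioning_dropout_prob=0.05` is meant to achieve, the sketch below follows the scheme described in the InstructPix2Pix paper: with probability `p` only the text is dropped, with probability `p` only the image is dropped, and with probability `p` both are dropped. The function name and the single-uniform-draw layout are assumptions for illustration, not a copy of the script.

```python
import random


def conditioning_dropout(u, p):
    """Decide which conditioning to drop for one training sample.

    u is a uniform draw in [0, 1); p is the dropout probability.
    Returns (drop_text, drop_image).
    """
    drop_text = u < 2 * p        # [0, 2p): prompt replaced by the null prompt
    drop_image = p <= u < 3 * p  # [p, 3p): image conditioning zeroed out
    return drop_text, drop_image


p = 0.05
rng = random.Random(42)
counts = {"text_only": 0, "both": 0, "image_only": 0, "none": 0}
for _ in range(100_000):
    drop_text, drop_image = conditioning_dropout(rng.random(), p)
    if drop_text and drop_image:
        counts["both"] += 1
    elif drop_text:
        counts["text_only"] += 1
    elif drop_image:
        counts["image_only"] += 1
    else:
        counts["none"] += 1
print(counts)
```

Each of the three dropout cases should account for roughly 5% of samples, leaving about 85% fully conditioned. Training on all four cases is what makes the two separate guidance scales usable at inference time.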
--- a/src/diffusers/__init__.py
+++ b/src/diffusers/__init__.py
@@ -206,6 +206,7 @@ else:
        StableDiffusionXLControlNetPipeline,
        StableDiffusionXLImg2ImgPipeline,
        StableDiffusionXLInpaintPipeline,
+        StableDiffusionXLInstructPix2PixPipeline,
        StableDiffusionXLPipeline,
    )


--- a/src/diffusers/pipelines/__init__.py
+++ b/src/diffusers/pipelines/__init__.py
@@ -125,6 +125,7 @@ else:
    from .stable_diffusion_xl import (
        StableDiffusionXLImg2ImgPipeline,
        StableDiffusionXLInpaintPipeline,
+        StableDiffusionXLInstructPix2PixPipeline,
        StableDiffusionXLPipeline,
    )


--- a/src/diffusers/pipelines/stable_diffusion_xl/__init__.py
+++ b/src/diffusers/pipelines/stable_diffusion_xl/__init__.py
@@ -25,3 +25,4 @@ if is_transformers_available() and is_torch_available() and is_invisible_waterma
    from .pipeline_stable_diffusion_xl import StableDiffusionXLPipeline
    from .pipeline_stable_diffusion_xl_img2img import StableDiffusionXLImg2ImgPipeline
    from .pipeline_stable_diffusion_xl_inpaint import StableDiffusionXLInpaintPipeline
+    from .pipeline_stable_diffusion_xl_instruct_pix2pix import StableDiffusionXLInstructPix2PixPipeline
--- a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_instruct_pix2pix.py
+++ b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_instruct_pix2pix.py
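
The pipeline body is not shown in this view, but the test components further down use `in_channels=8` for the UNet and `latent_channels=4` for the VAE. That is consistent with the usual InstructPix2Pix design, where the noisy latents and the encoded input-image latents are concatenated along the channel axis. The sketch below only does the shape arithmetic; the helper name and the VAE downscale factor of 8 are assumptions for illustration.

```python
def unet_input_shape(batch_size, height, width, latent_channels=4, vae_scale_factor=8):
    """Shape of the UNet input when image latents are concatenated channel-wise."""
    latent_h, latent_w = height // vae_scale_factor, width // vae_scale_factor
    # Noisy latents and encoded input-image latents each contribute `latent_channels`.
    return (batch_size, 2 * latent_channels, latent_h, latent_w)


# Batch size and resolution taken from the first training command in the README.
shape = unet_input_shape(batch_size=4, height=256, width=256)
print(shape)
```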
--- a/src/diffusers/utils/dummy_torch_and_transformers_and_invisible_watermark_objects.py
+++ b/src/diffusers/utils/dummy_torch_and_transformers_and_invisible_watermark_objects.py
@@ -47,6 +47,21 @@ class StableDiffusionXLInpaintPipeline(metaclass=DummyObject):
        requires_backends(cls, ["torch", "transformers", "invisible_watermark"])


+class StableDiffusionXLInstructPix2PixPipeline(metaclass=DummyObject):
+    _backends = ["torch", "transformers", "invisible_watermark"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch", "transformers", "invisible_watermark"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers", "invisible_watermark"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers", "invisible_watermark"])
+
+
 class StableDiffusionXLPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers", "invisible_watermark"]


--- a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_instruction_pix2pix.py
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_instruction_pix2pix.py
+# coding=utf-8
+# Copyright 2023 Harutatsu Akiyama and HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
+
+from diffusers import (
+    AutoencoderKL,
+    EulerDiscreteScheduler,
+    UNet2DConditionModel,
+)
+from diffusers.image_processor import VaeImageProcessor
+from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl_instruct_pix2pix import (
+    StableDiffusionXLInstructPix2PixPipeline,
+)
+from diffusers.utils import floats_tensor, torch_device
+from diffusers.utils.testing_utils import enable_full_determinism
+
+from ..pipeline_params import (
+    IMAGE_TO_IMAGE_IMAGE_PARAMS,
+    TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS,
+    TEXT_GUIDED_IMAGE_VARIATION_PARAMS,
+)
+from ..test_pipelines_common import PipelineKarrasSchedulerTesterMixin, PipelineLatentTesterMixin, PipelineTesterMixin
+
+
+enable_full_determinism()
+
+
+class StableDiffusionXLInstructPix2PixPipelineFastTests(
+    PipelineLatentTesterMixin, PipelineKarrasSchedulerTesterMixin, PipelineTesterMixin, unittest.TestCase
+):
+    pipeline_class = StableDiffusionXLInstructPix2PixPipeline
+    params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS - {"height", "width", "cross_attention_kwargs"}
+    batch_params = TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS
+    image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+    image_latents_params = IMAGE_TO_IMAGE_IMAGE_PARAMS
+
+    def get_dummy_components(self):
+        torch.manual_seed(0)
+        unet = UNet2DConditionModel(
+            block_out_channels=(32, 64),
+            layers_per_block=2,
+            sample_size=32,
+            in_channels=8,
+            out_channels=4,
+            down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+            up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+            # SDXL-specific config below
+            attention_head_dim=(2, 4),
+            use_linear_projection=True,
+            addition_embed_type="text_time",
+            addition_time_embed_dim=8,
+            transformer_layers_per_block=(1, 2),
+            projection_class_embeddings_input_dim=80,  # 6 * 8 + 32
+            cross_attention_dim=64,
+        )
+
+        scheduler = EulerDiscreteScheduler(
+            beta_start=0.00085,
+            beta_end=0.012,
+            steps_offset=1,
+            beta_schedule="scaled_linear",
+            timestep_spacing="leading",
+        )
+        torch.manual_seed(0)
+        vae = AutoencoderKL(
+            block_out_channels=[32, 64],
+            in_channels=3,
+            out_channels=3,
+            down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+            up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+            latent_channels=4,
+            sample_size=128,
+        )
+        torch.manual_seed(0)
+        text_encoder_config = CLIPTextConfig(
+            bos_token_id=0,
+            eos_token_id=2,
+            hidden_size=32,
+            intermediate_size=37,
+            layer_norm_eps=1e-05,
+            num_attention_heads=4,
+            num_hidden_layers=5,
+            pad_token_id=1,
+            vocab_size=1000,
+            # SD2-specific config below
+            hidden_act="gelu",
+            projection_dim=32,
+        )
+        text_encoder = CLIPTextModel(text_encoder_config)
+        tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip", local_files_only=True)
+
+        text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
+        tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip", local_files_only=True)
+
+        components = {
+            "unet": unet,
+            "scheduler": scheduler,
+            "vae": vae,
+            "text_encoder": text_encoder,
+            "tokenizer": tokenizer,
+            "text_encoder_2": text_encoder_2,
+            "tokenizer_2": tokenizer_2,
+            # "safety_checker": None,
+            # "feature_extractor": None,
+        }
+        return components
+
+    def get_dummy_inputs(self, device, seed=0):
+        image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+        image = image / 2 + 0.5
+        if str(device).startswith("mps"):
+            generator = torch.manual_seed(seed)
+        else:
+            generator = torch.Generator(device=device).manual_seed(seed)
+        inputs = {
+            "prompt": "A painting of a squirrel eating a burger",
+            "image": image,
+            "generator": generator,
+            "num_inference_steps": 2,
+            "guidance_scale": 6.0,
+            "image_guidance_scale": 1,
+            "output_type": "numpy",
+        }
+        return inputs
+
+    def test_inference_batch_single_identical(self):
+        super().test_inference_batch_single_identical(expected_max_diff=3e-3)
+
+    def test_attention_slicing_forward_pass(self):
+        super().test_attention_slicing_forward_pass(expected_max_diff=2e-3)
+
+    # Override the default test_latents_input because pix2pix encodes the image differently
+    def test_latents_input(self):
+        components = self.get_dummy_components()
+        pipe = StableDiffusionXLInstructPix2PixPipeline(**components)
+        pipe.image_processor = VaeImageProcessor(do_resize=False, do_normalize=False)
+        pipe = pipe.to(torch_device)
+        pipe.set_progress_bar_config(disable=None)
+
+        out = pipe(**self.get_dummy_inputs_by_type(torch_device, input_image_type="pt"))[0]
+
+        vae = components["vae"]
+        inputs = self.get_dummy_inputs_by_type(torch_device, input_image_type="pt")
+
+        for image_param in self.image_latents_params:
+            if image_param in inputs.keys():
+                inputs[image_param] = vae.encode(inputs[image_param]).latent_dist.mode()
+
+        out_latents_inputs = pipe(**inputs)[0]
+
+        max_diff = np.abs(out - out_latents_inputs).max()
+        self.assertLess(max_diff, 1e-4, "passing latents as image input gives a different result from passing an image")
+
+    def test_cfg(self):
+        pass
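
The `# 6 * 8 + 32` comment next to `projection_class_embeddings_input_dim=80` in `get_dummy_components` can be unpacked as a small worked calculation. The sketch assumes the usual SDXL layout, in which six size and crop conditioning values are each embedded to `addition_time_embed_dim` and concatenated with the pooled output of the second text encoder. The helper name is hypothetical.

```python
def added_cond_input_dim(num_time_ids, addition_time_embed_dim, text_projection_dim):
    """Width of the concatenated time-id embeddings and pooled text embedding."""
    return num_time_ids * addition_time_embed_dim + text_projection_dim


# Values from get_dummy_components: 6 time ids, addition_time_embed_dim=8, projection_dim=32.
test_dim = added_cond_input_dim(6, 8, 32)
print(test_dim)
```

If `addition_time_embed_dim` or the text encoder's `projection_dim` is changed in the dummy components, `projection_class_embeddings_input_dim` has to be updated to match.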