Unverified Commit fd3d5502 authored by Sayak Paul, committed by GitHub

[Pipelines] Adds pix2pix zero (#2334)

* add: support for BLIP generation.

* add: support for editing synthetic images.

* remove unnecessary comments.

* add inits and run make fix-copies.

* version change of diffusers.

* fix: condition for loading the captioner.

* default conditions_input_image to False.

* guidance_amount -> cross_attention_guidance_amount

* fix inputs to check_inputs()

* fix: attribute.

* fix: prepare_attention_mask() call.

* debugging.

* better placement of references.

* remove torch.no_grad() decorations.

* put torch.no_grad() context before the first denoising loop.

* detach() latents before decoding them.

* put decoding in a torch.no_grad() context.

* add reconstructed image for debugging.

* no_grad().

* apply formatting.

* address one-off suggestions from the draft PR.

* back to torch.no_grad() and add more elaborate comments.

* refactor prepare_unet() per Patrick's suggestions.

* more elaborate description for .

* formatting.

* add docstrings to the methods specific to pix2pix zero.

* suspecting a redundant noise prediction.

* needed for gradient computation chain.

* less hacks.

* fix: attention mask handling within the processor.

* remove attention reference map computation.

* fix: cross attn args.

* fix: processor.

* store attention maps.

* fix: attention processor.

* update docs and better treatment to xa args.

* update the final noise computation call.

* change xa args call.

* remove xa args option from the pipeline.

* add: docs.

* first test.

* fix: url call.

* fix: argument call.

* remove image conditioning for now.

* 🚨 add: fast tests.

* explicit placement of the xa attn weights.

* add: slow tests 🐢

* fix: tests.

* edited direction embedding should be on the same device as prompt_embeds.

* debugging message.

* debugging.

* add pix2pix zero pipeline for a non-deterministic test.

* debugging.

* remove debugging message.

* make caption generation _

* address comments (part I).

* address PR comments (part II)

* fix: DDPM test assertion.

* refactor doc.

* address PR comments (part III).

* fix: type annotation for the scheduler.

* apply styling.

* skip_mps and add note on embeddings in the docs.
parent e5810e68
...@@ -151,6 +151,8 @@
title: Stable-Diffusion-Latent-Upscaler
- local: api/pipelines/stable_diffusion/pix2pix
title: InstructPix2Pix
- local: api/pipelines/stable_diffusion/pix2pix_zero
title: Pix2Pix Zero
title: Stable Diffusion
- local: api/pipelines/stable_diffusion_2
title: Stable Diffusion 2
...
...@@ -33,6 +33,7 @@ For more details about how Stable Diffusion works and how it differs from the ba
| [StableDiffusionUpscalePipeline](./upscale) | **Experimental** – *Text-Guided Image Super-Resolution * | | Coming soon
| [StableDiffusionLatentUpscalePipeline](./latent_upscale) | **Experimental** – *Text-Guided Image Super-Resolution * | | Coming soon
| [StableDiffusionInstructPix2PixPipeline](./pix2pix) | **Experimental** – *Text-Based Image Editing * | | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/spaces/timbrooks/instruct-pix2pix)
| [StableDiffusionPix2PixZeroPipeline](./pix2pix_zero) | **Experimental** – *Text-Based Image Editing * | | [Zero-shot Image-to-Image Translation](https://arxiv.org/abs/2302.03027)
...
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Zero-shot Image-to-Image Translation
## Overview
[Zero-shot Image-to-Image Translation](https://arxiv.org/abs/2302.03027) by Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu.
The abstract of the paper is the following:
*Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.*
Resources:
* [Project Page](https://pix2pixzero.github.io/).
* [Paper](https://arxiv.org/abs/2302.03027).
* [Original Code](https://github.com/pix2pixzero/pix2pix-zero).
## Tips
* The pipeline exposes two arguments, `source_embeds` and `target_embeds`,
that let you control the direction of the semantic edits in the final generated image. Say you
want to translate from "cat" to "dog"; the edit direction is then "cat -> dog". To reflect this in
the pipeline, pass the embeddings of phrases containing "cat" as `source_embeds` and those
containing "dog" as `target_embeds`. Refer to the code example below for more details.
* When you're using this pipeline from a prompt, specify the _source_ concept in the prompt. Taking
the above example, a valid input prompt would be: "a high resolution painting of a **cat** in the style of van gough".
* If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to:
* Swap the `source_embeds` and `target_embeds`.
* Change the input prompt to include "dog".
* To learn more about how the source and target embeddings are generated, refer to the [original
paper](https://arxiv.org/abs/2302.03027). A minimal sketch of one way to compute them is shown right after this list.
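
One way to obtain `source_embeds` and `target_embeds` is to encode a handful of captions that mention the source and target concepts with Stable Diffusion's CLIP text encoder and pass the stacked hidden states to the pipeline. The sketch below is only an illustration under that assumption: the captions are hypothetical placeholders, and the pre-computed embeddings linked in the usage example further down were produced by the authors from a much larger bank of sentences.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Hypothetical captions describing the source ("cat") and target ("dog") concepts.
source_captions = ["a photo of a cat", "a painting of a cat", "a cat sitting on a sofa"]
target_captions = ["a photo of a dog", "a painting of a dog", "a dog sitting on a sofa"]

# The tokenizer and text encoder that ship with Stable Diffusion v1-4.
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")


@torch.no_grad()
def embed_captions(captions):
    # Tokenize to the text encoder's fixed sequence length and return the per-caption
    # hidden states, shape [num_captions, seq_len, hidden_dim].
    input_ids = tokenizer(
        captions,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids
    return text_encoder(input_ids)[0]


source_embeds = embed_captions(source_captions)
target_embeds = embed_captions(target_captions)
```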
## Available Pipelines:
| Pipeline | Tasks | Demo |
|---|---|:---:|
| [StableDiffusionPix2PixZeroPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_pix2pix_zero.py) | *Text-Based Image Editing* | [🤗 Space] (soon) |
<!-- TODO: add Colab -->
## Usage example
**Based on an image generated with the input prompt**
```python
import requests
import torch

from diffusers import DDIMScheduler, StableDiffusionPix2PixZeroPipeline


def download(embedding_url, local_filepath):
    r = requests.get(embedding_url)
    with open(local_filepath, "wb") as f:
        f.write(r.content)


model_ckpt = "CompVis/stable-diffusion-v1-4"
pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    model_ckpt, conditions_input_image=False, torch_dtype=torch.float16
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.to("cuda")

prompt = "a high resolution painting of a cat in the style of van gough"
# Pre-computed source ("cat") and target ("dog") embeddings released by the authors.
src_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/cat.pt"
target_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/dog.pt"

for url in [src_embs_url, target_embs_url]:
    download(url, url.split("/")[-1])

src_embeds = torch.load(src_embs_url.split("/")[-1])
target_embeds = torch.load(target_embs_url.split("/")[-1])

images = pipeline(
    prompt,
    source_embeds=src_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images
images[0].save("edited_image_dog.png")
```
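Following the tip about reversing the edit direction, the snippet below reuses the `pipeline`, `src_embeds`, and `target_embeds` from the example above; it simply swaps the embeddings and mentions "dog" in the prompt.

```python
# Reverse the edit direction ("dog -> cat"): swap the embeddings and specify the
# new source concept ("dog") in the prompt.
prompt = "a high resolution painting of a dog in the style of van gough"

images = pipeline(
    prompt,
    source_embeds=target_embeds,  # embeddings for "dog"
    target_embeds=src_embeds,  # embeddings for "cat"
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images
images[0].save("edited_image_cat.png")
```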
**Based on an input image**
_Coming soon_
## StableDiffusionPix2PixZeroPipeline
[[autodoc]] StableDiffusionPix2PixZeroPipeline
- __call__
- all
...@@ -118,6 +118,7 @@ else:
StableDiffusionLatentUpscalePipeline,
StableDiffusionPipeline,
StableDiffusionPipelineSafe,
StableDiffusionPix2PixZeroPipeline,
StableDiffusionUpscalePipeline,
StableUnCLIPImg2ImgPipeline,
StableUnCLIPPipeline,
...
...@@ -54,6 +54,7 @@ else:
StableDiffusionInstructPix2PixPipeline,
StableDiffusionLatentUpscalePipeline,
StableDiffusionPipeline,
StableDiffusionPix2PixZeroPipeline,
StableDiffusionUpscalePipeline,
StableUnCLIPImg2ImgPipeline,
StableUnCLIPPipeline,
...
...@@ -66,6 +66,7 @@ except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import StableDiffusionDepth2ImgPipeline
else:
from .pipeline_stable_diffusion_depth2img import StableDiffusionDepth2ImgPipeline
from .pipeline_stable_diffusion_pix2pix_zero import StableDiffusionPix2PixZeroPipeline
try:
...
...@@ -212,6 +212,21 @@ class StableDiffusionPipelineSafe(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class StableDiffusionPix2PixZeroPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class StableDiffusionUpscalePipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
...
...@@ -408,9 +408,9 @@ def requires_backends(obj, backends):
" --upgrade transformers \n```"
)
if name in ["StableDiffusionDepth2ImgPipeline", "StableDiffusionPix2PixZeroPipeline"] and is_transformers_version(
"<", "4.26.0"
):
raise ImportError(
f"You need to install `transformers>=4.26` in order to use {name}: \n```\n pip install"
" --upgrade transformers \n```"
...
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import unittest
import numpy as np
import requests
import torch
from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
from diffusers import (
AutoencoderKL,
DDIMScheduler,
DDPMScheduler,
EulerAncestralDiscreteScheduler,
LMSDiscreteScheduler,
StableDiffusionPix2PixZeroPipeline,
UNet2DConditionModel,
)
from diffusers.utils import slow, torch_device
from diffusers.utils.testing_utils import require_torch_gpu, skip_mps
from ...test_pipelines_common import PipelineTesterMixin
torch.backends.cuda.matmul.allow_tf32 = False
def download_from_url(embedding_url, local_filepath):
r = requests.get(embedding_url)
with open(local_filepath, "wb") as f:
f.write(r.content)
@skip_mps
class StableDiffusionPix2PixZeroPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = StableDiffusionPix2PixZeroPipeline
def get_dummy_components(self):
torch.manual_seed(0)
unet = UNet2DConditionModel(
block_out_channels=(32, 64),
layers_per_block=2,
sample_size=32,
in_channels=4,
out_channels=4,
down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
cross_attention_dim=32,
)
scheduler = DDIMScheduler()
torch.manual_seed(0)
vae = AutoencoderKL(
block_out_channels=[32, 64],
in_channels=3,
out_channels=3,
down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
latent_channels=4,
)
torch.manual_seed(0)
text_encoder_config = CLIPTextConfig(
bos_token_id=0,
eos_token_id=2,
hidden_size=32,
intermediate_size=37,
layer_norm_eps=1e-05,
num_attention_heads=4,
num_hidden_layers=5,
pad_token_id=1,
vocab_size=1000,
)
text_encoder = CLIPTextModel(text_encoder_config)
tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
components = {
"unet": unet,
"scheduler": scheduler,
"vae": vae,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
"safety_checker": None,
"feature_extractor": None,
}
return components
def get_dummy_inputs(self, device, seed=0):
src_emb_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/src_emb_0.pt"
tgt_emb_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/tgt_emb_0.pt"
for url in [src_emb_url, tgt_emb_url]:
download_from_url(url, url.split("/")[-1])
src_embeds = torch.load(src_emb_url.split("/")[-1])
target_embeds = torch.load(tgt_emb_url.split("/")[-1])
generator = torch.manual_seed(seed)
inputs = {
"prompt": "A painting of a squirrel eating a burger",
"generator": generator,
"num_inference_steps": 2,
"guidance_scale": 6.0,
"cross_attention_guidance_amount": 0.15,
"source_embeds": src_embeds,
"target_embeds": target_embeds,
"output_type": "numpy",
}
return inputs
def test_stable_diffusion_pix2pix_zero_default_case(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
sd_pipe = StableDiffusionPix2PixZeroPipeline(**components)
sd_pipe = sd_pipe.to(device)
sd_pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
image = sd_pipe(**inputs).images
image_slice = image[0, -3:, -3:, -1]
assert image.shape == (1, 64, 64, 3)
expected_slice = np.array([0.5184, 0.503, 0.4917, 0.4022, 0.3455, 0.464, 0.5324, 0.5323, 0.4894])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
def test_stable_diffusion_pix2pix_zero_negative_prompt(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
sd_pipe = StableDiffusionPix2PixZeroPipeline(**components)
sd_pipe = sd_pipe.to(device)
sd_pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
negative_prompt = "french fries"
output = sd_pipe(**inputs, negative_prompt=negative_prompt)
image = output.images
image_slice = image[0, -3:, -3:, -1]
assert image.shape == (1, 64, 64, 3)
expected_slice = np.array([0.5464, 0.5072, 0.5012, 0.4124, 0.3624, 0.466, 0.5413, 0.5468, 0.4927])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
def test_stable_diffusion_pix2pix_zero_euler(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
components["scheduler"] = EulerAncestralDiscreteScheduler(
beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
)
sd_pipe = StableDiffusionPix2PixZeroPipeline(**components)
sd_pipe = sd_pipe.to(device)
sd_pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
image = sd_pipe(**inputs).images
image_slice = image[0, -3:, -3:, -1]
assert image.shape == (1, 64, 64, 3)
expected_slice = np.array([0.5114, 0.5051, 0.5222, 0.5279, 0.5037, 0.5156, 0.4604, 0.4966, 0.504])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
def test_stable_diffusion_pix2pix_zero_ddpm(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
components["scheduler"] = DDPMScheduler()
sd_pipe = StableDiffusionPix2PixZeroPipeline(**components)
sd_pipe = sd_pipe.to(device)
sd_pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
image = sd_pipe(**inputs).images
image_slice = image[0, -3:, -3:, -1]
assert image.shape == (1, 64, 64, 3)
expected_slice = np.array([0.5185, 0.5027, 0.492, 0.401, 0.3445, 0.464, 0.5321, 0.5327, 0.4892])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
def test_stable_diffusion_pix2pix_zero_num_images_per_prompt(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
sd_pipe = StableDiffusionPix2PixZeroPipeline(**components)
sd_pipe = sd_pipe.to(device)
sd_pipe.set_progress_bar_config(disable=None)
# test num_images_per_prompt=1 (default)
inputs = self.get_dummy_inputs(device)
images = sd_pipe(**inputs).images
assert images.shape == (1, 64, 64, 3)
# test num_images_per_prompt=2 for a single prompt
num_images_per_prompt = 2
inputs = self.get_dummy_inputs(device)
images = sd_pipe(**inputs, num_images_per_prompt=num_images_per_prompt).images
assert images.shape == (num_images_per_prompt, 64, 64, 3)
# test num_images_per_prompt for batch of prompts
batch_size = 2
inputs = self.get_dummy_inputs(device)
inputs["prompt"] = [inputs["prompt"]] * batch_size
images = sd_pipe(**inputs, num_images_per_prompt=num_images_per_prompt).images
assert images.shape == (batch_size * num_images_per_prompt, 64, 64, 3)
@slow
@require_torch_gpu
class StableDiffusionPix2PixZeroPipelineSlowTests(unittest.TestCase):
def tearDown(self):
super().tearDown()
gc.collect()
torch.cuda.empty_cache()
def get_inputs(self, seed=0):
generator = torch.manual_seed(seed)
src_emb_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/cat.pt"
tgt_emb_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/dog.pt"
for url in [src_emb_url, tgt_emb_url]:
download_from_url(url, url.split("/")[-1])
src_embeds = torch.load(src_emb_url.split("/")[-1])
target_embeds = torch.load(tgt_emb_url.split("/")[-1])
inputs = {
"prompt": "turn him into a cyborg",
"generator": generator,
"num_inference_steps": 3,
"guidance_scale": 7.5,
"cross_attention_guidance_amount": 0.15,
"source_embeds": src_embeds,
"target_embeds": target_embeds,
"output_type": "numpy",
}
return inputs
def test_stable_diffusion_pix2pix_zero_default(self):
pipe = StableDiffusionPix2PixZeroPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", safety_checker=None, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
pipe.enable_attention_slicing()
inputs = self.get_inputs()
image = pipe(**inputs).images
image_slice = image[0, -3:, -3:, -1].flatten()
assert image.shape == (1, 512, 512, 3)
expected_slice = np.array([0.4705, 0.4771, 0.4832, 0.4783, 0.4495, 0.447, 0.4658, 0.4568, 0.438])
assert np.abs(expected_slice - image_slice).max() < 1e-3
def test_stable_diffusion_pix2pix_zero_k_lms(self):
pipe = StableDiffusionPix2PixZeroPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", safety_checker=None, torch_dtype=torch.float16
)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
pipe.enable_attention_slicing()
inputs = self.get_inputs()
image = pipe(**inputs).images
image_slice = image[0, -3:, -3:, -1].flatten()
assert image.shape == (1, 512, 512, 3)
expected_slice = np.array([0.6514, 0.5571, 0.5244, 0.5591, 0.4998, 0.4834, 0.502, 0.468, 0.4663])
assert np.abs(expected_slice - image_slice).max() < 1e-3
def test_stable_diffusion_pix2pix_zero_intermediate_state(self):
number_of_steps = 0
def callback_fn(step: int, timestep: int, latents: torch.FloatTensor) -> None:
callback_fn.has_been_called = True
nonlocal number_of_steps
number_of_steps += 1
if step == 1:
latents = latents.detach().cpu().numpy()
assert latents.shape == (1, 4, 64, 64)
latents_slice = latents[0, -3:, -3:, -1]
expected_slice = np.array(
[-0.5176, 0.0669, -0.1963, -0.1653, -0.7856, -0.2871, -0.5562, -0.0096, -0.012]
)
assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
elif step == 2:
latents = latents.detach().cpu().numpy()
assert latents.shape == (1, 4, 64, 64)
latents_slice = latents[0, -3:, -3:, -1]
expected_slice = np.array(
[-0.5127, 0.0613, -0.1937, -0.1622, -0.7856, -0.2849, -0.5601, -0.0111, -0.0137]
)
assert np.abs(latents_slice.flatten() - expected_slice).max() < 5e-2
callback_fn.has_been_called = False
pipe = StableDiffusionPix2PixZeroPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", safety_checker=None, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
pipe.enable_attention_slicing()
inputs = self.get_inputs()
pipe(**inputs, callback=callback_fn, callback_steps=1)
assert callback_fn.has_been_called
assert number_of_steps == 3
def test_stable_diffusion_pipeline_with_sequential_cpu_offloading(self):
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
pipe = StableDiffusionPix2PixZeroPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4", safety_checker=None, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(torch_device)
pipe.set_progress_bar_config(disable=None)
pipe.enable_attention_slicing(1)
pipe.enable_sequential_cpu_offload()
inputs = self.get_inputs()
_ = pipe(**inputs)
mem_bytes = torch.cuda.max_memory_allocated()
# make sure that less than 8.2 GB is allocated
assert mem_bytes < 8.2 * 10**9
...@@ -191,10 +191,16 @@ class PipelineTesterMixin:
def _test_inference_batch_single_identical(
self, test_max_difference=None, test_mean_pixel_difference=None, relax_max_difference=False
):
if self.pipeline_class.__name__ in [
"CycleDiffusionPipeline",
"RePaintPipeline",
"StableDiffusionPix2PixZeroPipeline",
]:
# RePaint can hardly be made deterministic since the scheduler is currently always
# nondeterministic
# CycleDiffusion is also slightly nondeterministic
# Pix2PixZero runs a training loop guided by edit directions, which is why it is
# slightly non-deterministic.
return
if test_max_difference is None:
...