Unverified Commit 12358b98 authored by Chong Mou, committed by GitHub

add models for T2I-Adapter-XL (#4696)



* T2I-Adapter-XL

* update

* update

* add pipeline

* modify pipeline

* modify pipeline

* modify pipeline

* modify pipeline

* modify pipeline

* modify modeling_text_unet

* fix styling.

* fix: copies.

* adapter settings

* new test case

* new test case

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* revert prints.

* new test case

* remove print

* org test case

* add test_pipeline

* styling.

* fix copies.

* modify test parameter

* style.

* add adapter-xl doc

* double quotes in docs

* Fix potential type mismatch

* style.

---------
Co-authored-by: sayakpaul <spsayakpaul@gmail.com>
parent 5eeedd9e
@@ -29,10 +29,11 @@ This model was contributed by the community contributor [HimariO](https://github
| Pipeline | Tasks | Demo
|---|---|:---:|
| [StableDiffusionAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning* | -
| [StableDiffusionXLAdapterPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_xl_adapter.py) | *Text-to-Image Generation with T2I-Adapter Conditioning on StableDiffusion-XL* | -
## Usage example with the base model of StableDiffusion-1.4/1.5
In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.
All adapters use the same pipeline.
1. Images are first converted into the appropriate *control image* format.
@@ -93,6 +94,62 @@ out_image = pipe(
![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png)
## Usage example with the base model of StableDiffusion-XL
In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-XL.
All adapters use the same pipeline.
1. Images are first downloaded and converted into the appropriate *control image* format.
2. The *control image* and *prompt* are passed to the [`StableDiffusionXLAdapterPipeline`].
Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0).
```python
from diffusers.utils import load_image
sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")
```
![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png)
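If you start from an ordinary photo rather than a ready-made sketch, one option is to derive the control image with an edge/sketch detector such as `PidiNetDetector` from the `controlnet_aux` package. This is only a hedged sketch and not part of this PR; the detector name and its API are assumptions that may differ across `controlnet_aux` versions.
```py
from PIL import Image
from controlnet_aux import PidiNetDetector  # assumed helper, not part of this PR

# Hypothetical example: turn any RGB photo into a sketch-style, single-channel control image.
photo = Image.open("your_photo.png")
pidinet = PidiNetDetector.from_pretrained("lllyasviel/Annotators")
sketch_image = pidinet(photo).convert("L")
```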
Then, create the adapter pipeline
```py
import torch
from diffusers import (
    T2IAdapter,
    StableDiffusionXLAdapterPipeline,
    DDPMScheduler,
)

model_id = "stabilityai/stable-diffusion-xl-base-1.0"

# load the sketch adapter weights from the `sketch_sdxl_1.0` subfolder
adapter = T2IAdapter.from_pretrained(
    "Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl"
)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    model_id, adapter=adapter, safety_checker=None, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
)
pipe.to("cuda")
```
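Optionally, the memory helpers listed for [`StableDiffusionXLAdapterPipeline`] in the API section below can be enabled before inference; a hedged example (`enable_xformers_memory_efficient_attention` additionally requires xformers to be installed):
```py
# Optional: trade some speed for lower memory usage.
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()
```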
Finally, pass the prompt and control image to the pipeline
```py
# fix the random seed, so you will get the same result as the example
generator = torch.Generator().manual_seed(42)

sketch_image_out = pipe(
    prompt="a photo of a dog in real world, high quality",
    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
    image=sketch_image,
    generator=generator,
    guidance_scale=7.5,
).images[0]
```
![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png)
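The strength of the adapter conditioning can presumably be tuned the same way as with `StableDiffusionAdapterPipeline`, via an `adapter_conditioning_scale` argument; this is an assumption about the XL pipeline's call signature rather than something shown in this PR:
```py
# Assumed to mirror StableDiffusionAdapterPipeline: values below 1.0 weaken the sketch guidance.
softer_sketch_out = pipe(
    prompt="a photo of a dog in real world, high quality",
    negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
    image=sketch_image,
    generator=generator,
    guidance_scale=7.5,
    adapter_conditioning_scale=0.8,
).images[0]
```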
## Available checkpoints
@@ -113,6 +170,9 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu
|[TencentARC/t2iadapter_depth_sd15v2](https://huggingface.co/TencentARC/t2iadapter_depth_sd15v2)||
|[TencentARC/t2iadapter_sketch_sd15v2](https://huggingface.co/TencentARC/t2iadapter_sketch_sd15v2)||
|[TencentARC/t2iadapter_zoedepth_sd15v1](https://huggingface.co/TencentARC/t2iadapter_zoedepth_sd15v1)||
|[Adapter/t2iadapter, subfolder='sketch_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0)||
|[Adapter/t2iadapter, subfolder='canny_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/canny_sdxl_1.0)||
|[Adapter/t2iadapter, subfolder='openpose_sdxl_1.0'](https://huggingface.co/Adapter/t2iadapter/tree/main/openpose_sdxl_1.0)||
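The SDXL checkpoints above all live as subfolders of the `Adapter/t2iadapter` repository, so they load with the same pattern as the sketch adapter in the usage example; for instance, a minimal sketch for the canny checkpoint:
```py
import torch
from diffusers import T2IAdapter

canny_adapter = T2IAdapter.from_pretrained(
    "Adapter/t2iadapter",
    subfolder="canny_sdxl_1.0",
    torch_dtype=torch.float16,
    adapter_type="full_adapter_xl",
)
```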
## Combining multiple adapters
@@ -185,3 +245,14 @@ However, T2I-Adapter performs slightly worse than ControlNet.
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
## StableDiffusionXLAdapterPipeline
[[autodoc]] StableDiffusionXLAdapterPipeline
- all
- __call__
- enable_attention_slicing
- disable_attention_slicing
- enable_vae_slicing
- disable_vae_slicing
- enable_xformers_memory_efficient_attention
- disable_xformers_memory_efficient_attention
@@ -191,6 +191,7 @@ else:
        StableDiffusionPix2PixZeroPipeline,
        StableDiffusionSAGPipeline,
        StableDiffusionUpscalePipeline,
        StableDiffusionXLAdapterPipeline,
        StableDiffusionXLControlNetImg2ImgPipeline,
        StableDiffusionXLControlNetPipeline,
        StableDiffusionXLImg2ImgPipeline,
...
@@ -128,6 +128,8 @@ class T2IAdapter(ModelMixin, ConfigMixin):
        if adapter_type == "full_adapter":
            self.adapter = FullAdapter(in_channels, channels, num_res_blocks, downscale_factor)
        elif adapter_type == "full_adapter_xl":
            self.adapter = FullAdapterXL(in_channels, channels, num_res_blocks, downscale_factor)
        elif adapter_type == "light_adapter":
            self.adapter = LightAdapter(in_channels, channels, num_res_blocks, downscale_factor)
        else:
@@ -184,6 +186,48 @@ class FullAdapter(nn.Module):
        return features
class FullAdapterXL(nn.Module):
    def __init__(
        self,
        in_channels: int = 3,
        channels: List[int] = [320, 640, 1280, 1280],
        num_res_blocks: int = 2,
        downscale_factor: int = 16,
    ):
        super().__init__()

        in_channels = in_channels * downscale_factor**2
        self.unshuffle = nn.PixelUnshuffle(downscale_factor)
        self.conv_in = nn.Conv2d(in_channels, channels[0], kernel_size=3, padding=1)

        self.body = []
        # blocks to extract XL features with dimensions of [320, 64, 64], [640, 64, 64], [1280, 32, 32], [1280, 32, 32]
        for i in range(len(channels)):
            if i == 1:
                self.body.append(AdapterBlock(channels[i - 1], channels[i], num_res_blocks))
            elif i == 2:
                self.body.append(AdapterBlock(channels[i - 1], channels[i], num_res_blocks, down=True))
            else:
                self.body.append(AdapterBlock(channels[i], channels[i], num_res_blocks))

        self.body = nn.ModuleList(self.body)
        # XL has one fewer downsampling
        self.total_downscale_factor = downscale_factor * 2 ** (len(channels) - 2)

    def forward(self, x: torch.Tensor) -> List[torch.Tensor]:
        x = self.unshuffle(x)
        x = self.conv_in(x)

        features = []
        for block in self.body:
            x = block(x)
            features.append(x)

        return features
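For intuition, here is a minimal standalone sketch (not part of this diff) that instantiates `T2IAdapter` with the new `adapter_type="full_adapter_xl"` switch and inspects the feature pyramid it returns; the expected shapes follow the comment inside `FullAdapterXL` and assume a hypothetical 1024x1024 control image:
```py
import torch
from diffusers import T2IAdapter

adapter = T2IAdapter(
    in_channels=3,
    channels=[320, 640, 1280, 1280],
    num_res_blocks=2,
    downscale_factor=16,
    adapter_type="full_adapter_xl",
)

control = torch.randn(1, 3, 1024, 1024)  # hypothetical 1024x1024 control image
with torch.no_grad():
    features = adapter(control)

# Expected per the comment above: [320, 64, 64], [640, 64, 64], [1280, 32, 32], [1280, 32, 32]
for feature in features:
    print(tuple(feature.shape))
```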
class AdapterBlock(nn.Module):
    def __init__(self, in_channels, out_channels, num_res_blocks, down=False):
        super().__init__()
...
@@ -965,6 +965,13 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin)
                cross_attention_kwargs=cross_attention_kwargs,
                encoder_attention_mask=encoder_attention_mask,
            )

        # To support T2I-Adapter-XL
        if (
            is_adapter
            and len(down_block_additional_residuals) > 0
            and sample.shape == down_block_additional_residuals[0].shape
        ):
            sample += down_block_additional_residuals.pop(0)

        if is_controlnet:
            sample = sample + mid_block_additional_residual
...
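As a standalone illustration of the gate above (not part of this diff): with the XL adapter, the leftover feature has the same shape as the sample leaving the mid block, so it is consumed there. The tensor shapes below are hypothetical and follow the `FullAdapterXL` comment for a 1024x1024 control image:
```py
import torch

sample = torch.randn(1, 1280, 32, 32)  # sample after the mid block (hypothetical shape)
down_block_additional_residuals = [torch.randn(1, 1280, 32, 32)]  # leftover adapter feature
is_adapter = True

if (
    is_adapter
    and len(down_block_additional_residuals) > 0
    and sample.shape == down_block_additional_residuals[0].shape
):
    sample += down_block_additional_residuals.pop(0)

print(len(down_block_additional_residuals))  # 0 -> the leftover residual was added to the sample
```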
@@ -118,7 +118,7 @@ else:
        StableDiffusionXLInstructPix2PixPipeline,
        StableDiffusionXLPipeline,
    )
    from .t2i_adapter import StableDiffusionAdapterPipeline, StableDiffusionXLAdapterPipeline
    from .text_to_video_synthesis import TextToVideoSDPipeline, TextToVideoZeroPipeline, VideoToVideoSDPipeline
    from .unclip import UnCLIPImageVariationPipeline, UnCLIPPipeline
    from .unidiffuser import ImageTextPipelineOutput, UniDiffuserModel, UniDiffuserPipeline, UniDiffuserTextDecoder
...
@@ -12,3 +12,4 @@ except OptionalDependencyNotAvailable:
    from ...utils.dummy_torch_and_transformers_objects import *  # noqa F403
else:
    from .pipeline_stable_diffusion_adapter import StableDiffusionAdapterPipeline
    from .pipeline_stable_diffusion_xl_adapter import StableDiffusionXLAdapterPipeline
@@ -1137,6 +1137,13 @@ class UNetFlatConditionModel(ModelMixin, ConfigMixin):
                cross_attention_kwargs=cross_attention_kwargs,
                encoder_attention_mask=encoder_attention_mask,
            )

        # To support T2I-Adapter-XL
        if (
            is_adapter
            and len(down_block_additional_residuals) > 0
            and sample.shape == down_block_additional_residuals[0].shape
        ):
            sample += down_block_additional_residuals.pop(0)

        if is_controlnet:
            sample = sample + mid_block_additional_residual
...
@@ -902,6 +902,21 @@ class StableDiffusionUpscalePipeline(metaclass=DummyObject):
        requires_backends(cls, ["torch", "transformers"])


class StableDiffusionXLAdapterPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])


class StableDiffusionXLControlNetImg2ImgPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]
...
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import unittest
import numpy as np
import torch
from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
from diffusers import (
AutoencoderKL,
EulerDiscreteScheduler,
StableDiffusionXLAdapterPipeline,
T2IAdapter,
UNet2DConditionModel,
)
from diffusers.utils import floats_tensor
from diffusers.utils.testing_utils import enable_full_determinism
from ..pipeline_params import TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS, TEXT_GUIDED_IMAGE_VARIATION_PARAMS
from ..test_pipelines_common import PipelineTesterMixin
enable_full_determinism()
class StableDiffusionXLAdapterPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = StableDiffusionXLAdapterPipeline
    params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS
    batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
    def get_dummy_components(self):
        torch.manual_seed(0)
        unet = UNet2DConditionModel(
            block_out_channels=(32, 64),
            layers_per_block=2,
            sample_size=32,
            in_channels=4,
            out_channels=4,
            down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
            up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
            # SD2-specific config below
            attention_head_dim=(2, 4),
            use_linear_projection=True,
            addition_embed_type="text_time",
            addition_time_embed_dim=8,
            transformer_layers_per_block=(1, 2),
            projection_class_embeddings_input_dim=80,  # 6 * 8 + 32
            cross_attention_dim=64,
        )
        scheduler = EulerDiscreteScheduler(
            beta_start=0.00085,
            beta_end=0.012,
            steps_offset=1,
            beta_schedule="scaled_linear",
            timestep_spacing="leading",
        )
        torch.manual_seed(0)
        vae = AutoencoderKL(
            block_out_channels=[32, 64],
            in_channels=3,
            out_channels=3,
            down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
            up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
            latent_channels=4,
            sample_size=128,
        )
        torch.manual_seed(0)
        text_encoder_config = CLIPTextConfig(
            bos_token_id=0,
            eos_token_id=2,
            hidden_size=32,
            intermediate_size=37,
            layer_norm_eps=1e-05,
            num_attention_heads=4,
            num_hidden_layers=5,
            pad_token_id=1,
            vocab_size=1000,
            # SD2-specific config below
            hidden_act="gelu",
            projection_dim=32,
        )
        text_encoder = CLIPTextModel(text_encoder_config)
        tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
        text_encoder_2 = CLIPTextModelWithProjection(text_encoder_config)
        tokenizer_2 = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")

        adapter = T2IAdapter(
            in_channels=3,
            channels=[32, 64],
            num_res_blocks=2,
            downscale_factor=4,
            adapter_type="full_adapter_xl",
        )

        components = {
            "adapter": adapter,
            "unet": unet,
            "scheduler": scheduler,
            "vae": vae,
            "text_encoder": text_encoder,
            "tokenizer": tokenizer,
            "text_encoder_2": text_encoder_2,
            "tokenizer_2": tokenizer_2,
            # "safety_checker": None,
            # "feature_extractor": None,
        }
        return components
    def get_dummy_inputs(self, device, seed=0):
        image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
        if str(device).startswith("mps"):
            generator = torch.manual_seed(seed)
        else:
            generator = torch.Generator(device=device).manual_seed(seed)
        inputs = {
            "prompt": "A painting of a squirrel eating a burger",
            "image": image,
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 5.0,
            "output_type": "numpy",
        }
        return inputs
    def test_stable_diffusion_adapter_default_case(self):
        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
        components = self.get_dummy_components()
        sd_pipe = StableDiffusionXLAdapterPipeline(**components)
        sd_pipe = sd_pipe.to(device)
        sd_pipe.set_progress_bar_config(disable=None)

        inputs = self.get_dummy_inputs(device)
        image = sd_pipe(**inputs).images
        image_slice = image[0, -3:, -3:, -1]

        assert image.shape == (1, 64, 64, 3)
        expected_slice = np.array(
            [0.5752919, 0.6022097, 0.4728038, 0.49861962, 0.57084894, 0.4644975, 0.5193715, 0.5133664, 0.4729858]
        )
        assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-3