Unverified Commit bc9a8cef authored by Patrick von Platen, committed by GitHub

[SD-XL] Add new pipelines (#3859)



* Add new text encoder

* add transformers depth

* More

* Correct conversion script

* Fix more

* Fix more

* Correct more

* correct text encoder

* Finish all

* proof that it works in run local xl

* clean up

* Get refiner to work

* Add red castle

* Fix batch size

* Improve pipelines more

* Finish text2image tests

* Add img2img test

* Fix more

* fix import

* Fix embeddings for classic models (#3888)

Fix embeddings for classic SD models.

* Allow multiple prompts to be passed to the refiner (#3895)

* finish more

* Apply suggestions from code review

* add watermarker

* Model offload (#3889)

* Model offload.

* Model offload for refiner / img2img

* Hardcode encoder offload on img2img vae encode

Saves some GPU RAM in img2img / refiner tasks so it remains below 8 GB.

---------
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* correct

* fix

* clean print

* Update install warning for `invisible-watermark`

* add: missing docstrings.

* fix and simplify the usage example in img2img.

* fix setup for watermarking.

* Revert "fix setup for watermarking."

This reverts commit 491bc9f5a640bbf46a97a8e52d6eff7e70eb8e4b.

* fix: watermarking setup.

* fix: op.

* run make fix-copies.

* make sure tests pass

* improve convert

* make tests pass

* make tests pass

* better error message

* finish

* finish

* Fix final test

---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
parent b62d9a1f
@@ -9,13 +9,20 @@ on:
     - v*-patch

 jobs:
   build:
-    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
-    with:
-      commit_sha: ${{ github.sha }}
-      package: diffusers
-      notebook_folder: diffusers_doc
-      languages: en ko zh
+    steps:
+      - name: Install dependencies
+        run: |
+          apt-get update && apt-get install libsndfile1-dev libgl1 -y
+
+      - name: Build doc
+        uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+        with:
+          commit_sha: ${{ github.sha }}
+          package: diffusers
+          notebook_folder: diffusers_doc
+          languages: en ko zh
     secrets:
       token: ${{ secrets.HUGGINGFACE_PUSH }}
       hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
@@ -9,9 +9,15 @@ concurrency:

 jobs:
   build:
-    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
-    with:
-      commit_sha: ${{ github.event.pull_request.head.sha }}
-      pr_number: ${{ github.event.number }}
-      package: diffusers
-      languages: en ko
+    steps:
+      - name: Install dependencies
+        run: |
+          apt-get update && apt-get install libsndfile1-dev libgl1 -y
+
+      - name: Build doc
+        uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
+        with:
+          commit_sha: ${{ github.event.pull_request.head.sha }}
+          pr_number: ${{ github.event.number }}
+          package: diffusers
+          languages: en ko zh
@@ -62,7 +62,7 @@ jobs:
       - name: Install dependencies
         run: |
-          apt-get update && apt-get install libsndfile1-dev -y
+          apt-get update && apt-get install libsndfile1-dev libgl1 -y
           python -m pip install -e .[quality,test]

       - name: Environment
...
@@ -14,6 +14,7 @@ RUN apt update && \
     libsndfile1-dev \
     python3.8 \
     python3-pip \
+    libgl1 \
     python3.8-venv && \
     rm -rf /var/lib/apt/lists
@@ -27,6 +28,7 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     torch \
     torchvision \
     torchaudio \
+    invisible_watermark \
     --extra-index-url https://download.pytorch.org/whl/cpu && \
     python3 -m pip install --no-cache-dir \
     accelerate \
@@ -40,4 +42,4 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     tensorboard \
     transformers
-CMD ["/bin/bash"]
\ No newline at end of file
+CMD ["/bin/bash"]
@@ -12,6 +12,7 @@ RUN apt update && \
     curl \
     ca-certificates \
     libsndfile1-dev \
+    libgl1 \
     python3.8 \
     python3-pip \
     python3.8-venv && \
@@ -26,7 +27,8 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
     python3 -m pip install --no-cache-dir \
     torch \
     torchvision \
-    torchaudio && \
+    torchaudio \
+    invisible_watermark && \
     python3 -m pip install --no-cache-dir \
     accelerate \
     datasets \
...
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Diffusion XL

Stable Diffusion XL (SD-XL) is a text-to-image _latent diffusion_ model developed by [Stability AI](https://stability.ai/), building on the earlier [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) releases.

Compared to previous Stable Diffusion models, SD-XL uses a larger UNet, adds a second text encoder (OpenCLIP ViT-bigG) alongside the original CLIP text encoder, and conditions generation on additional image-size and cropping parameters. The base model generates images at a default resolution of 1024x1024 pixels, and a dedicated refiner model can be applied to the output of the base model in an image-to-image step to further improve detail.
## Tips
### Available checkpoints:
- *Text-to-Image (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-base-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9) with [`StableDiffusionXLPipeline`]
- *Image-to-Image / Refiner (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-refiner-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9) with [`StableDiffusionXLImg2ImgPipeline`]
TODO
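Below is a minimal usage sketch, assuming the 0.9 checkpoints listed above are accessible and that the `invisible-watermark` dependency is installed; the prompt and file name are illustrative.

```python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline, StableDiffusionXLPipeline

# Base text-to-image pipeline (fp16 to keep memory usage reasonable).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
)
pipe.to("cuda")

prompt = "a red castle on a cliff at sunset, highly detailed"
image = pipe(prompt=prompt).images[0]

# Optionally pass the result through the refiner (an img2img pipeline).
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16
)
refiner.to("cuda")
image = refiner(prompt=prompt, image=image).images[0]
image.save("castle.png")
```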
## StableDiffusionXLPipeline
[[autodoc]] StableDiffusionXLPipeline
- all
- __call__
## StableDiffusionXLImg2ImgPipeline
[[autodoc]] StableDiffusionXLImg2ImgPipeline
- all
- __call__
@@ -126,6 +126,13 @@ if __name__ == "__main__":
         "--controlnet", action="store_true", default=None, help="Set flag if this is a controlnet checkpoint."
     )
     parser.add_argument("--half", action="store_true", help="Save weights in half precision.")
+    parser.add_argument(
+        "--vae_path",
+        type=str,
+        default=None,
+        required=False,
+        help="Set to a path, hub id to an already converted vae to not convert it again.",
+    )
     args = parser.parse_args()

     pipe = download_from_original_stable_diffusion_ckpt(
@@ -144,6 +151,7 @@ if __name__ == "__main__":
         stable_unclip_prior=args.stable_unclip_prior,
         clip_stats_path=args.clip_stats_path,
         controlnet=args.controlnet,
+        vae_path=args.vae_path,
     )

     if args.half:
...
@@ -89,6 +89,7 @@ _deps = [
     "huggingface-hub>=0.13.2",
     "requests-mock==1.10.0",
     "importlib_metadata",
+    "invisible-watermark",
     "isort>=5.5.4",
     "jax>=0.2.8,!=0.3.2",
     "jaxlib>=0.1.65",
@@ -193,6 +194,7 @@ extras["test"] = deps_list(
     "compel",
     "datasets",
     "Jinja2",
+    "invisible-watermark",
     "k-diffusion",
     "librosa",
     "omegaconf",
...
@@ -5,6 +5,7 @@ from .utils import (
     OptionalDependencyNotAvailable,
     is_flax_available,
     is_inflect_available,
+    is_invisible_watermark_available,
     is_k_diffusion_available,
     is_k_diffusion_version,
     is_librosa_available,
@@ -179,6 +180,14 @@ else:
         VQDiffusionPipeline,
     )

+try:
+    if not (is_torch_available() and is_transformers_available() and is_invisible_watermark_available()):
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    from .utils.dummy_torch_and_transformers_and_invisible_watermark_objects import *  # noqa F403
+else:
+    from .pipelines import StableDiffusionXLImg2ImgPipeline, StableDiffusionXLPipeline
+
 try:
     if not (is_torch_available() and is_transformers_available() and is_k_diffusion_available()):
         raise OptionalDependencyNotAvailable()
...
@@ -13,6 +13,7 @@ deps = {
     "huggingface-hub": "huggingface-hub>=0.13.2",
     "requests-mock": "requests-mock==1.10.0",
     "importlib_metadata": "importlib_metadata",
+    "invisible-watermark": "invisible-watermark",
     "isort": "isort>=5.5.4",
     "jax": "jax>=0.2.8,!=0.3.2",
     "jaxlib": "jaxlib>=0.1.65",
...
@@ -1118,7 +1118,9 @@ class AttnProcessor2_0:
         value = attn.to_v(encoder_hidden_states)

         head_dim = inner_dim // attn.heads
+
         query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+
         key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
         value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
...
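For context, the reshape pattern touched by this hunk looks roughly like the standalone sketch below; the dimension values are illustrative, and `F.scaled_dot_product_attention` requires PyTorch 2.0.

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, inner_dim, heads = 2, 77, 320, 8
head_dim = inner_dim // heads

query = torch.randn(batch_size, seq_len, inner_dim)
key = torch.randn(batch_size, seq_len, inner_dim)
value = torch.randn(batch_size, seq_len, inner_dim)

# (batch, seq, inner_dim) -> (batch, heads, seq, head_dim), as in the lines above
query = query.view(batch_size, -1, heads, head_dim).transpose(1, 2)
key = key.view(batch_size, -1, heads, head_dim).transpose(1, 2)
value = value.view(batch_size, -1, heads, head_dim).transpose(1, 2)

out = F.scaled_dot_product_attention(query, key, value)
out = out.transpose(1, 2).reshape(batch_size, -1, inner_dim)
print(out.shape)  # torch.Size([2, 77, 320])
```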
@@ -38,6 +38,7 @@ def get_down_block(
     add_downsample,
     resnet_eps,
     resnet_act_fn,
+    transformer_layers_per_block=1,
     num_attention_heads=None,
     resnet_groups=None,
     cross_attention_dim=None,
@@ -111,6 +112,7 @@ def get_down_block(
             raise ValueError("cross_attention_dim must be specified for CrossAttnDownBlock2D")
         return CrossAttnDownBlock2D(
             num_layers=num_layers,
+            transformer_layers_per_block=transformer_layers_per_block,
             in_channels=in_channels,
             out_channels=out_channels,
             temb_channels=temb_channels,
@@ -232,6 +234,7 @@ def get_up_block(
     add_upsample,
     resnet_eps,
     resnet_act_fn,
+    transformer_layers_per_block=1,
     num_attention_heads=None,
     resnet_groups=None,
     cross_attention_dim=None,
@@ -287,6 +290,7 @@ def get_up_block(
             raise ValueError("cross_attention_dim must be specified for CrossAttnUpBlock2D")
         return CrossAttnUpBlock2D(
             num_layers=num_layers,
+            transformer_layers_per_block=transformer_layers_per_block,
             in_channels=in_channels,
             out_channels=out_channels,
             prev_output_channel=prev_output_channel,
@@ -517,6 +521,7 @@ class UNetMidBlock2DCrossAttn(nn.Module):
         temb_channels: int,
         dropout: float = 0.0,
         num_layers: int = 1,
+        transformer_layers_per_block: int = 1,
         resnet_eps: float = 1e-6,
         resnet_time_scale_shift: str = "default",
         resnet_act_fn: str = "swish",
@@ -559,7 +564,7 @@ class UNetMidBlock2DCrossAttn(nn.Module):
                     num_attention_heads,
                     in_channels // num_attention_heads,
                     in_channels=in_channels,
-                    num_layers=1,
+                    num_layers=transformer_layers_per_block,
                     cross_attention_dim=cross_attention_dim,
                     norm_num_groups=resnet_groups,
                     use_linear_projection=use_linear_projection,
@@ -862,6 +867,7 @@ class CrossAttnDownBlock2D(nn.Module):
         temb_channels: int,
         dropout: float = 0.0,
         num_layers: int = 1,
+        transformer_layers_per_block: int = 1,
         resnet_eps: float = 1e-6,
         resnet_time_scale_shift: str = "default",
         resnet_act_fn: str = "swish",
@@ -906,7 +912,7 @@ class CrossAttnDownBlock2D(nn.Module):
                         num_attention_heads,
                         out_channels // num_attention_heads,
                         in_channels=out_channels,
-                        num_layers=1,
+                        num_layers=transformer_layers_per_block,
                         cross_attention_dim=cross_attention_dim,
                         norm_num_groups=resnet_groups,
                         use_linear_projection=use_linear_projection,
@@ -1995,6 +2001,7 @@ class CrossAttnUpBlock2D(nn.Module):
         temb_channels: int,
         dropout: float = 0.0,
         num_layers: int = 1,
+        transformer_layers_per_block: int = 1,
         resnet_eps: float = 1e-6,
         resnet_time_scale_shift: str = "default",
         resnet_act_fn: str = "swish",
@@ -2040,7 +2047,7 @@ class CrossAttnUpBlock2D(nn.Module):
                         num_attention_heads,
                         out_channels // num_attention_heads,
                         in_channels=out_channels,
-                        num_layers=1,
+                        num_layers=transformer_layers_per_block,
                         cross_attention_dim=cross_attention_dim,
                         norm_num_groups=resnet_groups,
                         use_linear_projection=use_linear_projection,
...
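To illustrate the new argument, here is a small sketch that builds a toy `UNet2DConditionModel` with a different transformer depth per block; the values are illustrative and not the SD-XL configuration.

```python
import torch
from diffusers import UNet2DConditionModel

# Toy UNet: the second down block (and matching up block) stacks two
# BasicTransformerBlocks per attention layer instead of one.
unet = UNet2DConditionModel(
    sample_size=32,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    layers_per_block=1,
    cross_attention_dim=64,
    transformer_layers_per_block=(1, 2),
)

sample = torch.randn(1, 4, 32, 32)
encoder_hidden_states = torch.randn(1, 77, 64)
out = unet(sample, timestep=0, encoder_hidden_states=encoder_hidden_states).sample
print(out.shape)  # torch.Size([1, 4, 32, 32])
```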
@@ -98,7 +98,11 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         norm_eps (`float`, *optional*, defaults to 1e-5): The epsilon to use for the normalization.
         cross_attention_dim (`int` or `Tuple[int]`, *optional*, defaults to 1280):
             The dimension of the cross attention features.
+        transformer_layers_per_block (`int` or `Tuple[int]`, *optional*, defaults to 1):
+            The number of transformer blocks of type [`~models.attention.BasicTransformerBlock`]. Only relevant for
+            [`~models.unet_2d_blocks.CrossAttnDownBlock2D`], [`~models.unet_2d_blocks.CrossAttnUpBlock2D`],
+            [`~models.unet_2d_blocks.UNetMidBlock2DCrossAttn`].
         encoder_hid_dim (`int`, *optional*, defaults to `None`):
             If `encoder_hid_dim_type` is defined, `encoder_hidden_states` will be projected from `encoder_hid_dim`
             dimension to `cross_attention_dim`.
         encoder_hid_dim_type (`str`, *optional*, defaults to `None`):
@@ -115,6 +119,8 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         addition_embed_type (`str`, *optional*, defaults to `None`):
             Configures an optional embedding which will be summed with the time embeddings. Choose from `None` or
             "text". "text" will use the `TextTimeEmbedding` layer.
+        addition_time_embed_dim: (`int`, *optional*, defaults to `None`):
+            Dimension for the timestep embeddings.
         num_class_embeds (`int`, *optional*, defaults to `None`):
             Input dimension of the learnable embedding matrix to be projected to `time_embed_dim`, when performing
             class conditioning with `class_embed_type` equal to `None`.
@@ -170,6 +176,7 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         norm_num_groups: Optional[int] = 32,
         norm_eps: float = 1e-5,
         cross_attention_dim: Union[int, Tuple[int]] = 1280,
+        transformer_layers_per_block: Union[int, Tuple[int]] = 1,
         encoder_hid_dim: Optional[int] = None,
         encoder_hid_dim_type: Optional[str] = None,
         attention_head_dim: Union[int, Tuple[int]] = 8,
@@ -178,6 +185,7 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         use_linear_projection: bool = False,
         class_embed_type: Optional[str] = None,
         addition_embed_type: Optional[str] = None,
+        addition_time_embed_dim: Optional[int] = None,
         num_class_embeds: Optional[int] = None,
         upcast_attention: bool = False,
         resnet_time_scale_shift: str = "default",
@@ -351,6 +359,10 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
             self.add_embedding = TextImageTimeEmbedding(
                 text_embed_dim=cross_attention_dim, image_embed_dim=cross_attention_dim, time_embed_dim=time_embed_dim
             )
+        elif addition_embed_type == "text_time":
+            self.add_time_proj = Timesteps(addition_time_embed_dim, flip_sin_to_cos, freq_shift)
+            self.add_embedding = TimestepEmbedding(projection_class_embeddings_input_dim, time_embed_dim)
+
         elif addition_embed_type is not None:
             raise ValueError(f"addition_embed_type: {addition_embed_type} must be None, 'text' or 'text_image'.")
@@ -383,6 +395,9 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         if isinstance(layers_per_block, int):
             layers_per_block = [layers_per_block] * len(down_block_types)

+        if isinstance(transformer_layers_per_block, int):
+            transformer_layers_per_block = [transformer_layers_per_block] * len(down_block_types)
+
         if class_embeddings_concat:
             # The time embeddings are concatenated with the class embeddings. The dimension of the
             # time embeddings passed to the down, middle, and up blocks is twice the dimension of the
@@ -401,6 +416,7 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
             down_block = get_down_block(
                 down_block_type,
                 num_layers=layers_per_block[i],
+                transformer_layers_per_block=transformer_layers_per_block[i],
                 in_channels=input_channel,
                 out_channels=output_channel,
                 temb_channels=blocks_time_embed_dim,
@@ -426,6 +442,7 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         # mid
         if mid_block_type == "UNetMidBlock2DCrossAttn":
             self.mid_block = UNetMidBlock2DCrossAttn(
+                transformer_layers_per_block=transformer_layers_per_block[-1],
                 in_channels=block_out_channels[-1],
                 temb_channels=blocks_time_embed_dim,
                 resnet_eps=norm_eps,
@@ -467,6 +484,7 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         reversed_num_attention_heads = list(reversed(num_attention_heads))
         reversed_layers_per_block = list(reversed(layers_per_block))
         reversed_cross_attention_dim = list(reversed(cross_attention_dim))
+        reversed_transformer_layers_per_block = list(reversed(transformer_layers_per_block))
         only_cross_attention = list(reversed(only_cross_attention))

         output_channel = reversed_block_out_channels[0]
@@ -487,6 +505,7 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
             up_block = get_up_block(
                 up_block_type,
                 num_layers=reversed_layers_per_block[i] + 1,
+                transformer_layers_per_block=reversed_transformer_layers_per_block[i],
                 in_channels=input_channel,
                 out_channels=output_channel,
                 prev_output_channel=prev_output_channel,
@@ -693,6 +712,9 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
                 tuple.
             cross_attention_kwargs (`dict`, *optional*):
                 A kwargs dictionary that if specified is passed along to the [`AttnProcessor`].
+            added_cond_kwargs: (`dict`, *optional*):
+                A kwargs dictionary containin additional embeddings that if specified are added to the embeddings that
+                are passed along to the UNet blocks.

         Returns:
             [`~models.unet_2d_condition.UNet2DConditionOutput`] or `tuple`:
@@ -763,6 +785,7 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
             t_emb = t_emb.to(dtype=sample.dtype)

         emb = self.time_embedding(t_emb, timestep_cond)
+        aug_emb = None

         if self.class_embedding is not None:
             if class_labels is None:
@@ -784,7 +807,6 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
         if self.config.addition_embed_type == "text":
             aug_emb = self.add_embedding(encoder_hidden_states)
-            emb = emb + aug_emb
         elif self.config.addition_embed_type == "text_image":
             # Kadinsky 2.1 - style
             if "image_embeds" not in added_cond_kwargs:
@@ -796,7 +818,25 @@ class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
             text_embs = added_cond_kwargs.get("text_embeds", encoder_hidden_states)
             aug_emb = self.add_embedding(text_embs, image_embs)
-            emb = emb + aug_emb
+        elif self.config.addition_embed_type == "text_time":
+            if "text_embeds" not in added_cond_kwargs:
+                raise ValueError(
+                    f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `text_embeds` to be passed in `added_cond_kwargs`"
+                )
+            text_embeds = added_cond_kwargs.get("text_embeds")
+            if "time_ids" not in added_cond_kwargs:
+                raise ValueError(
+                    f"{self.__class__} has the config param `addition_embed_type` set to 'text_time' which requires the keyword argument `time_ids` to be passed in `added_cond_kwargs`"
+                )
+            time_ids = added_cond_kwargs.get("time_ids")
+            time_embeds = self.add_time_proj(time_ids.flatten())
+            time_embeds = time_embeds.reshape((text_embeds.shape[0], -1))
+
+            add_embeds = torch.concat([text_embeds, time_embeds], dim=-1)
+            add_embeds = add_embeds.to(emb.dtype)
+            aug_emb = self.add_embedding(add_embeds)
+
+        emb = emb + aug_emb if aug_emb is not None else emb

         if self.time_embed_act is not None:
             emb = self.time_embed_act(emb)
...
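The new `text_time` branch expects `added_cond_kwargs` with pooled text embeddings and time/size ids. A rough sketch of what a caller passes is shown below; the shapes are illustrative assumptions modeled on SD-XL, not taken from this diff.

```python
import torch

batch_size = 2

# Pooled text embeddings from the second text encoder, plus the
# (original_size, crop_coords, target_size) ids used by SD-XL.
added_cond_kwargs = {
    "text_embeds": torch.randn(batch_size, 1280),
    "time_ids": torch.tensor(
        [[1024, 1024, 0, 0, 1024, 1024]] * batch_size, dtype=torch.float32
    ),
}

# With a UNet configured with addition_embed_type="text_time", these would be
# forwarded as:
# noise_pred = unet(
#     latents, timestep, encoder_hidden_states,
#     added_cond_kwargs=added_cond_kwargs,
# ).sample
```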
 from ..utils import (
     OptionalDependencyNotAvailable,
     is_flax_available,
+    is_invisible_watermark_available,
     is_k_diffusion_available,
     is_librosa_available,
     is_note_seq_available,
@@ -101,6 +102,15 @@ else:
     )
     from .vq_diffusion import VQDiffusionPipeline

+try:
+    if not (is_torch_available() and is_transformers_available() and is_invisible_watermark_available()):
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    from ..utils.dummy_torch_and_transformers_and_invisible_watermark_objects import *  # noqa F403
+else:
+    from .stable_diffusion_xl import StableDiffusionXLImg2ImgPipeline, StableDiffusionXLPipeline
+
 try:
     if not is_onnx_available():
         raise OptionalDependencyNotAvailable()
...
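A hedged sketch of how downstream code can guard the new imports with the availability helper added above:

```python
from diffusers.utils import is_invisible_watermark_available

if is_invisible_watermark_available():
    # The SD-XL pipelines are only importable when invisible-watermark is installed.
    from diffusers import StableDiffusionXLPipeline
else:
    print("Install `invisible-watermark` (plus torch and transformers) to use the SD-XL pipelines.")
```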
@@ -233,7 +233,10 @@ def create_unet_diffusers_config(original_config, image_size: int, controlnet=False):
     if controlnet:
         unet_params = original_config.model.params.control_stage_config.params
     else:
-        unet_params = original_config.model.params.unet_config.params
+        if original_config.model.params.unet_config is not None:
+            unet_params = original_config.model.params.unet_config.params
+        else:
+            unet_params = original_config.model.params.network_config.params

     vae_params = original_config.model.params.first_stage_config.params.ddconfig
@@ -253,6 +256,15 @@ def create_unet_diffusers_config(original_config, image_size: int, controlnet=False):
         up_block_types.append(block_type)
         resolution //= 2

+    if unet_params.transformer_depth is not None:
+        transformer_layers_per_block = (
+            unet_params.transformer_depth
+            if isinstance(unet_params.transformer_depth, int)
+            else list(unet_params.transformer_depth)
+        )
+    else:
+        transformer_layers_per_block = 1
+
     vae_scale_factor = 2 ** (len(vae_params.ch_mult) - 1)

     head_dim = unet_params.num_heads if "num_heads" in unet_params else None
@@ -262,14 +274,28 @@ def create_unet_diffusers_config(original_config, image_size: int, controlnet=False):
     if use_linear_projection:
         # stable diffusion 2-base-512 and 2-768
         if head_dim is None:
-            head_dim = [5, 10, 20, 20]
+            head_dim_mult = unet_params.model_channels // unet_params.num_head_channels
+            head_dim = [head_dim_mult * c for c in list(unet_params.channel_mult)]

     class_embed_type = None
+    addition_embed_type = None
+    addition_time_embed_dim = None
     projection_class_embeddings_input_dim = None

+    context_dim = None
+
+    if unet_params.context_dim is not None:
+        context_dim = (
+            unet_params.context_dim if isinstance(unet_params.context_dim, int) else unet_params.context_dim[0]
+        )
+
     if "num_classes" in unet_params:
         if unet_params.num_classes == "sequential":
-            class_embed_type = "projection"
+            if context_dim in [2048, 1280]:
+                # SDXL
+                addition_embed_type = "text_time"
+                addition_time_embed_dim = 256
+            else:
+                class_embed_type = "projection"
             assert "adm_in_channels" in unet_params
             projection_class_embeddings_input_dim = unet_params.adm_in_channels
         else:
@@ -281,11 +307,14 @@ def create_unet_diffusers_config(original_config, image_size: int, controlnet=False):
         "down_block_types": tuple(down_block_types),
         "block_out_channels": tuple(block_out_channels),
         "layers_per_block": unet_params.num_res_blocks,
-        "cross_attention_dim": unet_params.context_dim,
+        "cross_attention_dim": context_dim,
         "attention_head_dim": head_dim,
         "use_linear_projection": use_linear_projection,
         "class_embed_type": class_embed_type,
+        "addition_embed_type": addition_embed_type,
+        "addition_time_embed_dim": addition_time_embed_dim,
         "projection_class_embeddings_input_dim": projection_class_embeddings_input_dim,
+        "transformer_layers_per_block": transformer_layers_per_block,
     }

     if controlnet:
@@ -400,6 +429,12 @@ def convert_ldm_unet_checkpoint(
         else:
             raise NotImplementedError(f"Not implemented `class_embed_type`: {config['class_embed_type']}")

+    if config["addition_embed_type"] == "text_time":
+        new_checkpoint["add_embedding.linear_1.weight"] = unet_state_dict["label_emb.0.0.weight"]
+        new_checkpoint["add_embedding.linear_1.bias"] = unet_state_dict["label_emb.0.0.bias"]
+        new_checkpoint["add_embedding.linear_2.weight"] = unet_state_dict["label_emb.0.2.weight"]
+        new_checkpoint["add_embedding.linear_2.bias"] = unet_state_dict["label_emb.0.2.bias"]
+
     new_checkpoint["conv_in.weight"] = unet_state_dict["input_blocks.0.0.weight"]
     new_checkpoint["conv_in.bias"] = unet_state_dict["input_blocks.0.0.bias"]
@@ -745,9 +780,12 @@ def convert_ldm_clip_checkpoint(checkpoint, local_files_only=False, text_encoder=None):

     text_model_dict = {}

+    remove_prefixes = ["cond_stage_model.transformer", "conditioner.embedders.0.transformer"]
+
     for key in keys:
-        if key.startswith("cond_stage_model.transformer"):
-            text_model_dict[key[len("cond_stage_model.transformer.") :]] = checkpoint[key]
+        for prefix in remove_prefixes:
+            if key.startswith(prefix):
+                text_model_dict[key[len(prefix + ".") :]] = checkpoint[key]

     text_model.load_state_dict(text_model_dict)
@@ -755,10 +793,11 @@ def convert_ldm_clip_checkpoint(checkpoint, local_files_only=False, text_encoder=None):

 textenc_conversion_lst = [
-    ("cond_stage_model.model.positional_embedding", "text_model.embeddings.position_embedding.weight"),
-    ("cond_stage_model.model.token_embedding.weight", "text_model.embeddings.token_embedding.weight"),
-    ("cond_stage_model.model.ln_final.weight", "text_model.final_layer_norm.weight"),
-    ("cond_stage_model.model.ln_final.bias", "text_model.final_layer_norm.bias"),
+    ("positional_embedding", "text_model.embeddings.position_embedding.weight"),
+    ("token_embedding.weight", "text_model.embeddings.token_embedding.weight"),
+    ("ln_final.weight", "text_model.final_layer_norm.weight"),
+    ("ln_final.bias", "text_model.final_layer_norm.bias"),
+    ("text_projection", "text_projection.weight"),
 ]
 textenc_conversion_map = {x[0]: x[1] for x in textenc_conversion_lst}
@@ -845,27 +884,36 @@ def convert_paint_by_example_checkpoint(checkpoint):
     return model

-def convert_open_clip_checkpoint(checkpoint):
-    text_model = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2", subfolder="text_encoder")
+def convert_open_clip_checkpoint(checkpoint, prefix="cond_stage_model.model."):
+    # text_model = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2", subfolder="text_encoder")
+    text_model = CLIPTextModelWithProjection.from_pretrained(
+        "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k", projection_dim=1280
+    )

     keys = list(checkpoint.keys())

     text_model_dict = {}

-    if "cond_stage_model.model.text_projection" in checkpoint:
-        d_model = int(checkpoint["cond_stage_model.model.text_projection"].shape[0])
+    if prefix + "text_projection" in checkpoint:
+        d_model = int(checkpoint[prefix + "text_projection"].shape[0])
     else:
         d_model = 1024

     text_model_dict["text_model.embeddings.position_ids"] = text_model.text_model.embeddings.get_buffer("position_ids")

     for key in keys:
-        if "resblocks.23" in key:  # Diffusers drops the final layer and only uses the penultimate layer
-            continue
-        if key in textenc_conversion_map:
-            text_model_dict[textenc_conversion_map[key]] = checkpoint[key]
-        if key.startswith("cond_stage_model.model.transformer."):
-            new_key = key[len("cond_stage_model.model.transformer.") :]
+        # if "resblocks.23" in key:  # Diffusers drops the final layer and only uses the penultimate layer
+        #     continue
+        if key[len(prefix) :] in textenc_conversion_map:
+            if key.endswith("text_projection"):
+                value = checkpoint[key].T
+            else:
+                value = checkpoint[key]
+
+            text_model_dict[textenc_conversion_map[key[len(prefix) :]]] = value
+
+        if key.startswith(prefix + "transformer."):
+            new_key = key[len(prefix + "transformer.") :]
             if new_key.endswith(".in_proj_weight"):
                 new_key = new_key[: -len(".in_proj_weight")]
                 new_key = textenc_pattern.sub(lambda m: protected[re.escape(m.group(0))], new_key)
@@ -1029,6 +1077,7 @@ def download_from_original_stable_diffusion_ckpt(
     load_safety_checker: bool = True,
     pipeline_class: DiffusionPipeline = None,
     local_files_only=False,
+    vae_path=None,
     text_encoder=None,
     tokenizer=None,
 ) -> DiffusionPipeline:
@@ -1096,6 +1145,8 @@ def download_from_original_stable_diffusion_ckpt(
         PaintByExamplePipeline,
         StableDiffusionControlNetPipeline,
         StableDiffusionPipeline,
+        StableDiffusionXLImg2ImgPipeline,
+        StableDiffusionXLPipeline,
         StableUnCLIPImg2ImgPipeline,
         StableUnCLIPPipeline,
     )
@@ -1187,9 +1238,9 @@ def download_from_original_stable_diffusion_ckpt(
         checkpoint, original_config, checkpoint_path, image_size, upcast_attention, extract_ema
     )

-    num_train_timesteps = original_config.model.params.timesteps
-    beta_start = original_config.model.params.linear_start
-    beta_end = original_config.model.params.linear_end
+    num_train_timesteps = original_config.model.params.timesteps or 1000
+    beta_start = original_config.model.params.linear_start or 0.02
+    beta_end = original_config.model.params.linear_end or 0.085

     scheduler = DDIMScheduler(
         beta_end=beta_end,
@@ -1231,20 +1282,27 @@ def download_from_original_stable_diffusion_ckpt(
         converted_unet_checkpoint = convert_ldm_unet_checkpoint(
             checkpoint, unet_config, path=checkpoint_path, extract_ema=extract_ema
         )

     unet.load_state_dict(converted_unet_checkpoint)

     # Convert the VAE model.
-    vae_config = create_vae_diffusers_config(original_config, image_size=image_size)
-    converted_vae_checkpoint = convert_ldm_vae_checkpoint(checkpoint, vae_config)
-
-    vae = AutoencoderKL(**vae_config)
-    vae.load_state_dict(converted_vae_checkpoint)
+    if vae_path is None:
+        vae_config = create_vae_diffusers_config(original_config, image_size=image_size)
+        converted_vae_checkpoint = convert_ldm_vae_checkpoint(checkpoint, vae_config)
+
+        vae = AutoencoderKL(**vae_config)
+        vae.load_state_dict(converted_vae_checkpoint)
+    else:
+        vae = AutoencoderKL.from_pretrained(vae_path)

     # Convert the text model.
-    if model_type is None:
+    if model_type is None and original_config.model.params.cond_stage_config is not None:
         model_type = original_config.model.params.cond_stage_config.target.split(".")[-1]
         logger.debug(f"no `model_type` given, `model_type` inferred as: {model_type}")
+    elif model_type is None and original_config.model.params.network_config is not None:
+        if original_config.model.params.network_config.params.context_dim == 2048:
+            model_type = "SDXL"
+        else:
+            model_type = "SDXL-Refiner"

     if model_type == "FrozenOpenCLIPEmbedder":
         text_model = convert_open_clip_checkpoint(checkpoint)
@@ -1375,6 +1433,40 @@ def download_from_original_stable_diffusion_ckpt(
             safety_checker=safety_checker,
             feature_extractor=feature_extractor,
         )
+    elif model_type in ["SDXL", "SDXL-Refiner"]:
+        if model_type == "SDXL":
+            tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
+            text_encoder = convert_ldm_clip_checkpoint(checkpoint, local_files_only=local_files_only)
+            tokenizer_2 = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k", pad_token="!")
+            text_encoder_2 = convert_open_clip_checkpoint(checkpoint, prefix="conditioner.embedders.1.model.")
+
+            pipe = StableDiffusionXLPipeline(
+                vae=vae,
+                text_encoder=text_encoder,
+                tokenizer=tokenizer,
+                text_encoder_2=text_encoder_2,
+                tokenizer_2=tokenizer_2,
+                unet=unet,
+                scheduler=scheduler,
+                force_zeros_for_empty_prompt=True,
+            )
+        else:
+            tokenizer = None
+            text_encoder = None
+            tokenizer_2 = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k", pad_token="!")
+            text_encoder_2 = convert_open_clip_checkpoint(checkpoint, prefix="conditioner.embedders.0.model.")
+
+            pipe = StableDiffusionXLImg2ImgPipeline(
+                vae=vae,
+                text_encoder=text_encoder,
+                tokenizer=tokenizer,
+                text_encoder_2=text_encoder_2,
+                tokenizer_2=tokenizer_2,
+                unet=unet,
+                scheduler=scheduler,
+                requires_aesthetics_score=True,
+                force_zeros_for_empty_prompt=False,
+            )
     else:
         text_config = create_ldm_bert_config(original_config)
         text_model = convert_ldm_bert_checkpoint(checkpoint, text_config)
...
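For illustration, the updated conversion entry point might be driven as in the sketch below; the checkpoint and config paths are placeholders, and `model_type` is inferred from the original config (`context_dim == 2048` selects "SDXL", otherwise "SDXL-Refiner").

```python
from diffusers.pipelines.stable_diffusion.convert_from_ckpt import (
    download_from_original_stable_diffusion_ckpt,
)

pipe = download_from_original_stable_diffusion_ckpt(
    checkpoint_path="sd_xl_base_0.9.safetensors",  # placeholder path
    original_config_file="sd_xl_base.yaml",        # placeholder path
    from_safetensors=True,
    vae_path=None,  # or a local path / hub id of an already converted VAE
)
pipe.save_pretrained("./sdxl-base-diffusers")
```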
@@ -24,7 +24,12 @@ from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer
 from ...image_processor import VaeImageProcessor
 from ...loaders import LoraLoaderMixin, TextualInversionLoaderMixin
 from ...models import AutoencoderKL, UNet2DConditionModel
-from ...models.attention_processor import AttnProcessor2_0, LoRAXFormersAttnProcessor, XFormersAttnProcessor
+from ...models.attention_processor import (
+    AttnProcessor2_0,
+    LoRAAttnProcessor2_0,
+    LoRAXFormersAttnProcessor,
+    XFormersAttnProcessor,
+)
 from ...schedulers import DDPMScheduler, KarrasDiffusionSchedulers
 from ...utils import deprecate, is_accelerate_available, is_accelerate_version, logging, randn_tensor
 from ..pipeline_utils import DiffusionPipeline
@@ -747,6 +752,7 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi
             AttnProcessor2_0,
             XFormersAttnProcessor,
             LoRAXFormersAttnProcessor,
+            LoRAAttnProcessor2_0,
         ]
         # if xformers or torch_2_0 is used attention block does not need
         # to be in float32 which can save lots of memory
...
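For context, the list extended here feeds a dtype check of roughly the following shape; the toy VAE below is only for illustration (real pipelines use their own `self.vae` and latents).

```python
import torch
from diffusers import AutoencoderKL
from diffusers.models.attention_processor import (
    AttnProcessor2_0,
    LoRAAttnProcessor2_0,
    LoRAXFormersAttnProcessor,
    XFormersAttnProcessor,
)

# Small stand-in VAE and half-precision latents, purely for demonstration.
vae = AutoencoderKL(block_out_channels=(32,), norm_num_groups=32)
latents = torch.randn(1, 4, 8, 8, dtype=torch.float16)

# If a memory-efficient attention processor is in use, parts of the VAE
# can stay in the half-precision latent dtype instead of being upcast.
use_torch_2_0_or_xformers = isinstance(
    vae.decoder.mid_block.attentions[0].processor,
    (AttnProcessor2_0, XFormersAttnProcessor, LoRAXFormersAttnProcessor, LoRAAttnProcessor2_0),
)
if use_torch_2_0_or_xformers:
    vae.post_quant_conv.to(latents.dtype)
    vae.decoder.conv_in.to(latents.dtype)
    vae.decoder.mid_block.to(latents.dtype)
```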
from dataclasses import dataclass
from typing import List, Optional, Union
import numpy as np
import PIL
from ...utils import BaseOutput, is_invisible_watermark_available, is_torch_available, is_transformers_available
@dataclass
# Copied from diffusers.pipelines.stable_diffusion.__init__.StableDiffusionPipelineOutput with StableDiffusion->StableDiffusionXL
class StableDiffusionXLPipelineOutput(BaseOutput):
"""
Output class for Stable Diffusion pipelines.
Args:
images (`List[PIL.Image.Image]` or `np.ndarray`)
List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width,
num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline.
nsfw_content_detected (`List[bool]`)
List of flags denoting whether the corresponding generated image likely represents "not-safe-for-work"
(nsfw) content, or `None` if safety checking could not be performed.
"""
images: Union[List[PIL.Image.Image], np.ndarray]
nsfw_content_detected: Optional[List[bool]]
if is_transformers_available() and is_torch_available() and is_invisible_watermark_available():
from .pipeline_stable_diffusion_xl import StableDiffusionXLPipeline
from .pipeline_stable_diffusion_xl_img2img import StableDiffusionXLImg2ImgPipeline
import numpy as np
import torch
from imwatermark import WatermarkEncoder
# Copied from https://github.com/Stability-AI/generative-models/blob/613af104c6b85184091d42d374fef420eddb356d/scripts/demo/streamlit_helpers.py#L66
WATERMARK_MESSAGE = 0b101100111110110010010000011110111011000110011110
# bin(x)[2:] gives bits of x as str, use int to convert them to 0/1
WATERMARK_BITS = [int(bit) for bit in bin(WATERMARK_MESSAGE)[2:]]
class StableDiffusionXLWatermarker:
def __init__(self):
self.watermark = WATERMARK_BITS
self.encoder = WatermarkEncoder()
self.encoder.set_watermark("bits", self.watermark)
def apply_watermark(self, images: torch.FloatTensor):
# can't encode images that are smaller than 256
if images.shape[-1] < 256:
return images
images = (255 * (images / 2 + 0.5)).cpu().permute(0, 2, 3, 1).float().numpy()
images = [self.encoder.encode(image, "dwtDct") for image in images]
images = torch.from_numpy(np.array(images)).permute(0, 3, 1, 2)
images = torch.clamp(2 * (images / 255 - 0.5), min=-1.0, max=1.0)
return images
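A short usage sketch for the watermarker above; the import path assumes the class lives in the new `stable_diffusion_xl` pipeline module, and the random tensor stands in for a batch of decoded images in [-1, 1].

```python
import torch
from diffusers.pipelines.stable_diffusion_xl.watermark import StableDiffusionXLWatermarker

watermarker = StableDiffusionXLWatermarker()

# Two fake 512x512 RGB images in [-1, 1]; anything narrower than 256 px is returned unchanged.
images = torch.rand(2, 3, 512, 512) * 2 - 1
watermarked = watermarker.apply_watermark(images)
print(watermarked.shape)  # torch.Size([2, 3, 512, 512])
```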