[SDXL, Docs] Textual inversion (#5039)

* [SDXL, Docs] Textual inversion
* Update docs/source/en/using-diffusers/sdxl.md
* finish
* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

abc47dec · Patrick von Platen · GitHub · 941473a1 · abc47dec · abc47dec
Unverified Commit abc47dec authored Sep 15, 2023 by Patrick von Platen Committed by GitHub Sep 15, 2023
Showing with 52 additions and 1 deletion

docs/source/en/using-diffusers/sdxl.md +3 -1

docs/source/en/using-diffusers/textual_inversion_inference.md +49 -0

--- a/docs/source/en/using-diffusers/sdxl.md
+++ b/docs/source/en/using-diffusers/sdxl.md
@@ -397,6 +397,8 @@ image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-double-prompt.png" alt="generated image of an astronaut in a jungle in the style of a van gogh painting"/>
 </div>
+The dual text encoders also support textual inversion embeddings, which need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section.
 ## Optimizations
 SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference.
@@ -426,4 +428,4 @@ SDXL is a large model, and you may need to optimize memory to get it to run on y
 ## Other resources
 If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🤗 Diffusers.
\ No newline at end of file
--- a/docs/source/en/using-diffusers/textual_inversion_inference.md
+++ b/docs/source/en/using-diffusers/textual_inversion_inference.md
@@ -28,6 +28,8 @@ from diffusers.utils import make_image_grid
 from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
 ```
+## Stable Diffusion 1 and 2
 Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer):
 ```py
@@ -69,3 +71,50 @@ grid
 <div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/textual_inversion_inference.png">
 </div>
+## Stable Diffusion XL
+Stable Diffusion XL (SDXL) can also use textual inversion vectors for inference. In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders so you'll need two textual inversion embeddings - one for each text encoder model.
+Let's download the SDXL textual inversion embeddings and have a closer look at their structure:
+```py
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors")
+state_dict = load_file(file)
+state_dict
+```
+```
+{'clip_g': tensor([[ 0.0077, -0.0112,  0.0065,  ...,  0.0195,  0.0159,  0.0275],
+         ...,
+         [-0.0170,  0.0213,  0.0143,  ..., -0.0302, -0.0240, -0.0362]],
+ 'clip_l': tensor([[ 0.0023,  0.0192,  0.0213,  ..., -0.0385,  0.0048, -0.0011],
+         ...,
+         [ 0.0475, -0.0508, -0.0145,  ...,  0.0070, -0.0089, -0.0163]],
+```
+There are two tensors, `"clip_g"` and `"clip_l"`.
+`"clip_g"` corresponds to the bigger text encoder in SDXL and refers to
+`pipe.text_encoder_2`, while `"clip_l"` refers to `pipe.text_encoder`.
+Now you can load each tensor separately by passing them along with the correct text encoder and tokenizer
+to [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`]:
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16)
+pipe.to("cuda")
+pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
+pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
+# the embedding should be used as a negative embedding, so we pass it as a negative prompt
+generator = torch.Generator().manual_seed(33)
+image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0]
+```
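
The two `load_textual_inversion` calls above differ only in which state dict key is paired with which text encoder and tokenizer. The following is a minimal sketch of that pairing, not part of this commit: `ENCODER_FOR_KEY` and `load_sdxl_textual_inversion` are hypothetical names introduced for illustration, and the helper assumes only that `pipe` exposes the attributes and the `load_textual_inversion` keyword arguments used in the example above.

```py
# Hypothetical mapping (not a diffusers API): state dict key -> the pipeline
# attributes holding the matching text encoder and tokenizer, as described above.
ENCODER_FOR_KEY = {
    "clip_l": ("text_encoder", "tokenizer"),
    "clip_g": ("text_encoder_2", "tokenizer_2"),
}


def load_sdxl_textual_inversion(pipe, state_dict, token):
    """Hypothetical helper: load both SDXL embeddings under the same token."""
    loaded = []
    for key, (encoder_attr, tokenizer_attr) in ENCODER_FOR_KEY.items():
        if key not in state_dict:
            raise KeyError(f"expected key {key!r} in the embedding file")
        # Same call as in the example above, once per text encoder.
        pipe.load_textual_inversion(
            state_dict[key],
            token=token,
            text_encoder=getattr(pipe, encoder_attr),
            tokenizer=getattr(pipe, tokenizer_attr),
        )
        loaded.append(key)
    return loaded
```

With the `pipe` and `state_dict` from the example above, `load_sdxl_textual_inversion(pipe, state_dict, "unaestheticXLv31")` would stand in for the two explicit calls. Failing on a missing key is a design choice of this sketch, so that an embedding file without both tensors is not silently loaded into only one encoder.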