 <!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Textual Inversion

[Textual Inversion](https://hf.co/papers/2208.01618) is a training technique for personalizing image generation models with just a few example images of what you want it to learn. This technique works by learning and updating the text embeddings (the new embeddings are tied to a special word you must use in the prompt) to match the example images you provide.

If you're training on a GPU with limited VRAM, you should try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing or xFormers. With the same configuration and setup as PyTorch, the Flax training script should be at least ~70% faster!
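
For example, these memory-saving options are exposed as flags on the training command. The snippet below is only a sketch of where they go; combine them with the full launch command shown later in this guide, and drop any flag your hardware doesn't need:

```bash
accelerate launch textual_inversion.py \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --enable_xformers_memory_efficient_attention
```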

This guide will explore the [textual_inversion.py](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) script to help you become more familiar with it, and how you can adapt it for your own use-case.

Before running the script, make sure you install the library from source:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

Navigate to the example folder with the training script and install the required dependencies for the script you're using:

<hfoptions id="installation">
<hfoption id="PyTorch">

```bash
cd examples/textual_inversion
pip install -r requirements.txt
```

</hfoption>
<hfoption id="Flax">

```bash
cd examples/textual_inversion
pip install -r requirements_flax.txt
```

</hfoption>
</hfoptions>

<Tip>

🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.

</Tip>

Initialize an 🤗 Accelerate environment:

```bash
accelerate config
```

To set up a default 🤗 Accelerate environment without choosing any configurations:

```bash
accelerate config default
```

Or if your environment doesn't support an interactive shell, like a notebook, you can use:

```py
from accelerate.utils import write_basic_config

write_basic_config()
```

Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.

<Tip>

The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) and let us know if you have any questions or concerns.

</Tip>

## Script parameters

The training script has many parameters to help you tailor the training run to your needs. All of the parameters and their descriptions are listed in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/839c2a5ece0af4e75530cb520d77bc7ed8acf474/examples/textual_inversion/textual_inversion.py#L176) function. Where applicable, Diffusers provides default values for each parameter such as the training batch size and learning rate, but feel free to change these values in the training command if you'd like.

For example, to increase the number of gradient accumulation steps above the default value of 1:

```bash
accelerate launch textual_inversion.py \
  --gradient_accumulation_steps=4
```

Some other basic and important parameters to specify include:

- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
- `--train_data_dir`: path to a folder containing the training dataset (example images)
- `--output_dir`: where to save the trained model
- `--push_to_hub`: whether to push the trained model to the Hub
- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful because if training is interrupted for any reason, you can resume it from that checkpoint by adding `--resume_from_checkpoint` to your training command
- `--num_vectors`: the number of vectors to learn the embeddings with; increasing this parameter helps the model learn better but it comes with increased training costs
- `--placeholder_token`: the special word to tie the learned embeddings to (you must use the word in your prompt for inference)
- `--initializer_token`: a single word that roughly describes the object or style you're trying to train on
- `--learnable_property`: whether you're training the model to learn a new "style" (for example, Van Gogh's painting style) or "object" (for example, your dog)
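
For example, to learn the concept with two embedding vectors, save a checkpoint every 500 steps, and resume from the latest checkpoint after an interruption, you could add the following flags to the training command (the values here are only illustrative):

```bash
accelerate launch textual_inversion.py \
  --num_vectors=2 \
  --checkpointing_steps=500 \
  --resume_from_checkpoint="latest"
```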

## Training script

Unlike some of the other training scripts, textual_inversion.py has a custom dataset class, [`TextualInversionDataset`](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L487) for creating a dataset. You can customize the image size, placeholder token, interpolation method, whether to crop the image, and more. If you need to change how the dataset is created, you can modify `TextualInversionDataset`.

Next, you'll find the dataset preprocessing code and training loop in the [`main()`](https://github.com/huggingface/diffusers/blob/839c2a5ece0af4e75530cb520d77bc7ed8acf474/examples/textual_inversion/textual_inversion.py#L573) function.

The script starts by loading the [tokenizer](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L616), [scheduler and model](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L622):

```py
# Load tokenizer
if args.tokenizer_name:
    tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name)
elif args.pretrained_model_name_or_path:
    tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")

# Load scheduler and models
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
text_encoder = CLIPTextModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
)
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision)
unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision
)
```

Next, the special [placeholder token](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L632) is added to the tokenizer, and the embedding is resized to account for the new token.
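
Conceptually, this step registers the new token with the tokenizer, grows the text encoder's embedding matrix to make room for it, and initializes the new row from the initializer token's embedding. The following is a simplified sketch of that idea, reusing the `tokenizer` and `text_encoder` loaded above; it is not the script's exact code:

```py
import torch

# register the placeholder token and grow the embedding matrix to make room for it
tokenizer.add_tokens(["<cat-toy>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# look up the ids of the new placeholder token and the single-token initializer ("toy")
placeholder_id = tokenizer.convert_tokens_to_ids("<cat-toy>")
initializer_id = tokenizer.encode("toy", add_special_tokens=False)[0]

# start the new embedding from the initializer token's embedding
token_embeds = text_encoder.get_input_embeddings().weight.data
with torch.no_grad():
    token_embeds[placeholder_id] = token_embeds[initializer_id].clone()
```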

Then, the script [creates a dataset](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L716) from the `TextualInversionDataset`:

```py
train_dataset = TextualInversionDataset(
    data_root=args.train_data_dir,
    tokenizer=tokenizer,
    size=args.resolution,
    placeholder_token=(" ".join(tokenizer.convert_ids_to_tokens(placeholder_token_ids))),
    repeats=args.repeats,
    learnable_property=args.learnable_property,
    center_crop=args.center_crop,
    set="train",
)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.dataloader_num_workers
)
```

Finally, the [training loop](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L784) handles everything else from predicting the noisy residual to updating the embedding weights of the special placeholder token.
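
A detail worth calling out: the whole embedding matrix receives gradients, but only the placeholder token's row is allowed to change; after each optimizer step, every other row is restored from a copy taken before training. The sketch below illustrates that idea in simplified form (the loss helper is hypothetical, standing in for the script's noise-prediction loss):

```py
import torch

# snapshot of the embedding matrix before training starts
orig_embeds = text_encoder.get_input_embeddings().weight.data.clone()

# mask that is True for every token except the placeholder token(s)
keep_original = torch.ones((len(tokenizer),), dtype=torch.bool)
keep_original[placeholder_token_ids] = False

for batch in train_dataloader:
    loss = compute_denoising_loss(batch)  # hypothetical helper: predict the added noise and compare it to the target
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # undo any drift in every embedding row except the placeholder token's
    with torch.no_grad():
        text_encoder.get_input_embeddings().weight[keep_original] = orig_embeds[keep_original]
```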

If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.

## Launch the script

Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! 🚀

For this guide, you'll download some images of a [cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. But remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).

```py
from huggingface_hub import snapshot_download

local_dir = "./cat"
snapshot_download(
    "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
)
```

Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, and `DATA_DIR` to the path where you just downloaded the cat images. The script creates and saves the following files to your repository:

- `learned_embeds.bin`: the learned embedding vectors corresponding to your example images
- `token_identifier.txt`: the special placeholder token
- `type_of_concept.txt`: the type of concept you're training on (either "object" or "style")

<Tip warning={true}>

A full training run takes ~1 hour on a single V100 GPU.

</Tip>

One more thing before you launch the script. If you're interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command:

```bash
--validation_prompt="A <cat-toy> train"
--num_validation_images=4
--validation_steps=100
```

<hfoptions id="training-inference">
<hfoption id="PyTorch">

```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="./cat"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
```

</hfoption>
<hfoption id="Flax">

```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export DATA_DIR="./cat"

python textual_inversion_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
```

</hfoption>
</hfoptions>

After training is complete, you can use your newly trained model for inference like:

<hfoptions id="training-inference">
<hfoption id="PyTorch">

```py
from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipeline.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipeline("A <cat-toy> train", num_inference_steps=50).images[0]
image.save("cat-train.png")
```
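
The example above loads a community-trained concept from the Hub. To use the embedding you just trained instead, you can point [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] at your local output directory; a minimal sketch, assuming the `textual_inversion_cat` output directory from the command above:

```py
pipeline.load_textual_inversion("textual_inversion_cat")
image = pipeline("A <cat-toy> train", num_inference_steps=50).images[0]
```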

</hfoption>
<hfoption id="Flax">

Flax doesn't support the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] method, but the textual_inversion_flax.py script [saves](https://github.com/huggingface/diffusers/blob/c0f058265161178f2a88849e92b37ffdc81f1dcc/examples/textual_inversion/textual_inversion_flax.py#L636C2-L636C2) the learned embeddings as a part of the model after training. This means you can use the model for inference like any other Flax model:

```py
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

model_path = "path-to-your-trained-model"
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)

prompt = "A <cat-toy> train"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
images[0].save("cat-train.png")
```

</hfoption>
</hfoptions>

## Next steps

Congratulations on training your own Textual Inversion model! 🎉 To learn more about how to use your new model, the following guides may be helpful:

- Learn how to [load Textual Inversion embeddings](../using-diffusers/loading_adapters) and also use them as negative embeddings.
- Learn how to use [Textual Inversion](textual_inversion_inference) for inference with Stable Diffusion 1/2 and Stable Diffusion XL.