# Stable Diffusion text-to-image fine-tuning

The `train_text_to_image.py` script shows how to fine-tune the Stable Diffusion model on your own dataset.

___Note___:

___This script is experimental. It fine-tunes the whole model, and the model often overfits and runs into issues like catastrophic forgetting. It's recommended to try different hyperparameters to get the best result on your dataset.___


## Running locally with PyTorch
### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:

**Important**

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

Then cd into the example folder and run:
```bash
pip install -r requirements.txt
```

And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```
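
Alternatively, you can write a default configuration without answering the interactive prompts:

```bash
# Non-interactive setup: writes a default 🤗 Accelerate config for the current machine
accelerate config default
```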

Note also that we use the PEFT library as the backend for LoRA training; make sure to have `peft>=0.6.0` installed in your environment.
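
For example, you can install or upgrade it with:

```bash
pip install -U "peft>=0.6.0"
```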

### Naruto example

You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree.

You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

Run the following command to authenticate your token:

```bash
huggingface-cli login
```

If you have already cloned the repo, then you won't need to go through these steps.

<br>

#### Hardware
With `gradient_checkpointing` and `mixed_precision`, it should be possible to fine-tune the model on a single 24GB GPU. For a higher `batch_size` and faster training, it's better to use GPUs with more than 30GB of memory.

**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___**
<!-- accelerate_snippet_start -->
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/naruto-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-naruto-model"
```
<!-- accelerate_snippet_end -->


To run on your own training files, prepare the dataset according to the format required by `datasets`. You can find the instructions for how to do that in this [document](https://huggingface.co/docs/datasets/v2.4.0/en/image_load#imagefolder-with-metadata).
If you wish to use custom loading logic, you should modify the script; we have left pointers for that in the training script.
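
As a rough sketch of the expected layout (the folder and file names here are placeholders): put your images in a folder together with a `metadata.jsonl` file whose lines map each `file_name` to a caption column such as `text` (the script's default `--caption_column`). `datasets` can then load it as an `imagefolder` dataset:

```python
# Minimal sketch: loading a local image-caption dataset with 🤗 Datasets.
# "path_to_your_dataset" contains the images plus a metadata.jsonl with lines like
# {"file_name": "0001.png", "text": "a caption"}
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="path_to_your_dataset", split="train")
print(dataset[0]["image"], dataset[0]["text"])
```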

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-naruto-model"
```

Once the training is finished, the model will be saved in the `output_dir` specified in the command. In this example it's `sd-naruto-model`. To load the fine-tuned model for inference, just pass that path to `StableDiffusionPipeline`:

```python
import torch
from diffusers import StableDiffusionPipeline

model_path = "path_to_saved_model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(prompt="yoda").images[0]
image.save("yoda-naruto.png")
```

Checkpoints only save the UNet, so to run inference from a checkpoint, just load the UNet:

```python
import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

model_path = "path_to_saved_model"
unet = UNet2DConditionModel.from_pretrained(model_path + "/checkpoint-<N>/unet", torch_dtype=torch.float16)

pipe = StableDiffusionPipeline.from_pretrained("<initial model>", unet=unet, torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(prompt="yoda").images[0]
image.save("yoda-naruto.png")
```

#### Training with multiple GPUs

`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
for running distributed training with `accelerate`. Here is an example command:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/naruto-blip-captions"

accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-naruto-model"
```


#### Training with Min-SNR weighting

We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://huggingface.co/papers/2303.09556) which helps to achieve faster convergence
by rebalancing the loss. In order to use it, one needs to set the `--snr_gamma` argument. The recommended
value when using it is 5.0.
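
For example, building on the Naruto fine-tuning command above (reusing the `MODEL_NAME` and `DATASET_NAME` variables), enabling it is just a matter of adding the flag. This is a shortened sketch; the memory-saving flags from the full command are omitted for brevity:

```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --snr_gamma=5.0 \
  --output_dir="sd-naruto-model"
```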

You can find [this project on Weights and Biases](https://wandb.ai/sayakpaul/text2image-finetune-minsnr) that compares the loss surfaces of the following setups:

* Training without the Min-SNR weighting strategy
* Training with the Min-SNR weighting strategy (`snr_gamma` set to 5.0)
* Training with the Min-SNR weighting strategy (`snr_gamma` set to 1.0)

For our small Naruto dataset, the effects of the Min-SNR weighting strategy might not appear to be pronounced, but we believe they will be more noticeable for larger datasets.

Also, note that in this example, we either predict `epsilon` (i.e., the noise) or `v_prediction`. The formulation of the Min-SNR weighting strategy that we have used holds for both of these cases.

#### Training with EMA weights

Through the `EMAModel` class, we support a convenient method of tracking an exponential moving average of model parameters.  This helps to smooth out noise in model parameter updates and generally improves model performance.  If enabled with the `--use_ema` argument, the final model checkpoint that is saved at the end of training will use the EMA weights.

EMA weights require an additional full-precision copy of the model parameters to be stored in memory, but otherwise have very little performance overhead.  `--foreach_ema` can be used to further reduce the overhead.  If you are short on VRAM and still want to use EMA weights, you can store them in CPU RAM by using the `--offload_ema` argument.  This will keep the EMA weights in pinned CPU memory during the training step.  Then, once every model parameter update, it will transfer the EMA weights back to the GPU which can then update the parameters on the GPU, before sending them back to the CPU.  Both of these transfers are set up as non-blocking, so CUDA devices should be able to overlap this transfer with other computations.  With sufficient bandwidth between the host and device and a sufficiently long gap between model parameter updates, storing EMA weights in CPU RAM should have no additional performance overhead, as long as no other calls force synchronization.
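
As a sketch of how these options combine with the regular fine-tuning command (reusing the `MODEL_NAME` and `DATASET_NAME` variables from the Naruto example, with most other arguments omitted for brevity):

```bash
# Keep EMA weights in pinned CPU memory to save VRAM and use the lower-overhead foreach update
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --use_ema --offload_ema --foreach_ema \
  --resolution=512 --train_batch_size=1 \
  --max_train_steps=15000 --learning_rate=1e-05 \
  --output_dir="sd-naruto-model"
```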

#### Training with DREAM

We support training epsilon (noise) prediction models using the [DREAM (Diffusion Rectification and Estimation-Adaptive Models) strategy](https://huggingface.co/papers/2312.00210). DREAM claims to increase model fidelity at the performance cost of an extra gradient-free UNet `forward` step in the training loop. You can turn on DREAM training by using the `--dream_training` argument. The `--dream_detail_preservation` argument controls the detail preservation variable p and defaults to 1, the value used in the paper.
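
As a sketch, turning DREAM on for the Naruto example could look like this (other arguments as in the fine-tuning command above, shortened here for brevity):

```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --dream_training --dream_detail_preservation=1.0 \
  --resolution=512 --train_batch_size=1 \
  --max_train_steps=15000 --learning_rate=1e-05 \
  --output_dir="sd-naruto-model"
```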


## Training with LoRA

Low-Rank Adaptation of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://huggingface.co/papers/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*.

In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and **only** training those newly added weights. This has a couple of advantages:

- Previous pretrained weights are kept frozen so that the model is not prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114).
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
- LoRA attention layers allow controlling the extent to which the model is adapted toward new training images via a `scale` parameter (see the sketch below).
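
As a small sketch of the last point: when trained LoRA weights are loaded for inference (see the Inference section below), the `scale` can be passed through `cross_attention_kwargs`. The value `0.5` here is just an arbitrary illustration of blending base and fine-tuned behavior:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# LoRA weights produced by train_text_to_image_lora.py (same checkpoint used later in this README)
pipe.load_lora_weights("sayakpaul/sd-model-finetuned-lora-t4")

# scale=0.0 behaves like the base model; scale=1.0 applies the full LoRA adaptation
image = pipe(
    "A naruto with green eyes and red legs.",
    cross_attention_kwargs={"scale": 0.5},
).images[0]
image.save("naruto-half-lora.png")
```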

[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.

With LoRA, it's possible to fine-tune Stable Diffusion on a custom image-caption pair dataset
on consumer GPUs like the Tesla T4 or Tesla V100.

### Training

First, you need to set up your development environment as is explained in the [installation section](#installing-the-dependencies). Make sure to set the `MODEL_NAME` and `DATASET_NAME` environment variables. Here, we will use [Stable Diffusion v1-4](https://hf.co/CompVis/stable-diffusion-v1-4) and the [Narutos dataset](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions).

**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___**

**___Note: It is quite useful to monitor the training progress by regularly generating sample images during training. [Weights and Biases](https://docs.wandb.ai/quickstart) is a nice solution to easily see generated images during training. All you need to do is to run `pip install wandb` before training to automatically log images.___**

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/naruto-blip-captions"
```

For this example we want to directly store the trained LoRA embeddings on the Hub, so
we need to be logged in and add the `--push_to_hub` flag.

```bash
huggingface-cli login
```

Now we can start training!

```bash
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="sd-naruto-model-lora" \
  --validation_prompt="cute dragon creature" --report_to="wandb"
```

The above command will also run inference as fine-tuning progresses and log the results to Weights and Biases.

**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` on consumer GPUs like the T4 or V100.___**

The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitude smaller than the original model.___**

You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).

### Inference

Once you have trained a model using the above command, inference can be done simply using the `StableDiffusionPipeline` after loading the trained LoRA weights. You need to pass the `output_dir` for loading the LoRA weights, which in this case is `sd-naruto-model-lora`.

```python
from diffusers import StableDiffusionPipeline
import torch

model_path = "sayakpaul/sd-model-finetuned-lora-t4"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe.unet.load_attn_procs(model_path)
pipe.to("cuda")

prompt = "A naruto with green eyes and red legs."
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("naruto.png")
```

If you are loading the LoRA parameters from the Hub and if the Hub repository has
a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then
you can do:

```py
from huggingface_hub.repocard import RepoCard

lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4"
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]

pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
...
```

## Training with Flax/JAX

For faster training on TPUs and GPUs, you can leverage the Flax training example. Follow the instructions above to get the model and dataset before running the script.

**___Note: The Flax example doesn't yet support features like gradient checkpointing or gradient accumulation, so to use Flax for faster training we will need >30GB cards or a TPU v3.___**


Before running the scripts, make sure to install the library's training dependencies:

```bash
pip install -U -r requirements_flax.txt
```

```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export DATASET_NAME="lambdalabs/naruto-blip-captions"

python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-naruto-model"
```

To run on your own training files, prepare the dataset according to the format required by `datasets`. You can find the instructions for how to do that in this [document](https://huggingface.co/docs/datasets/v2.4.0/en/image_load#imagefolder-with-metadata).
If you wish to use custom loading logic, you should modify the script; we have left pointers for that in the training script.

```bash
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export TRAIN_DIR="path_to_your_dataset"

python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-naruto-model"
```

### Training with xFormers

You can enable memory efficient attention by [installing xFormers](https://huggingface.co/docs/diffusers/main/en/optimization/xformers) and passing the `--enable_xformers_memory_efficient_attention` argument to the script.

xFormers training is not available for Flax/JAX.

**Note**:

According to [this issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training in some GPUs. If you observe that problem, please install a development version as indicated in that comment.
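
As a sketch, the flag simply gets appended to the training command (reusing the `MODEL_NAME` and `DATASET_NAME` variables from the Naruto example, other arguments shortened for brevity):

```bash
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --train_batch_size=1 \
  --max_train_steps=15000 --learning_rate=1e-05 \
  --output_dir="sd-naruto-model"
```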

## Stable Diffusion XL

* We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md).
* We also support fine-tuning of the UNet and Text Encoder shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with LoRA via the `train_text_to_image_lora_sdxl.py` script. Please refer to the docs [here](./README_sdxl.md).