<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Text-to-image

<Tip warning={true}>

The text-to-image script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset.

</Tip>

Text-to-image models like Stable Diffusion are conditioned to generate images given a text prompt.

Training a model can be taxing on your hardware, but if you enable `gradient_checkpointing` and `mixed_precision`, it is possible to train a model on a single 24GB GPU. If you're training with larger batch sizes or want to train faster, it's better to use GPUs with more than 30GB of memory. You can reduce your memory footprint by enabling memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing, gradient accumulation or xFormers. A GPU with at least 30GB of memory or a TPU v3 is recommended for training with Flax.
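
For example, on a memory-constrained GPU you might combine these options in the training command (shown here in isolation; the full commands later in this guide include the remaining required arguments):

```bash
accelerate launch train_text_to_image.py \
  --mixed_precision="fp16" \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention
```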

This guide will explore the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) training script to help you become familiar with it, and how you can adapt it for your own use-case.

Before running the script, make sure you install the library from source:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:

<hfoptions id="installation">
<hfoption id="PyTorch">
```bash
cd examples/text_to_image
pip install -r requirements.txt
```
</hfoption>
<hfoption id="Flax">
```bash
cd examples/text_to_image
pip install -r requirements_flax.txt
```
</hfoption>
</hfoptions>

<Tip>

🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.

</Tip>

Initialize an 🤗 Accelerate environment:

```bash
accelerate config
```

To set up a default 🤗 Accelerate environment without choosing any configurations:

```bash
accelerate config default
```

Or if your environment doesn't support an interactive shell, like a notebook, you can use:

```py
from accelerate.utils import write_basic_config

write_basic_config()
```

Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.

## Script parameters

<Tip>

The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) and let us know if you have any questions or concerns.

</Tip>

The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L193) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.

For example, to speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:

```bash
accelerate launch train_text_to_image.py \
  --mixed_precision="fp16"
```

Some basic and important parameters include:

- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
- `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on
- `--image_column`: the name of the image column in the dataset to train on
- `--caption_column`: the name of the text column in the dataset to train on
- `--output_dir`: where to save the trained model
- `--push_to_hub`: whether to push the trained model to the Hub
- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful in case training is interrupted, since you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command, as shown in the example after this list
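
For example, here's a hypothetical command that combines several of these parameters to resume an interrupted run; the column names shown match the script defaults for the Naruto BLIP captions dataset, and `--resume_from_checkpoint="latest"` picks up the most recent checkpoint saved in `--output_dir`:

```bash
accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --image_column="image" \
  --caption_column="text" \
  --output_dir="sd-naruto-model" \
  --checkpointing_steps=500 \
  --resume_from_checkpoint="latest" \
  --push_to_hub
```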

### Min-SNR weighting

The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting either `epsilon` (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.

Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:

```bash
accelerate launch train_text_to_image.py \
  --snr_gamma=5.0
```
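
Conceptually, this weighting clamps each timestep's signal-to-noise ratio (SNR) at `snr_gamma` before using it to rescale the per-sample loss. The snippet below is a simplified sketch for the `epsilon` prediction case only, not the script's exact code; it assumes a per-timestep `snr` tensor has already been computed from the scheduler:

```py
import torch
import torch.nn.functional as F


def min_snr_weighted_loss(model_pred, noise, snr, snr_gamma=5.0):
    # Clamp the per-timestep SNR at snr_gamma and rescale the elementwise MSE loss per sample
    mse_loss_weights = torch.clamp(snr, max=snr_gamma) / snr
    loss = F.mse_loss(model_pred.float(), noise.float(), reduction="none")
    loss = loss.mean(dim=list(range(1, len(loss.shape)))) * mse_loss_weights
    return loss.mean()
```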

You can compare the loss surfaces for different `snr_gamma` values in this [Weights and Biases](https://wandb.ai/sayakpaul/text2image-finetune-minsnr) report. For smaller datasets, the effects of Min-SNR may not be as obvious compared to larger datasets.

## Training script

The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L490) function. If you need to adapt the training script, this is where you'll need to make your changes.

The `train_text_to_image` script starts by [loading a scheduler](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L543) and tokenizer. You can choose to use a different scheduler here if you want:

```py
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision
)
```
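
For example, a minimal sketch of swapping in a different scheduler; `DDIMScheduler` is used here purely for illustration, and whether another scheduler fits your training setup is up to you:

```py
from diffusers import DDIMScheduler

noise_scheduler = DDIMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
```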

Then the script [loads the UNet](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L619) model:

```py
load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
model.register_to_config(**load_model.config)

model.load_state_dict(load_model.state_dict())
```

Next, the text and image columns of the dataset need to be preprocessed. The [`tokenize_captions`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L724) function handles tokenizing the inputs, and the [`train_transforms`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L742) function specifies the type of transforms to apply to the image. Both of these functions are bundled into `preprocess_train`:

```py
def preprocess_train(examples):
    images = [image.convert("RGB") for image in examples[image_column]]
    examples["pixel_values"] = [train_transforms(image) for image in images]
    examples["input_ids"] = tokenize_captions(examples)
    return examples
```
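
For reference, `train_transforms` is built from standard torchvision transforms controlled by the `--resolution`, `--center_crop`, and `--random_flip` arguments. The following is a simplified sketch rather than the script verbatim, with `args` holding the parsed command-line arguments:

```py
from torchvision import transforms

# Resize, crop, optionally flip, then normalize pixel values to the [-1, 1] range the VAE expects
train_transforms = transforms.Compose(
    [
        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
        transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)
```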

Lastly, the [training loop](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L878) handles everything else. It encodes images into latent space, adds noise to the latents, computes the text embeddings to condition on, updates the model parameters, and saves and pushes the model to the Hub. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
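
The snippet below is a heavily simplified sketch of a single training step, assuming `vae`, `text_encoder`, `unet`, `noise_scheduler`, and `optimizer` have already been loaded as in the script; it shows only the default `epsilon` prediction path and leaves out EMA, Min-SNR weighting, gradient accumulation, and the Accelerate plumbing:

```py
import torch
import torch.nn.functional as F


def training_step(batch, vae, text_encoder, unet, noise_scheduler, optimizer):
    # Encode images into latent space and scale by the VAE's scaling factor
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

    # Sample noise and a random timestep for each image, then noise the latents
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Compute the text embeddings the UNet is conditioned on
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]

    # Predict the added noise and update the model with an MSE loss
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```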

## Launch the script

Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! 🚀

<hfoptions id="training-inference">
<hfoption id="PyTorch">

Let's train on the [Naruto BLIP captions](https://huggingface.co/datasets/lambdalabs/naruto-blip-captions) dataset to generate your own Naruto characters. Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path). If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command.

<Tip>

To train on a local dataset, set the `TRAIN_DIR` and `OUTPUT_DIR` environment variables to the path of the dataset and where to save the model to.

</Tip>

```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export dataset_name="lambdalabs/naruto-blip-captions"

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --enable_xformers_memory_efficient_attention \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-naruto-model" \
  --push_to_hub
```

</hfoption>
<hfoption id="Flax">

Training with Flax can be faster on TPUs and GPUs thanks to [@duongna21](https://github.com/duongna21). Flax is more efficient on a TPU, but GPU performance is also great.

Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path).

<Tip>

To train on a local dataset, set the `TRAIN_DIR` and `OUTPUT_DIR` environment variables to the path of the dataset and where to save the model to.

</Tip>

```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export dataset_name="lambdalabs/naruto-blip-captions"

python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-naruto-model" \
  --push_to_hub
```

</hfoption>
</hfoptions>

Once training is complete, you can use your newly trained model for inference:

<hfoptions id="training-inference">
<hfoption id="PyTorch">

```py
from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

image = pipeline(prompt="yoda").images[0]
image.save("yoda-naruto.png")
```

</hfoption>
<hfoption id="Flax">

```py
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

pipeline, params = FlaxStableDiffusionPipeline.from_pretrained("path/to/saved_model", dtype=jax.numpy.bfloat16)

prompt = "yoda naruto"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
images[0].save("yoda-naruto.png")
```

</hfoption>
</hfoptions>

## Next steps

Congratulations on training your own text-to-image model! To learn more about how to use your new model, the following guides may be helpful:

- Learn how to [load LoRA weights](../using-diffusers/loading_adapters#LoRA) for inference if you trained your model with LoRA.
- Learn more about how certain parameters like guidance scale or techniques such as prompt weighting can help you control inference in the [Text-to-image](../using-diffusers/conditional_image_generation) task guide.