<p align="center">
    <br>
    <img src="https://raw.githubusercontent.com/huggingface/diffusers/main/docs/source/imgs/diffusers_library.jpg" width="400"/>
    <br>
</p>
<p align="center">
    <a href="https://github.com/huggingface/diffusers/blob/main/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/diffusers.svg?color=blue">
    </a>
    <a href="https://github.com/huggingface/diffusers/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
    </a>
    <a href="CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
    </a>
    <a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
</p>

🤗 Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves
as a modular toolbox for inference and training of diffusion models.

More precisely, 🤗 Diffusers offers:

- State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code (see [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines)).
- Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference (see [src/diffusers/schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers)).
- Multiple types of diffusion models, such as UNet, that can be used as building blocks in an end-to-end diffusion system (see [src/diffusers/models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models)).
- Training examples to show how to train the most popular diffusion models (see [examples](https://github.com/huggingface/diffusers/tree/main/examples)).

## Definitions

**Models**: A neural network that models **p_θ(x_{t-1}|x_t)** (see image below) and is trained end-to-end to *denoise* a noisy input to an image.
*Examples*: UNet, Conditioned UNet, 3D UNet, Transformer UNet

![model_diff_1_50](https://user-images.githubusercontent.com/23423619/171610307-dab0cd8b-75da-4d4e-9f5a-5922072e2bb5.png)
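To make the model's role concrete, here is a minimal NumPy sketch of the standard noise-prediction training objective from the DDPM paper. The `model` function is a hypothetical stand-in (a real diffusers model is a trained UNet), and the schedule values and tensor shapes are illustrative assumptions, not the library's API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in "model"; in diffusers this would be a trained UNet.
def model(x_t, t):
    return np.zeros_like(x_t)  # a real model would predict the noise eps

# Commonly used linear beta schedule (values assumed for illustration).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

# One step of the eps-prediction objective: noise a clean image, then
# regress the model output onto the noise that was added.
x0 = rng.standard_normal((1, 3, 8, 8))  # a "clean" training image
t = int(rng.integers(0, 1000))          # random diffusion timestep
eps = rng.standard_normal(x0.shape)     # Gaussian noise
x_t = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * eps

loss = float(np.mean((model(x_t, t) - eps) ** 2))  # MSE to the true noise
```

Training then simply minimizes this loss over random images, timesteps and noise draws.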

**Schedulers**: Algorithm class for both **inference** and **training**.
The class provides functionality to compute the previous image according to the alpha and beta schedule, as well as to predict noise for training.
*Examples*: [DDPM](https://arxiv.org/abs/2006.11239), [DDIM](https://arxiv.org/abs/2010.02502), [PNDM](https://arxiv.org/abs/2202.09778), [DEIS](https://arxiv.org/abs/2204.13902)

![sampling](https://user-images.githubusercontent.com/23423619/171608981-3ad05953-a684-4c82-89f8-62a459147a07.png)
![training](https://user-images.githubusercontent.com/23423619/171608964-b3260cce-e6b4-4841-959d-7d8ba4b8d1b2.png)
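As a rough illustration of what a scheduler computes, the following NumPy sketch builds a linear beta schedule, applies the forward (noising) process, and evaluates one DDPM reverse-step mean. The specific values (1000 steps, betas from 1e-4 to 0.02) are the commonly used DDPM defaults, assumed here for illustration rather than read from the library:

```python
import numpy as np

# Linear beta schedule (assumed DDPM-style defaults).
num_train_timesteps = 1000
betas = np.linspace(1e-4, 0.02, num_train_timesteps)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

# Forward (noising) process at timestep t:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * eps

# Reverse-step mean, given a noise prediction eps_hat (oracle noise here):
#   mu = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
eps_hat = eps
mean_prev = (x_t - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps_hat) / np.sqrt(alphas[t])
```

A scheduler class packages exactly this kind of bookkeeping so models stay schedule-agnostic.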

**Diffusion Pipeline**: End-to-end pipeline that includes multiple diffusion models, possibly text encoders, ...
*Examples*: GLIDE, Latent-Diffusion, Imagen, DALL-E 2

![imagen](https://user-images.githubusercontent.com/23423619/171609001-c3f2c1c9-f597-4a16-9843-749bf3f9431c.png)
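Conceptually, a pipeline just wires these pieces together into a single sampling loop. A toy sketch of that idea (deliberately not the real `DiffusionPipeline` API; every name below is made up):

```python
class ToyPipeline:
    """Chains a denoising model and a scheduler step into one sampling loop."""

    def __init__(self, model, scheduler_step, num_steps):
        self.model = model                    # predicts the noise residual
        self.scheduler_step = scheduler_step  # computes x_{t-1} from x_t
        self.num_steps = num_steps

    def __call__(self, sample):
        for t in reversed(range(self.num_steps)):
            residual = self.model(sample, t)
            sample = self.scheduler_step(residual, sample, t)
        return sample

# Trivial stand-ins so the sketch runs end-to-end:
def toy_model(x, t):
    return [0.0 for _ in x]  # a real model would predict noise

def toy_scheduler_step(residual, x, t):
    return [xi - ri for xi, ri in zip(x, residual)]

pipe = ToyPipeline(toy_model, toy_scheduler_step, num_steps=10)
out = pipe([1.0, -1.0, 0.5])
```

Real pipelines such as GLIDE or Latent Diffusion add text encoders, guidance and super-resolution stages around the same basic loop.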


## Philosophy

- Readability and clarity are preferred over highly optimized code. Strong importance is placed on providing readable, intuitive and elementary code design. *E.g.*, the provided [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) are separated from the provided [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and provide well-commented code that can be read alongside the original paper.
- Diffusers is **modality independent** and focuses on providing pretrained models and tools to build systems that generate **continuous outputs**, *e.g.* vision and audio.
- Diffusion models and schedulers are provided as concise, elementary building blocks, whereas diffusion pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box, should stay as close as possible to their original implementation, and can include components of other libraries, such as text encoders. Examples of diffusion pipelines are [Glide](https://github.com/openai/glide-text2im) and [Latent Diffusion](https://github.com/CompVis/latent-diffusion).

## Quickstart

```
git clone https://github.com/huggingface/diffusers.git
cd diffusers && pip install -e .
```

### 1. `diffusers` as a central modular diffusion and sampler library

`diffusers` is more modularized than `transformers`. The idea is that researchers and engineers can easily use only parts of the library for their own use cases.
It could become a central place for all kinds of models, schedulers, training utils and processors that one can mix and match for one's own use case.
Both models and schedulers should be loadable and saveable from the Hub.
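That load/save pattern can be sketched as a plain JSON config round-trip. This is only an illustration of the idea, not the library's implementation: the file name `scheduler_config.json` and the config fields are assumptions, and real Hub components also store model weights alongside the config:

```python
import json
import os
import tempfile

# Hypothetical scheduler config; field names are made up for illustration.
config = {"class_name": "DDPMScheduler", "num_timesteps": 1000, "beta_start": 1e-4, "beta_end": 0.02}

with tempfile.TemporaryDirectory() as repo_dir:
    path = os.path.join(repo_dir, "scheduler_config.json")
    with open(path, "w") as f:
        json.dump(config, f)     # the "save" half of the round-trip
    with open(path) as f:
        reloaded = json.load(f)  # the "load" half of the round-trip

assert reloaded == config
```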

#### **Example for [DDPM](https://arxiv.org/abs/2006.11239):**

```python
import torch
from diffusers import UNetModel, DDPMScheduler
import PIL.Image
import numpy as np
import tqdm

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load models
noise_scheduler = DDPMScheduler.from_config("fusing/ddpm-lsun-church", tensor_format="pt")
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_prediction_steps = len(noise_scheduler)
for t in tqdm.tqdm(reversed(range(num_prediction_steps)), total=num_prediction_steps):
    # predict noise residual
    with torch.no_grad():
        residual = unet(image, t)

    # predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t)

    # optionally sample variance
    variance = 0
    if t > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * noise

    # set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# 5. save image
image_pil.save("test.png")
```

#### **Example for [DDIM](https://arxiv.org/abs/2010.02502):**

```python
import torch
from diffusers import UNetModel, DDIMScheduler
import PIL.Image
import numpy as np
import tqdm

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load models
noise_scheduler = DDIMScheduler.from_config("fusing/ddpm-celeba-hq", tensor_format="pt")
unet = UNetModel.from_pretrained("fusing/ddpm-celeba-hq").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_inference_steps = 50
eta = 0.0  # <- deterministic sampling

for t in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):
    # 1. predict noise residual
    orig_t = noise_scheduler.get_orig_t(t, num_inference_steps)
    with torch.no_grad():
        residual = unet(image, orig_t)

    # 2. predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t, num_inference_steps, eta)

    # 3. optionally sample variance
    variance = 0
    if eta > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * eta * noise

    # 4. set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# 5. save image
image_pil.save("test.png")
```

### 2. `diffusers` as a collection of the most important diffusion systems (GLIDE, DALL-E, ...)
The `models` directory in the repository hosts the complete code necessary for running a diffusion system as well as for training it. A `DiffusionPipeline` class allows one to easily run the diffusion model in inference:

#### **Example image generation with DDPM**

```python
from diffusers import DiffusionPipeline
import PIL.Image
import numpy as np

# load model and scheduler
ddpm = DiffusionPipeline.from_pretrained("fusing/ddpm-lsun-bedroom")

# run pipeline in inference (sample random noise and denoise)
image = ddpm()

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# save image
image_pil.save("test.png")
```

#### **Text to Image generation with Latent Diffusion**

_Note: To use latent diffusion, install transformers from [this branch](https://github.com/patil-suraj/transformers/tree/ldm-bert)._

```python
import torch
from diffusers import DiffusionPipeline
import PIL.Image
import numpy as np

ldm = DiffusionPipeline.from_pretrained("fusing/latent-diffusion-text2im-large")

generator = torch.manual_seed(42)

prompt = "A painting of a squirrel eating a burger"
image = ldm([prompt], generator=generator, eta=0.3, guidance_scale=6.0, num_inference_steps=50)

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = image_processed * 255.0
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# save image
image_pil.save("test.png")
```

#### **Text to speech with BDDM**

_Follow the instructions [here](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/) to load the Tacotron 2 model._

```python
import torch
from diffusers import BDDM, DiffusionPipeline

torch_device = "cuda"

# load the BDDM pipeline
bddm = DiffusionPipeline.from_pretrained("fusing/diffwave-vocoder-ljspeech")

# load tacotron2 to get the mel spectrograms
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to(torch_device).eval()

text = "Hello world, I missed you so much."

utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

# generate mel spectrograms from the text
with torch.no_grad():
    mel_spec, _, _ = tacotron2.infer(sequences, lengths)

# generate the speech by passing mel spectrograms to the BDDM pipeline
generator = torch.manual_seed(0)
audio = bddm(mel_spec, generator, torch_device)

# save generated audio
from scipy.io.wavfile import write as wavwrite
sampling_rate = 22050
wavwrite("generated_audio.wav", sampling_rate, audio.squeeze().cpu().numpy())
```

## Library structure:

```
├── LICENSE
├── Makefile
├── README.md
├── pyproject.toml
├── setup.cfg
├── setup.py
├── src
│   ├── diffusers
│       ├── __init__.py
│       ├── configuration_utils.py
│       ├── dependency_versions_check.py
│       ├── dependency_versions_table.py
│       ├── dynamic_modules_utils.py
│       ├── modeling_utils.py
│       ├── models
│       │   ├── __init__.py
│       │   ├── unet.py
│       │   ├── unet_glide.py
│       │   └── unet_ldm.py
│       ├── pipeline_utils.py
│       ├── pipelines
│       │   ├── __init__.py
│       │   ├── configuration_ldmbert.py
│       │   ├── conversion_glide.py
│       │   ├── modeling_vae.py
│       │   ├── pipeline_bddm.py
│       │   ├── pipeline_ddim.py
│       │   ├── pipeline_ddpm.py
│       │   ├── pipeline_glide.py
│       │   └── pipeline_latent_diffusion.py
│       ├── schedulers
│       │   ├── __init__.py
│       │   ├── classifier_free_guidance.py
│       │   ├── scheduling_ddim.py
│       │   ├── scheduling_ddpm.py
│       │   ├── scheduling_plms.py
│       │   └── scheduling_utils.py
│       ├── testing_utils.py
│       └── utils
│           ├── __init__.py
│           └── logging.py
├── tests
│   ├── __init__.py
│   ├── test_modeling_utils.py
│   └── test_scheduler.py
└── utils
    ├── check_config_docstrings.py
    ├── check_copies.py
    ├── check_dummies.py
    ├── check_inits.py
    ├── check_repo.py
    ├── check_table.py
    └── check_tf_ops.py
```