<p align="center">
    <br>
    <img src="docs/source/imgs/diffusers_library.jpg" width="400"/>
    <br>
</p>
<p align="center">
    <a href="https://github.com/huggingface/diffusers/blob/main/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue">
    </a>
    <a href="https://github.com/huggingface/diffusers/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
    </a>
    <a href="CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
    </a>
</p>

🤗 Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves
as a modular toolbox for inference and training of diffusion models.

More precisely, 🤗 Diffusers offers:

- State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code (see [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines)).
- Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference (see [src/diffusers/schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers)).
- Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system (see [src/diffusers/models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models)).
- Training examples to show how to train the most popular diffusion models (see [examples](https://github.com/huggingface/diffusers/tree/main/examples)).

## Definitions

**Models**: Neural network that models **p_θ(x_t-1|x_t)** (see image below) and is trained end-to-end to *denoise* a noisy input to an image.
*Examples*: UNet, Conditioned UNet, 3D UNet, Transformer UNet
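
In the notation of the DDPM paper, the reverse transition that such a model parameterizes is commonly written as a Gaussian. The symbols μ_θ and Σ_θ below come from that paper and are not defined in this README:

```latex
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```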

![model_diff_1_50](https://user-images.githubusercontent.com/23423619/171610307-dab0cd8b-75da-4d4e-9f5a-5922072e2bb5.png)

**Schedulers**: Algorithm class for both **inference** and **training**.
The class provides functionality to compute the previous image according to the alpha/beta schedule, as well as to predict noise for training.
*Examples*: [DDPM](https://arxiv.org/abs/2006.11239), [DDIM](https://arxiv.org/abs/2010.02502), [PNDM](https://arxiv.org/abs/2202.09778), [DEIS](https://arxiv.org/abs/2204.13902)

![sampling](https://user-images.githubusercontent.com/23423619/171608981-3ad05953-a684-4c82-89f8-62a459147a07.png)
![training](https://user-images.githubusercontent.com/23423619/171608964-b3260cce-e6b4-4841-959d-7d8ba4b8d1b2.png)
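
To make the alpha/beta schedule concrete, here is a minimal sketch in plain Python (no `torch` needed). It assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps, as described in the DDPM paper. The helper names `linear_beta_schedule`, `cumulative_alphas` and `add_noise` are hypothetical and are not part of the `diffusers` API:

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return betas that grow linearly from beta_start to beta_end (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s <= t} (1 - beta_s) for every timestep t."""
    alphas_cumprod, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def add_noise(x0, noise, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * noise


betas = linear_beta_schedule()
alphas_cumprod = cumulative_alphas(betas)

# Training-style step on a single scalar "pixel": pick a random timestep,
# noise the clean value, and keep the noise as the regression target.
rng = random.Random(0)
x0 = 0.5
t = rng.randrange(len(betas))
noise = rng.gauss(0.0, 1.0)
x_t = add_noise(x0, noise, alphas_cumprod[t])
```

During training, a model would receive `x_t` and `t` and be asked to predict `noise`; at the last timestep almost none of `x0` is left in `x_t`.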

**Diffusion Pipeline**: End-to-end pipeline that includes multiple diffusion models, possibly text encoders, ...
*Examples*: GLIDE, Latent-Diffusion, Imagen, DALL-E 2

![imagen](https://user-images.githubusercontent.com/23423619/171609001-c3f2c1c9-f597-4a16-9843-749bf3f9431c.png)

## Philosophy

- Readability and clarity are preferred over highly optimized code. Strong emphasis is placed on providing readable, intuitive and elementary code design. *E.g.*, the provided [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) are separated from the provided [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and provide well-commented code that can be read alongside the original paper.
- Diffusers is **modality independent** and focuses on providing pretrained models and tools to build systems that generate **continuous outputs**, *e.g.* vision and audio.
- Diffusion models and schedulers are provided as concise, elementary building blocks, whereas diffusion pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box. Pipelines should stay as close as possible to their original implementation and can include components of other libraries, such as text encoders. Examples of diffusion pipelines are [Glide](https://github.com/openai/glide-text2im) and [Latent Diffusion](https://github.com/CompVis/latent-diffusion).
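
The separation between models, schedulers and pipelines can be illustrated with a toy sketch. Everything here (`ToyScheduler`, `ToyPipeline`, `toy_model`) is hypothetical and works on a single float rather than an image. It only shows how a pipeline can compose an interchangeable scheduler with a model, and is not the library's implementation:

```python
from typing import Callable, List


class ToyScheduler:
    """Hypothetical scheduler: owns the timestep order and the update rule."""

    def __init__(self, num_steps: int, step_size: float = 0.1):
        self.num_steps = num_steps
        self.step_size = step_size

    def timesteps(self) -> List[int]:
        # Denoising runs from the last timestep down to 0.
        return list(reversed(range(self.num_steps)))

    def step(self, residual: float, sample: float, t: int) -> float:
        # Move the sample a small distance against the predicted residual.
        return sample - self.step_size * residual


class ToyPipeline:
    """Hypothetical pipeline: wires a model and a scheduler together."""

    def __init__(self, model: Callable[[float, int], float], scheduler: ToyScheduler):
        self.model = model
        self.scheduler = scheduler

    def __call__(self, sample: float) -> float:
        for t in self.scheduler.timesteps():
            residual = self.model(sample, t)
            sample = self.scheduler.step(residual, sample, t)
        return sample


def toy_model(sample: float, t: int) -> float:
    # Stand-in for a neural network: the "residual" is the distance to 1.0.
    return sample - 1.0


# The same model can be combined with differently configured schedulers.
result = ToyPipeline(toy_model, ToyScheduler(num_steps=50))(5.0)
fast_result = ToyPipeline(toy_model, ToyScheduler(num_steps=10, step_size=0.5))(5.0)
```

Both calls drive the sample towards 1.0; the second uses fewer but larger steps, which mirrors the speed vs. quality trade-off that interchangeable schedulers are meant to expose.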

## Quickstart

### Installation

```
pip install diffusers  # should install diffusers 0.0.4
```

### 1. `diffusers` as a toolbox for schedulers and models.

`diffusers` is more modularized than `transformers`. The idea is that researchers and engineers can easily use only parts of the library for their own use cases.
It could become a central place for all kinds of models, schedulers, training utils and processors that one can mix and match for one's own use case.
Both models and schedulers should be loadable from and saveable to the Hub.

For more examples, see [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) and [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models).

#### **Example for [DDPM](https://arxiv.org/abs/2006.11239):**

```python
import torch
from diffusers import UNetModel, DDPMScheduler
import PIL.Image  # import the submodule explicitly so that PIL.Image is available
import numpy as np
import tqdm

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load models
noise_scheduler = DDPMScheduler.from_config("fusing/ddpm-lsun-church", tensor_format="pt")
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_prediction_steps = len(noise_scheduler)
for t in tqdm.tqdm(reversed(range(num_prediction_steps)), total=num_prediction_steps):
    # predict noise residual
    with torch.no_grad():
        residual = unet(image, t)

    # predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t)

    # optionally sample variance
    variance = 0
    if t > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * noise

    # set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# 5. save image
image_pil.save("test.png")
```
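
The arithmetic behind one denoising step can be sketched on a single scalar without `torch`. This is a minimal sketch assuming the linear beta schedule from the DDPM paper and the posterior variance as the noise scale. What `DDPMScheduler.step` and `get_variance` compute internally is not shown in this README, so `prev_mean`, `posterior_variance` and `perfect_model` are hypothetical stand-ins:

```python
import math
import random

NUM_STEPS = 1000

# Assumed linear beta schedule (1e-4 to 0.02), as described in the DDPM paper.
betas = [1e-4 + i * (0.02 - 1e-4) / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)


def prev_mean(x_t, predicted_noise, t):
    """Mean of x_{t-1} given x_t and the predicted noise (DDPM paper, Algorithm 2)."""
    alpha_t = 1.0 - betas[t]
    coef = betas[t] / math.sqrt(1.0 - alphas_cumprod[t])
    return (x_t - coef * predicted_noise) / math.sqrt(alpha_t)


def posterior_variance(t):
    """Variance of the added noise; zero at t == 0, where no noise is added."""
    if t == 0:
        return 0.0
    return betas[t] * (1.0 - alphas_cumprod[t - 1]) / (1.0 - alphas_cumprod[t])


def perfect_model(x_t, t, x0=0.25):
    """Stand-in for the UNet: returns the exact noise for a known clean value x0."""
    a = alphas_cumprod[t]
    return (x_t - math.sqrt(a) * x0) / math.sqrt(1.0 - a)


rng = random.Random(0)
sample = rng.gauss(0.0, 1.0)  # start from pure noise
for t in reversed(range(NUM_STEPS)):
    residual = perfect_model(sample, t)
    mean = prev_mean(sample, residual, t)
    noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
    sample = mean + math.sqrt(posterior_variance(t)) * noise
```

Because the stand-in model always returns the exact noise, the loop ends at the clean value `0.25`; a trained network only approximates this.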

#### **Example for [DDIM](https://arxiv.org/abs/2010.02502):**

```python
import torch
from diffusers import UNetModel, DDIMScheduler
import PIL.Image  # import the submodule explicitly so that PIL.Image is available
import numpy as np
import tqdm

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load models
noise_scheduler = DDIMScheduler.from_config("fusing/ddpm-celeba-hq", tensor_format="pt")
unet = UNetModel.from_pretrained("fusing/ddpm-celeba-hq").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_inference_steps = 50
eta = 0.0  # <- deterministic sampling

for t in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):
    # predict noise residual
    orig_t = noise_scheduler.get_orig_t(t, num_inference_steps)
    with torch.no_grad():
        residual = unet(image, orig_t)

    # predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t, num_inference_steps, eta)

    # optionally sample variance
    variance = 0
    if eta > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * eta * noise

    # set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# 5. save image
image_pil.save("test.png")
```
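
DDIM needs only 50 steps here because each inference step jumps over several training timesteps. The sketch below shows one way this can work on a single scalar, using the update rule from the DDIM paper. It assumes 1000 training timesteps, a linear beta schedule and an evenly spaced mapping; `orig_timestep` is a hypothetical stand-in for what `get_orig_t` might do, and `ddim_step` is not the library's implementation:

```python
import math

NUM_TRAIN_STEPS = 1000  # assumption: number of timesteps used during training

# Assumed linear beta schedule (1e-4 to 0.02).
betas = [1e-4 + i * (0.02 - 1e-4) / (NUM_TRAIN_STEPS - 1) for i in range(NUM_TRAIN_STEPS)]
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)


def orig_timestep(t, num_inference_steps):
    """Hypothetical evenly spaced mapping from inference step to training timestep."""
    if t < 0:
        return -1
    return NUM_TRAIN_STEPS // num_inference_steps * t


def alpha_bar(orig_t):
    # Convention: alpha_bar is 1 "before" the first timestep (no noise left).
    return alphas_cumprod[orig_t] if orig_t >= 0 else 1.0


def ddim_step(x_t, predicted_noise, t, num_inference_steps, eta=0.0, noise=0.0):
    a_t = alpha_bar(orig_timestep(t, num_inference_steps))
    a_prev = alpha_bar(orig_timestep(t - 1, num_inference_steps))
    # Estimate of the clean sample implied by the noise prediction.
    pred_x0 = (x_t - math.sqrt(1.0 - a_t) * predicted_noise) / math.sqrt(a_t)
    # eta scales the stochastic part; eta == 0 gives the deterministic sampler.
    sigma = eta * math.sqrt((1.0 - a_prev) / (1.0 - a_t)) * math.sqrt(1.0 - a_t / a_prev)
    direction = math.sqrt(max(0.0, 1.0 - a_prev - sigma**2)) * predicted_noise
    return math.sqrt(a_prev) * pred_x0 + direction + sigma * noise


def perfect_model(x_t, orig_t, x0=0.25):
    """Stand-in for the UNet: returns the exact noise for a known clean value x0."""
    a = alpha_bar(orig_t)
    return (x_t - math.sqrt(a) * x0) / math.sqrt(1.0 - a)


num_inference_steps = 50
sample = 1.3  # arbitrary starting value standing in for sampled noise
for t in reversed(range(num_inference_steps)):
    orig_t = orig_timestep(t, num_inference_steps)
    residual = perfect_model(sample, orig_t)
    sample = ddim_step(sample, residual, t, num_inference_steps, eta=0.0)
```

With `eta=0.0` no fresh noise enters the loop, so the result depends only on the starting value and the model; here the exact stand-in model brings the sample back to `0.25`.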

### 2. `diffusers` as a collection of popular diffusion systems (GLIDE, DALL-E, ...)

For more examples see [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines).

#### **Example image generation with PNDM**

```python
from diffusers import PNDM, UNetModel, PNDMScheduler
import PIL.Image
import numpy as np
import torch

model_id = "fusing/ddim-celeba-hq"

# load model and scheduler
model = UNetModel.from_pretrained(model_id)
scheduler = PNDMScheduler()

# build the pipeline from the model and the scheduler
pndm = PNDM(unet=model, noise_scheduler=scheduler)

# run pipeline in inference (sample random noise and denoise)
with torch.no_grad():
    image = pndm()

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) / 2
image_processed = torch.clamp(image_processed, 0.0, 1.0)
image_processed = image_processed * 255
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# save image
image_pil.save("test.png")
```
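
The post-processing at the end of the example above maps model outputs from roughly [-1, 1] to 8-bit pixel values. Here is a minimal sketch of the same arithmetic on plain Python floats; `to_uint8` is a hypothetical helper, not a `diffusers` function. The clamp matters because a model can overshoot the nominal range, and out-of-range values may not convert to a sensible pixel value when cast to `uint8`:

```python
def to_uint8(value):
    """Map a float in [-1, 1] to an integer pixel value in [0, 255]."""
    scaled = (value + 1.0) / 2.0          # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)  # guard against overshoot
    return int(clamped * 255)             # truncate towards zero


# Two of these inputs overshoot the nominal range on purpose.
pixels = [-1.2, -1.0, 0.0, 1.0, 1.05]
converted = [to_uint8(p) for p in pixels]
```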

#### **Text to Image generation with Latent Diffusion**

_Note: To use latent diffusion install transformers from [this branch](https://github.com/patil-suraj/transformers/tree/ldm-bert)._

```python
import torch
import numpy as np
import PIL.Image
from diffusers import DiffusionPipeline

ldm = DiffusionPipeline.from_pretrained("fusing/latent-diffusion-text2im-large")

generator = torch.manual_seed(42)

prompt = "A painting of a squirrel eating a burger"
image = ldm([prompt], generator=generator, eta=0.3, guidance_scale=6.0, num_inference_steps=50)

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = image_processed * 255.0
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# save image
image_pil.save("test.png")
```

#### **Text to speech with BDDM**

_Follow the instructions [here](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/) to load tacotron2 model._

```python
import torch
from diffusers import BDDM, DiffusionPipeline

torch_device = "cuda"

# load the BDDM pipeline
bddm = DiffusionPipeline.from_pretrained("fusing/diffwave-vocoder-ljspeech")

# load tacotron2 to get the mel spectrograms
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to(torch_device).eval()

text = "Hello world, I missed you so much."

utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

# generate mel spectrograms using text
with torch.no_grad():
    mel_spec, _, _ = tacotron2.infer(sequences, lengths)

# generate the speech by passing mel spectrograms to BDDM pipeline
generator = torch.manual_seed(0)
audio = bddm(mel_spec, generator, torch_device)

# save generated audio
from scipy.io.wavfile import write as wavwrite
sampling_rate = 22050
wavwrite("generated_audio.wav", sampling_rate, audio.squeeze().cpu().numpy())
```
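
If `scipy` is not available, the audio can also be written with Python's standard `wave` module. This is a minimal sketch, assuming mono float samples in [-1, 1] and the 22050 Hz sampling rate used above. `write_wav` is a hypothetical helper, and a sine tone stands in for the generated audio:

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sampling_rate=22050):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        s = min(max(s, -1.0), 1.0)                  # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# One second of a 440 Hz tone as a stand-in for the generated audio.
sampling_rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(sampling_rate)]
out_path = os.path.join(tempfile.gettempdir(), "generated_audio_sketch.wav")
write_wav(out_path, tone, sampling_rate)
```

With real pipeline output, the samples would come from something like `audio.squeeze().cpu().numpy().tolist()`. Unlike the `scipy` call above, which stores the array in its own dtype, this sketch always stores 16-bit integers.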

## TODO

- Create common API for models [ ]
- Add tests for models [ ]
- Adapt schedulers for training [ ]
- Write Google Colab for training [ ]
- Write docs / Think about how to structure docs [ ]
- Add tests to CircleCI [ ]
- Add more vision models [ ]
- Add more speech models [ ]
- Add RL model [ ]