<p align="center">
    <br>
    <img src="docs/source/imgs/diffusers_library.jpg" width="400"/>
    <br>
</p>
<p align="center">
    <a href="https://github.com/huggingface/diffusers/blob/main/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/diffusers.svg?color=blue">
    </a>
    <a href="https://github.com/huggingface/diffusers/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
    </a>
    <a href="CODE_OF_CONDUCT.md">
        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
    </a>
</p>

🤗 Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves
as a modular toolbox for inference and training of diffusion models.

More precisely, 🤗 Diffusers offers:

- State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code (see [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines)).
- Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference (see [src/diffusers/schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers)).
- Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system (see [src/diffusers/models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models)).
- Training examples to show how to train the most popular diffusion models (see [examples](https://github.com/huggingface/diffusers/tree/main/examples)).

## Definitions

**Models**: Neural network that models $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ (see image below) and is trained end-to-end to *denoise* a noisy input to an image.
*Examples*: UNet, Conditioned UNet, 3D UNet, Transformer UNet
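
For reference, the DDPM paper parameterizes this learned reverse transition as a Gaussian whose mean (and, optionally, covariance) is produced by the network. The notation below follows that paper and is not a definition taken from this library:

$$
p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)
$$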

<p align="center">
    <img src="https://user-images.githubusercontent.com/10695622/174349667-04e9e485-793b-429a-affe-096e8199ad5b.png" width="800"/>
    <br>
    <em> Figure from DDPM paper (https://arxiv.org/abs/2006.11239). </em>
</p>
    
**Schedulers**: Algorithm class for both **inference** and **training**.
The class provides functionality to compute the previous image according to the alpha, beta schedule, as well as to predict noise for training.
*Examples*: [DDPM](https://arxiv.org/abs/2006.11239), [DDIM](https://arxiv.org/abs/2010.02502), [PNDM](https://arxiv.org/abs/2202.09778), [DEIS](https://arxiv.org/abs/2204.13902)
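
To make "alpha, beta schedule" concrete, here is a minimal pure-Python sketch of the linear beta schedule described in the DDPM paper (betas from 1e-4 to 0.02 over 1000 steps). The function names are hypothetical helpers for illustration, not part of `diffusers`, and the schedulers in this library may compute their schedules differently.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas (defaults are the values used in the DDPM paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every timestep t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# alpha_bar shrinks monotonically from close to 1 towards 0:
# the further along the forward process, the less signal remains.
print(f"alpha_bar[0]={alpha_bars[0]:.4f}, alpha_bar[-1]={alpha_bars[-1]:.2e}")
```

A scheduler needs only these per-timestep constants plus the model's noise prediction to move from one timestep to the previous one, which is why it can live separately from the model.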

<p align="center">
    <img src="https://user-images.githubusercontent.com/10695622/174349706-53d58acc-a4d1-4cda-b3e8-432d9dc7ad38.png" width="800"/>
    <br>
    <em> Sampling and training algorithms. Figure from DDPM paper (https://arxiv.org/abs/2006.11239). </em>
</p>
    
**Diffusion Pipeline**: End-to-end pipeline that includes multiple diffusion models, possibly text encoders, ...
*Examples*: GLIDE, Latent-Diffusion, Imagen, DALL-E 2

<p align="center">
    <img src="https://user-images.githubusercontent.com/10695622/174348898-481bd7c2-5457-4830-89bc-f0907756f64c.jpeg" width="550"/>
    <br>
    <em> Figure from Imagen (https://imagen.research.google/). </em>
</p>
    
## Philosophy

- Readability and clarity are preferred over highly optimized code. Strong emphasis is placed on providing readable, intuitive and elementary code design. *E.g.*, the provided [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) are separated from the provided [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and provide well-commented code that can be read alongside the original paper.
- Diffusers is **modality independent** and focuses on providing pretrained models and tools to build systems that generate **continuous outputs**, *e.g.* vision and audio.
- Diffusion models and schedulers are provided as concise, elementary building blocks. Diffusion pipelines, in contrast, are a collection of end-to-end diffusion systems that can be used out-of-the-box, should stay as close as possible to their original implementation, and can include components of other libraries, such as text encoders. Examples of diffusion pipelines are [Glide](https://github.com/openai/glide-text2im) and [Latent Diffusion](https://github.com/CompVis/latent-diffusion).
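
The separation of models and schedulers can be illustrated with a toy sketch. Every class and function below is hypothetical and operates on a single float rather than on tensors. It only shows the design idea that one sampling loop can drive any scheduler exposing the same `step` interface, and it does not reproduce the `diffusers` API.

```python
class ToyDenoiser:
    """Stand-in for a denoising network: 'predicts' the residual of a scalar sample."""

    def __call__(self, sample, t):
        return 0.1 * sample


class FineScheduler:
    """Many small steps (slower, finer-grained)."""

    num_steps = 10

    def step(self, residual, sample, t):
        return sample - residual


class CoarseScheduler:
    """Fewer, larger steps (faster, coarser)."""

    num_steps = 5

    def step(self, residual, sample, t):
        return sample - 2.0 * residual


def sample_with(model, scheduler, start=1.0):
    """The same loop works for any scheduler exposing `num_steps` and `step`."""
    sample = start
    for t in reversed(range(scheduler.num_steps)):
        residual = model(sample, t)
        sample = scheduler.step(residual, sample, t)
    return sample


model = ToyDenoiser()
fine_result = sample_with(model, FineScheduler())
coarse_result = sample_with(model, CoarseScheduler())
print(fine_result, coarse_result)
```

Swapping the scheduler changes the number of model calls and the update rule, while the model and the loop stay untouched.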

## Quickstart

### Installation

```
pip install diffusers  # should install diffusers 0.0.4
```

### 1. `diffusers` as a toolbox for schedulers and models

`diffusers` is more modularized than `transformers`. The idea is that researchers and engineers can easily use only parts of the library for their own use cases.
It could become a central place for all kinds of models, schedulers, training utils and processors that one can mix and match for one's own use case.
Both models and schedulers should be loadable from and savable to the Hub.

For more examples, see [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) and [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models).

#### **Example for [DDPM](https://arxiv.org/abs/2006.11239):**

```python
import torch
from diffusers import UNetModel, DDPMScheduler
import PIL.Image
import numpy as np
import tqdm

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load models
noise_scheduler = DDPMScheduler.from_config("fusing/ddpm-lsun-church", tensor_format="pt")
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_prediction_steps = len(noise_scheduler)
for t in tqdm.tqdm(reversed(range(num_prediction_steps)), total=num_prediction_steps):
    # predict noise residual
    with torch.no_grad():
        residual = unet(image, t)

    # predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t)

    # optionally sample variance
    variance = 0
    if t > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * noise

    # set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.clamp(0, 255)  # keep values in the valid uint8 range
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# 5. save image
image_pil.save("test.png")
```
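
To clarify what one iteration of the denoising loop computes, here is a minimal scalar sketch of the sampling update from Algorithm 2 of the DDPM paper, using the paper's choice of variance equal to beta_t. The helper `ddpm_reverse_step` and the schedule constants are assumptions made for illustration. This is not the implementation behind `noise_scheduler.step`, which may differ in details.

```python
import math
import random


def ddpm_reverse_step(x_t, predicted_noise, t, betas, alpha_bars, rng):
    """One ancestral sampling step x_t -> x_{t-1} for a single scalar value."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    # Posterior mean: subtract the scaled noise estimate, then rescale.
    mean = (x_t - beta_t / math.sqrt(1.0 - alpha_bars[t]) * predicted_noise) / math.sqrt(alpha_t)
    if t == 0:
        return mean  # no noise is added at the final step
    sigma_t = math.sqrt(beta_t)  # variance choice sigma_t^2 = beta_t
    return mean + sigma_t * rng.gauss(0.0, 1.0)


# Linear schedule with 1000 steps (assumed values, as in the DDPM paper).
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Sanity check: noise a clean value x0 to the first timestep, then undo it
# with a perfect noise prediction. The reverse step recovers x0.
rng = random.Random(0)
x0 = 0.5
eps = rng.gauss(0.0, 1.0)
x_noisy = math.sqrt(alpha_bars[0]) * x0 + math.sqrt(1.0 - alpha_bars[0]) * eps
recovered = ddpm_reverse_step(x_noisy, eps, 0, betas, alpha_bars, rng)
print(abs(recovered - x0) < 1e-9)
```

In the loop above, the "predict previous mean" call plays the role of `mean`, and the "optionally sample variance" branch plays the role of the `sigma_t` term.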

#### **Example for [DDIM](https://arxiv.org/abs/2010.02502):**

```python
import torch
from diffusers import UNetModel, DDIMScheduler
import PIL.Image
import numpy as np
import tqdm

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load models
noise_scheduler = DDIMScheduler.from_config("fusing/ddpm-celeba-hq", tensor_format="pt")
unet = UNetModel.from_pretrained("fusing/ddpm-celeba-hq").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_inference_steps = 50
eta = 0.0  # <- deterministic sampling

for t in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):
    # 3.1 predict noise residual at the training timestep that corresponds to step t
    orig_t = len(noise_scheduler) // num_inference_steps * t

    with torch.inference_mode():
        residual = unet(image, orig_t)

    # 3.2 predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t, num_inference_steps, eta)

    # 3.3 optionally sample variance
    variance = 0
    if eta > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * eta * noise

    # 3.4 set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.clamp(0, 255)  # keep values in the valid uint8 range
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# 5. save image
image_pil.save("test.png")
```
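
The DDIM loop above runs only 50 inference steps but queries the UNet at the timesteps it was trained on. A minimal sketch of that index mapping follows, assuming a scheduler with 1000 training timesteps. The actual value of `len(noise_scheduler)` is an assumption here, and `map_inference_to_training_steps` is a hypothetical helper, not part of `diffusers`.

```python
def map_inference_to_training_steps(num_train_steps, num_inference_steps):
    """Mirror `len(noise_scheduler) // num_inference_steps * t` for every t."""
    stride = num_train_steps // num_inference_steps
    return [stride * t for t in range(num_inference_steps)]


train_steps = map_inference_to_training_steps(1000, 50)
print(train_steps[:3], train_steps[-1])  # [0, 20, 40] 980
```

With these assumed numbers, every inference step skips 20 training timesteps, which is where the speed-up over the 1000-step DDPM loop comes from.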

#### **Examples for other modalities:**

[Diffuser](https://diffusion-planning.github.io/) for planning in reinforcement learning: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1TmBmlYeKUZSkUZoJqfBmaicVTKx6nN1R?usp=sharing)

### 2. `diffusers` as a collection of popular Diffusion systems (GLIDE, DALL-E, ...)

For more examples see [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines).

#### **Example image generation with PNDM**

```python
from diffusers import PNDM, UNetModel, PNDMScheduler
import PIL.Image
import numpy as np
import torch

model_id = "fusing/ddim-celeba-hq"

# load model and scheduler
model = UNetModel.from_pretrained(model_id)
scheduler = PNDMScheduler()

# build the pipeline from its components
pndm = PNDM(unet=model, noise_scheduler=scheduler)

# run pipeline in inference (sample random noise and denoise)
with torch.no_grad():
    image = pndm()

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) / 2
image_processed = torch.clamp(image_processed, 0.0, 1.0)
image_processed = image_processed * 255
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# save image
image_pil.save("test.png")
```
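
The post-processing at the end of this example maps model outputs from roughly [-1, 1] to 8-bit pixel values. Here is a minimal pure-Python sketch of the same per-value arithmetic. `to_uint8` is a hypothetical helper written for illustration.

```python
def to_uint8(value):
    """Map a model output in [-1, 1] to an integer pixel value in [0, 255]."""
    scaled = (value + 1.0) / 2.0          # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)  # guard against slight overshoot
    return int(clamped * 255)             # truncate towards zero


print(to_uint8(-1.0), to_uint8(0.0), to_uint8(1.0), to_uint8(1.3))  # 0 127 255 255
```

The clamp matters because a sampled image can overshoot the nominal range slightly, and casting an out-of-range float to `uint8` does not saturate, so it may produce unexpected pixel values.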

#### **Text to Image generation with Latent Diffusion**

_Note: To use latent diffusion install transformers from [this branch](https://github.com/patil-suraj/transformers/tree/ldm-bert)._

```python
import torch
import numpy as np
import PIL.Image
from diffusers import DiffusionPipeline

ldm = DiffusionPipeline.from_pretrained("fusing/latent-diffusion-text2im-large")

generator = torch.manual_seed(42)

prompt = "A painting of a squirrel eating a burger"
image = ldm([prompt], generator=generator, eta=0.3, guidance_scale=6.0, num_inference_steps=50)

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = image_processed * 255.0
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])

# save image
image_pil.save("test.png")
```

#### **Text to speech with GradTTS and BDDM**

```python
import torch
from scipy.io.wavfile import write as wavwrite
from diffusers import BDDM, GradTTS

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# load grad tts and bddm pipelines
grad_tts = GradTTS.from_pretrained("fusing/grad-tts-libri-tts")
bddm = BDDM.from_pretrained("fusing/diffwave-vocoder-ljspeech")

text = "Hello world, I missed you so much."

# generate mel spectrograms using text
mel_spec = grad_tts(text, torch_device=torch_device)

# generate the speech by passing mel spectrograms to BDDM pipeline
generator = torch.manual_seed(42)
audio = bddm(mel_spec, generator, torch_device=torch_device)

# save generated audio
sampling_rate = 22050
wavwrite("generated_audio.wav", sampling_rate, audio.squeeze().cpu().numpy())
```

## TODO

- Create common API for models [ ]
- Add tests for models [ ]
- Adapt schedulers for training [ ]
- Write Google Colab for training [ ]
- Write docs / Think about how to structure docs [ ]
- Add tests to CircleCI [ ]
- Add [Diffusion LM models](https://arxiv.org/pdf/2205.14217.pdf) [ ]
- Add more vision models [ ]
- Add more speech models [ ]
- Add RL model [ ]