Unverified Commit 988369a0 authored by Suraj Patil, committed by GitHub

Merge branch 'main' into grad-tts

parents 5a3467e6 bed32182
<p align="center">
<br>
<img src="docs/source/imgs/diffusers_library.jpg" width="400"/>
<br>
</p>
<p align="center">
<a href="https://github.com/huggingface/diffusers/blob/main/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/diffusers.svg?color=blue">
</a>
<a href="https://github.com/huggingface/diffusers/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
</a>
<a href="CODE_OF_CONDUCT.md">
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
</a>
</p>
🤗 Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves
as a modular toolbox for inference and training of diffusion models.
More precisely, 🤗 Diffusers offers:
- State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code (see [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines)).
- Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference (see [src/diffusers/schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers)).
- Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system (see [src/diffusers/models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models)).
- Training examples to show how to train the most popular diffusion models (see [examples](https://github.com/huggingface/diffusers/tree/main/examples)).
## Definitions

**Models**: Neural network that models $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ (see image below) and is trained end-to-end to *denoise* a noisy input to an image.
*Examples*: UNet, Conditioned UNet, 3D UNet, Transformer UNet

![model_diff_1_50](https://user-images.githubusercontent.com/23423619/171610307-dab0cd8b-75da-4d4e-9f5a-5922072e2bb5.png)
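For intuition, here is a generic sketch of what such a model learns, written in standard DDPM notation (Ho et al., 2020) and not tied to any specific model class in this library: the network is a noise predictor $\epsilon_\theta$ trained on noised images.

```latex
% Forward noising used to build training pairs (alpha/beta schedule as defined under "Schedulers"):
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Simple denoising objective the model is trained with:
L_{\text{simple}}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}\,
    \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2
```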
**Schedulers**: Algorithm class for both **inference** and **training**.
The class provides functionality to compute the previous image according to an alpha/beta schedule as well as to predict noise for training.
*Examples*: [DDPM](https://arxiv.org/abs/2006.11239), [DDIM](https://arxiv.org/abs/2010.02502), [PNDM](https://arxiv.org/abs/2202.09778), [DEIS](https://arxiv.org/abs/2204.13902)

![sampling](https://user-images.githubusercontent.com/23423619/171608981-3ad05953-a684-4c82-89f8-62a459147a07.png)
![training](https://user-images.githubusercontent.com/23423619/171608964-b3260cce-e6b4-4841-959d-7d8ba4b8d1b2.png)
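Concretely, a scheduler owns the noise schedule and turns the model's noise prediction into the previous image. A generic sketch of the quantities involved (standard DDPM notation, independent of any particular scheduler class in the library):

```latex
% Noise schedule:
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

% One reverse (sampling) step x_t -> x_{t-1}, given the model's noise prediction \epsilon_\theta(x_t, t):
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
          + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),\ \ \sigma_t^2 = \beta_t
```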
**Diffusion Pipeline**: End-to-end pipeline that includes multiple diffusion models, possible text encoders, ...
*Examples*: GLIDE, Latent Diffusion, Imagen, DALL-E 2

![imagen](https://user-images.githubusercontent.com/23423619/171609001-c3f2c1c9-f597-4a16-9843-749bf3f9431c.png)
## Philosophy
- Readability and clarity are preferred over highly optimized code. Strong emphasis is placed on providing readable, intuitive and elementary code design. *E.g.*, the provided [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) are separated from the provided [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and provide well-commented code that can be read alongside the original paper.
- Diffusers is **modality independent** and focuses on providing pretrained models and tools to build systems that generate **continuous outputs**, *e.g.* vision and audio.
- Diffusion models and schedulers are provided as concise, elementary building blocks, whereas diffusion pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box, should stay as close as possible to their original implementation, and can include components of other libraries, such as text encoders. Examples of diffusion pipelines are [Glide](https://github.com/openai/glide-text2im) and [Latent Diffusion](https://github.com/CompVis/latent-diffusion).
## Quickstart
### Installation
```
pip install diffusers  # should install diffusers 0.0.4
```
### 1. `diffusers` as a toolbox for schedulers and models
`diffusers` is more modularized than `transformers`. The idea is that researchers and engineers can easily use only parts of the library for their own use cases.
It could become a central place for all kinds of models, schedulers, training utils and processors that one can mix and match for one's own use case.
Both models and schedulers should be loadable and saveable from the Hub.
For more examples see [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) and [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models).
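To illustrate the mix-and-match idea, a minimal sketch of loading and saving a single building block (the checkpoint name comes from the DDPM example below; `save_pretrained` on models and loading from a local directory are assumed to mirror the usual Hugging Face API):

```python
from diffusers import DDPMScheduler, UNetModel

# load a pretrained denoising model from the Hub and pair it with any scheduler
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church")
noise_scheduler = DDPMScheduler(timesteps=1000)

# save the model locally so it can be reloaded later
# (assumes ModelMixin exposes `save_pretrained`, analogous to pipelines)
unet.save_pretrained("./my-ddpm-unet")
unet = UNetModel.from_pretrained("./my-ddpm-unet")
```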
#### **Example for [DDPM](https://arxiv.org/abs/2006.11239):**
```python
import torch
import tqdm
import PIL.Image
import numpy as np
from diffusers import DDPMScheduler, UNetModel

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load model and scheduler
noise_scheduler = DDPMScheduler(timesteps=1000)
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_prediction_steps = len(noise_scheduler)
for t in tqdm.tqdm(reversed(range(num_prediction_steps)), total=num_prediction_steps):
    # predict noise residual
    with torch.no_grad():
        residual = unet(image, t)

    # predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t)

    # optionally sample variance
    variance = 0
    if t > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * noise

    # set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])
image_pil.save("test.png")
```

#### **Example for [DDIM](https://arxiv.org/abs/2010.02502):**
```python
# setup (imports, generator, torch_device) as in the DDPM example above,
# but with a DDIM noise scheduler paired with the model
unet = UNetModel.from_pretrained("fusing/ddpm-celeba-hq").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_inference_steps = 50
eta = 0.0  # <- deterministic sampling
for t in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):
    # 1. predict noise residual
    orig_t = noise_scheduler.get_orig_t(t, num_inference_steps)
    with torch.inference_mode():
        residual = unet(image, orig_t)

    # 2. predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t, num_inference_steps, eta)

    # 3. optionally sample variance
    variance = 0
    if eta > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * eta * noise

    # 4. set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 5. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])
image_pil.save("test.png")
```
### 2. `diffusers` as a collection of popular Diffusion systems (GLIDE, Dalle, ...)
The `DiffusionPipeline` class allows you to easily run a diffusion model in inference.
For more examples see [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines).
#### **Example image generation with PNDM**
```python
from diffusers import PNDM, UNetModel, PNDMScheduler
import PIL.Image
import numpy as np
import torch

model_id = "fusing/ddim-celeba-hq"

# load model and scheduler
model = UNetModel.from_pretrained(model_id)
scheduler = PNDMScheduler()
pndm = PNDM(unet=model, noise_scheduler=scheduler)

# run pipeline in inference (sample random noise and denoise)
with torch.no_grad():
    image = pndm()

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) / 2
image_processed = torch.clamp(image_processed, 0.0, 1.0)
image_processed = image_processed * 255
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])
image_pil.save("test.png")
```
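Full pipelines can also be loaded directly from the Hub. A minimal sketch based on the `DiffusionPipeline.from_pretrained` loader (the checkpoint name is only illustrative):

```python
from diffusers import DiffusionPipeline
import PIL.Image
import numpy as np

# load an end-to-end pipeline (model + scheduler) from the Hub
ddpm = DiffusionPipeline.from_pretrained("fusing/ddpm-lsun-bedroom")

# run it in inference: sample random noise and denoise it
image = ddpm()

# convert the [-1, 1] tensor output to a PIL image
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
PIL.Image.fromarray(image_processed[0]).save("ddpm_sample.png")
```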
#### **Text to speech with BDDM**
_Follow the instructions [here](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/) to load the Tacotron 2 model._
```python
import torch

# ... load Tacotron 2, generate a mel spectrogram, and run the BDDM vocoder pipeline
# on it to obtain `audio` (the middle of this example is omitted here) ...

sampling_rate = 22050
wavwrite("generated_audio.wav", sampling_rate, audio.squeeze().cpu().numpy())
```
## TODO
- Create common API for models [ ]
- Add tests for models [ ]
- Adapt schedulers for training [ ]
- Write google colab for training [ ]
- Write docs / Think about how to structure docs [ ]
- Add tests to circle ci [ ]
- Add [Diffusion LM models](https://arxiv.org/pdf/2205.14217.pdf) [ ]
- Add more vision models [ ]
- Add more speech models [ ]
- Add RL model [ ]
## Training examples
### Flowers DDPM
The command to train a DDPM UNet model on the Oxford Flowers dataset:
```bash
python -m torch.distributed.launch \
--nproc_per_node 4 \
train_ddpm.py \
--dataset="huggan/flowers-102-categories" \
--resolution=64 \
--output_path="flowers-ddpm" \
--batch_size=16 \
--num_epochs=100 \
--gradient_accumulation_steps=1 \
--lr=1e-4 \
--warmup_steps=500 \
--mixed_precision=no
```
A full training run takes 2 hours on 4 x V100 GPUs.
<img src="https://user-images.githubusercontent.com/26864830/173855866-5628989f-856b-4725-a944-d6c09490b2df.png" width="500" />
### Pokemon DDPM
The command to train a DDPM UNet model on the Pokemon dataset:
```bash
python -m torch.distributed.launch \
--nproc_per_node 4 \
train_ddpm.py \
--dataset="huggan/pokemon" \
--resolution=64 \
--output_path="pokemon-ddpm" \
--batch_size=16 \
--num_epochs=100 \
--gradient_accumulation_steps=1 \
--lr=1e-4 \
--warmup_steps=500 \
--mixed_precision=no
```
A full training run takes 2 hours on 4 x V100 GPUs.
<img src="https://user-images.githubusercontent.com/26864830/173856733-4f117f8c-97bd-4f51-8002-56b488c96df9.png" width="500" />
import argparse
import os

import torch
import torch.nn.functional as F

import PIL.Image
from accelerate import Accelerator
from datasets import load_dataset
from diffusers import DDPM, DDPMScheduler, UNetModel
from torchvision.transforms import (
    CenterCrop,
    Compose,
    InterpolationMode,
    Lambda,
    RandomHorizontalFlip,
    Resize,
    ToTensor,
@@ -31,44 +31,40 @@ def main(args):
        dropout=0.0,
        num_res_blocks=2,
        resamp_with_conv=True,
        resolution=args.resolution,
    )
    noise_scheduler = DDPMScheduler(timesteps=1000)
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

    augmentations = Compose(
        [
            Resize(args.resolution, interpolation=InterpolationMode.BILINEAR),
            CenterCrop(args.resolution),
            RandomHorizontalFlip(),
            ToTensor(),
            Lambda(lambda x: x * 2 - 1),
        ]
    )
    dataset = load_dataset(args.dataset, split="train")

    def transforms(examples):
        images = [augmentations(image.convert("RGB")) for image in examples["image"]]
        return {"input": images}

    dataset.set_transform(transforms)
    train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=(len(train_dataloader) * args.num_epochs) // args.gradient_accumulation_steps,
    )

    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )

    for epoch in range(args.num_epochs):
        model.train()
        with tqdm(total=len(train_dataloader), unit="ba") as pbar:
            pbar.set_description(f"Epoch {epoch}")
@@ -84,14 +80,15 @@ def main(args):
                    noise_samples[idx] = noise
                    noisy_images[idx] = noise_scheduler.forward_step(clean_images[idx], noise, timesteps[idx])

                if step % args.gradient_accumulation_steps != 0:
                    with accelerator.no_sync(model):
                        output = model(noisy_images, timesteps)
                        # predict the noise residual
                        loss = F.mse_loss(output, noise_samples)
                        accelerator.backward(loss)
                else:
                    output = model(noisy_images, timesteps)
                    # predict the noise residual
                    loss = F.mse_loss(output, noise_samples)
                    accelerator.backward(loss)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
@@ -103,13 +100,18 @@ def main(args):
                    optimizer.step()

        # Generate a sample image for visual inspection
        torch.distributed.barrier()
        if args.local_rank in [-1, 0]:
            model.eval()
            with torch.no_grad():
                if isinstance(model, torch.nn.parallel.DistributedDataParallel):
                    pipeline = DDPM(unet=model.module, noise_scheduler=noise_scheduler)
                else:
                    pipeline = DDPM(unet=model, noise_scheduler=noise_scheduler)
                pipeline.save_pretrained(args.output_path)

                generator = torch.manual_seed(0)
                # run pipeline in inference (sample random noise and denoise)
                image = pipeline(generator=generator)
@@ -120,22 +122,33 @@ def main(args):
                image_pil = PIL.Image.fromarray(image_processed[0])

                # save image
                test_dir = os.path.join(args.output_path, "test_samples")
                os.makedirs(test_dir, exist_ok=True)
                image_pil.save(f"{test_dir}/{epoch}.png")

        torch.distributed.barrier()
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Simple example of a training script.")
    parser.add_argument("--local_rank", type=int)
    parser.add_argument("--dataset", type=str, default="huggan/flowers-102-categories")
    parser.add_argument("--resolution", type=int, default=64)
    parser.add_argument("--output_path", type=str, default="ddpm-model")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--num_epochs", type=int, default=100)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default="no",
        choices=["no", "fp16", "bf16"],
        help=(
            "Whether to use mixed precision. Choose "
            "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
            "and an Nvidia Ampere GPU."
        ),
    )

    args = parser.parse_args()
......
@@ -87,7 +87,6 @@ _deps = [
    "regex!=2019.12.17",
    "requests",
    "torch>=1.4",
]

# this is a lookup table with items like:
@@ -172,13 +171,12 @@ install_requires = [
    deps["regex"],
    deps["requests"],
    deps["torch"],
    deps["Pillow"],
]

setup(
    name="diffusers",
    version="0.0.4",
    description="Diffusers",
    long_description=open("README.md", "r", encoding="utf-8").read(),
    long_description_content_type="text/markdown",
......
@@ -2,14 +2,14 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

__version__ = "0.0.4"

from .modeling_utils import ModelMixin
from .models.unet import UNetModel
from .models.unet_glide import GLIDESuperResUNetModel, GLIDETextToImageUNetModel, GLIDEUNetModel
from .models.unet_grad_tts import UNetGradTTSModel
from .models.unet_ldm import UNetLDMModel
from .pipeline_utils import DiffusionPipeline
from .pipelines import BDDM, DDIM, DDPM, GLIDE, PNDM, GradTTS, LatentDiffusion
from .schedulers import DDIMScheduler, DDPMScheduler, GradTTSScheduler, PNDMScheduler, SchedulerMixin
from .schedulers.classifier_free_guidance import ClassifierFreeGuidanceScheduler
@@ -226,7 +226,7 @@ class ConfigMixin:
        return json.loads(text)

    def __repr__(self):
        return f"{self.__class__.__name__} {self.to_json_string()}"

    @property
    def config(self) -> Dict[str, Any]:
......
@@ -13,5 +13,4 @@ deps = {
    "regex": "regex!=2019.12.17",
    "requests": "requests",
    "torch": "torch>=1.4",
}
# Models
- Models: Neural network that models $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ (see image below) and is trained end-to-end to denoise a noisy input to an image. Examples: UNet, Conditioned UNet, 3D UNet, Transformer UNet
## API
TODO(Suraj, Patrick)
## Examples
TODO(Suraj, Patrick)
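A minimal usage sketch of a model as a noise predictor (mirroring the DDPM quickstart in the top-level README; the checkpoint name is only illustrative):

```python
import torch
from diffusers import UNetModel

unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church")

# a batch of pure-noise "images" at the model's native resolution
x_t = torch.randn(1, unet.in_channels, unet.resolution, unet.resolution)
t = 999  # diffusion timestep

# the model predicts the noise residual for the given timestep
with torch.no_grad():
    residual = unet(x_t, t)
```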
@@ -17,6 +17,6 @@
# limitations under the License.

from .unet import UNetModel
from .unet_glide import GLIDESuperResUNetModel, GLIDETextToImageUNetModel, GLIDEUNetModel
from .unet_grad_tts import UNetGradTTSModel
from .unet_ldm import UNetLDMModel
@@ -26,7 +26,6 @@ from torch.optim import Adam
from torch.utils import data

from PIL import Image
from tqdm import tqdm

from ..configuration_utils import ConfigMixin
@@ -331,171 +330,3 @@ class UNetModel(ModelMixin, ConfigMixin):
        h = nonlinearity(h)
        h = self.conv_out(h)
        return h
# dataset classes
class Dataset(data.Dataset):
def __init__(self, folder, image_size, exts=["jpg", "jpeg", "png"]):
super().__init__()
self.folder = folder
self.image_size = image_size
self.paths = [p for ext in exts for p in Path(f"{folder}").glob(f"**/*.{ext}")]
self.transform = transforms.Compose(
[
transforms.Resize(image_size),
transforms.RandomHorizontalFlip(),
transforms.CenterCrop(image_size),
transforms.ToTensor(),
]
)
def __len__(self):
return len(self.paths)
def __getitem__(self, index):
path = self.paths[index]
img = Image.open(path)
return self.transform(img)
# trainer class
class EMA:
def __init__(self, beta):
super().__init__()
self.beta = beta
def update_model_average(self, ma_model, current_model):
for current_params, ma_params in zip(current_model.parameters(), ma_model.parameters()):
old_weight, up_weight = ma_params.data, current_params.data
ma_params.data = self.update_average(old_weight, up_weight)
def update_average(self, old, new):
if old is None:
return new
return old * self.beta + (1 - self.beta) * new
def cycle(dl):
while True:
for data_dl in dl:
yield data_dl
def num_to_groups(num, divisor):
groups = num // divisor
remainder = num % divisor
arr = [divisor] * groups
if remainder > 0:
arr.append(remainder)
return arr
class Trainer(object):
def __init__(
self,
diffusion_model,
folder,
*,
ema_decay=0.995,
image_size=128,
train_batch_size=32,
train_lr=1e-4,
train_num_steps=100000,
gradient_accumulate_every=2,
amp=False,
step_start_ema=2000,
update_ema_every=10,
save_and_sample_every=1000,
results_folder="./results",
):
super().__init__()
self.model = diffusion_model
self.ema = EMA(ema_decay)
self.ema_model = copy.deepcopy(self.model)
self.update_ema_every = update_ema_every
self.step_start_ema = step_start_ema
self.save_and_sample_every = save_and_sample_every
self.batch_size = train_batch_size
self.image_size = diffusion_model.image_size
self.gradient_accumulate_every = gradient_accumulate_every
self.train_num_steps = train_num_steps
self.ds = Dataset(folder, image_size)
self.dl = cycle(data.DataLoader(self.ds, batch_size=train_batch_size, shuffle=True, pin_memory=True))
self.opt = Adam(diffusion_model.parameters(), lr=train_lr)
self.step = 0
self.amp = amp
self.scaler = GradScaler(enabled=amp)
self.results_folder = Path(results_folder)
self.results_folder.mkdir(exist_ok=True)
self.reset_parameters()
def reset_parameters(self):
self.ema_model.load_state_dict(self.model.state_dict())
def step_ema(self):
if self.step < self.step_start_ema:
self.reset_parameters()
return
self.ema.update_model_average(self.ema_model, self.model)
def save(self, milestone):
data = {
"step": self.step,
"model": self.model.state_dict(),
"ema": self.ema_model.state_dict(),
"scaler": self.scaler.state_dict(),
}
torch.save(data, str(self.results_folder / f"model-{milestone}.pt"))
def load(self, milestone):
data = torch.load(str(self.results_folder / f"model-{milestone}.pt"))
self.step = data["step"]
self.model.load_state_dict(data["model"])
self.ema_model.load_state_dict(data["ema"])
self.scaler.load_state_dict(data["scaler"])
def train(self):
with tqdm(initial=self.step, total=self.train_num_steps) as pbar:
while self.step < self.train_num_steps:
for i in range(self.gradient_accumulate_every):
data = next(self.dl).cuda()
with autocast(enabled=self.amp):
loss = self.model(data)
self.scaler.scale(loss / self.gradient_accumulate_every).backward()
pbar.set_description(f"loss: {loss.item():.4f}")
self.scaler.step(self.opt)
self.scaler.update()
self.opt.zero_grad()
if self.step % self.update_ema_every == 0:
self.step_ema()
if self.step != 0 and self.step % self.save_and_sample_every == 0:
self.ema_model.eval()
milestone = self.step // self.save_and_sample_every
batches = num_to_groups(36, self.batch_size)
all_images_list = list(map(lambda n: self.ema_model.sample(batch_size=n), batches))
all_images = torch.cat(all_images_list, dim=0)
utils.save_image(all_images, str(self.results_folder / f"sample-{milestone}.png"), nrow=6)
self.save(milestone)
self.step += 1
pbar.update(1)
print("training complete")
@@ -2,6 +2,7 @@ import math
import torch

try:
    from einops import rearrange
except:
@@ -11,6 +12,7 @@ except:
from ..configuration_utils import ConfigMixin
from ..modeling_utils import ModelMixin


class Mish(torch.nn.Module):
    def forward(self, x):
        return x * torch.tanh(torch.nn.functional.softplus(x))
@@ -47,9 +49,9 @@ class Rezero(torch.nn.Module):
class Block(torch.nn.Module):
    def __init__(self, dim, dim_out, groups=8):
        super(Block, self).__init__()
        self.block = torch.nn.Sequential(
            torch.nn.Conv2d(dim, dim_out, 3, padding=1), torch.nn.GroupNorm(groups, dim_out), Mish()
        )

    def forward(self, x, mask):
        output = self.block(x * mask)
@@ -59,8 +61,7 @@ class Block(torch.nn.Module):
class ResnetBlock(torch.nn.Module):
    def __init__(self, dim, dim_out, time_emb_dim, groups=8):
        super(ResnetBlock, self).__init__()
        self.mlp = torch.nn.Sequential(Mish(), torch.nn.Linear(time_emb_dim, dim_out))

        self.block1 = Block(dim, dim_out, groups=groups)
        self.block2 = Block(dim_out, dim_out, groups=groups)
@@ -83,18 +84,16 @@ class LinearAttention(torch.nn.Module):
        self.heads = heads
        hidden_dim = dim_head * heads
        self.to_qkv = torch.nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
        self.to_out = torch.nn.Conv2d(hidden_dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        qkv = self.to_qkv(x)
        q, k, v = rearrange(qkv, "b (qkv heads c) h w -> qkv b heads c (h w)", heads=self.heads, qkv=3)
        k = k.softmax(dim=-1)
        context = torch.einsum("bhdn,bhen->bhde", k, v)
        out = torch.einsum("bhde,bhdn->bhen", context, q)
        out = rearrange(out, "b heads c (h w) -> b (heads c) h w", heads=self.heads, h=h, w=w)
        return self.to_out(out)
@@ -124,16 +123,7 @@ class SinusoidalPosEmb(torch.nn.Module):
class UNetGradTTSModel(ModelMixin, ConfigMixin):
    def __init__(self, dim, dim_mults=(1, 2, 4), groups=8, n_spks=None, spk_emb_dim=64, n_feats=80, pe_scale=1000):
        super(UNetGradTTSModel, self).__init__()

        self.register(
@@ -143,23 +133,23 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
            n_spks=n_spks,
            spk_emb_dim=spk_emb_dim,
            n_feats=n_feats,
            pe_scale=pe_scale,
        )

        self.dim = dim
        self.dim_mults = dim_mults
        self.groups = groups
        self.n_spks = n_spks if not isinstance(n_spks, type(None)) else 1
        self.spk_emb_dim = spk_emb_dim
        self.pe_scale = pe_scale

        if n_spks > 1:
            self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)
            self.spk_mlp = torch.nn.Sequential(torch.nn.Linear(spk_emb_dim, spk_emb_dim * 4), Mish(),
                                               torch.nn.Linear(spk_emb_dim * 4, n_feats))
        self.time_pos_emb = SinusoidalPosEmb(dim)
        self.mlp = torch.nn.Sequential(torch.nn.Linear(dim, dim * 4), Mish(), torch.nn.Linear(dim * 4, dim))

        dims = [2 + (1 if n_spks > 1 else 0), *map(lambda m: dim * m, dim_mults)]
        in_out = list(zip(dims[:-1], dims[1:]))
@@ -169,11 +159,16 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        for ind, (dim_in, dim_out) in enumerate(in_out):
            is_last = ind >= (num_resolutions - 1)
            self.downs.append(
                torch.nn.ModuleList(
                    [
                        ResnetBlock(dim_in, dim_out, time_emb_dim=dim),
                        ResnetBlock(dim_out, dim_out, time_emb_dim=dim),
                        Residual(Rezero(LinearAttention(dim_out))),
                        Downsample(dim_out) if not is_last else torch.nn.Identity(),
                    ]
                )
            )

        mid_dim = dims[-1]
        self.mid_block1 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim)
@@ -181,11 +176,16 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        self.mid_block2 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim)

        for ind, (dim_in, dim_out) in enumerate(reversed(in_out[1:])):
            self.ups.append(
                torch.nn.ModuleList(
                    [
                        ResnetBlock(dim_out * 2, dim_in, time_emb_dim=dim),
                        ResnetBlock(dim_in, dim_in, time_emb_dim=dim),
                        Residual(Rezero(LinearAttention(dim_in))),
                        Upsample(dim_in),
                    ]
                )
            )

        self.final_block = Block(dim, dim)
        self.final_conv = torch.nn.Conv2d(dim, 1, 1)
@@ -196,7 +196,7 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        if not isinstance(spk, type(None)):
            s = self.spk_mlp(spk)

        t = self.time_pos_emb(t, scale=self.pe_scale)
        t = self.mlp(t)
@@ -235,4 +235,4 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        x = self.final_block(x, mask)
        output = self.final_conv(x * mask)
        return (output * mask).squeeze(1)
@@ -57,14 +57,14 @@ class DiffusionPipeline(ConfigMixin):
    def register_modules(self, **kwargs):
        # import it here to avoid circular import
        from diffusers import pipelines

        for name, module in kwargs.items():
            # check if the module is a pipeline module
            is_pipeline_module = hasattr(pipelines, module.__module__.split(".")[-1])
            # retrieve library
            library = module.__module__.split(".")[0]
            # if library is not in LOADABLE_CLASSES, then it is a custom module.
            # Or if it's a pipeline module, then the module is inside the pipeline
            # so we set the library to module name.
@@ -160,10 +160,10 @@ class DiffusionPipeline(ConfigMixin):
        init_dict, _ = pipeline_class.extract_init_dict(config_dict, **kwargs)

        init_kwargs = {}

        # import it here to avoid circular import
        from diffusers import pipelines

        # 4. Load each module in the pipeline
        for name, (library_name, class_name) in init_dict.items():
            is_pipeline_module = hasattr(pipelines, library_name)
......
# Pipelines
- Pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box
- Pipelines should stay as close as possible to their original implementation
- Pipelines can include components of other libraries, such as text encoders.
## API
TODO(Patrick, Anton, Suraj)
## Examples
- DDPM for unconditional image generation in [pipeline_ddpm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddpm.py).
- DDIM for unconditional image generation in [pipeline_ddim](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddim.py).
- PNDM for unconditional image generation in [pipeline_pndm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_pndm.py).
- Latent diffusion for text to image generation / conditional image generation in [pipeline_latent_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_latent_diffusion.py).
- Glide for text to image generation / conditional image generation in [pipeline_glide](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_glide.py).
- BDDM for spectrogram-to-sound vocoding in [pipeline_bddm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_bddm.py).
- Grad-TTS for text to audio generation / conditional audio generation in [pipeline_grad_tts](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_grad_tts.py).
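
All of the pipelines above follow the same call pattern. A minimal sketch for unconditional image generation with DDPM, built explicitly from its components as in the training example (checkpoint name illustrative):

```python
from diffusers import DDPM, DDPMScheduler, UNetModel

# build the pipeline from its two building blocks
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church")
noise_scheduler = DDPMScheduler(timesteps=1000)
ddpm = DDPM(unet=unet, noise_scheduler=noise_scheduler)

# unconditional generation: sample random noise and denoise it
image = ddpm()
```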
from .pipeline_bddm import BDDM
from .pipeline_ddim import DDIM
from .pipeline_ddpm import DDPM
from .pipeline_grad_tts import GradTTS

try:
    from .pipeline_glide import GLIDE
except (NameError, ImportError):

    class GLIDE:
        pass

from .pipeline_latent_diffusion import LatentDiffusion
from .pipeline_pndm import PNDM
@@ -283,7 +283,7 @@ class BDDM(DiffusionPipeline):
        torch_device = "cuda" if torch.cuda.is_available() else "cpu"
        self.diffwave.to(torch_device)

        mel_spectrogram = mel_spectrogram.to(torch_device)
        audio_length = mel_spectrogram.size(-1) * 256
        audio_size = (1, 1, audio_length)
......
@@ -24,15 +24,22 @@ import torch.utils.checkpoint
from torch import nn

import tqdm

try:
    from transformers import CLIPConfig, CLIPModel, CLIPTextConfig, CLIPVisionConfig, GPT2Tokenizer
    from transformers.activations import ACT2FN
    from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
    from transformers.modeling_utils import PreTrainedModel
    from transformers.utils import ModelOutput, add_start_docstrings_to_model_forward, replace_return_docstrings
except ImportError:
    print("Transformers is not installed")

from ..models import GLIDESuperResUNetModel, GLIDETextToImageUNetModel
from ..pipeline_utils import DiffusionPipeline
from ..schedulers import ClassifierFreeGuidanceScheduler, DDIMScheduler
from ..utils import logging


#####################
@@ -832,9 +839,7 @@ class GLIDE(DiffusionPipeline):
        # 1. Sample gaussian noise
        batch_size = 2  # second image is empty for classifier-free guidance
        image = torch.randn((batch_size, self.text_unet.in_channels, 64, 64), generator=generator).to(torch_device)

        # 2. Encode tokens
        # an empty input is needed to guide the model away from it
......
@@ -43,14 +43,13 @@ def generate_path(duration, mask):
    cum_duration_flat = cum_duration.view(b * t_x)
    path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
    path = path.view(b, t_x, t_y)
    path = path - torch.nn.functional.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
    path = path * mask
    return path


def duration_loss(logw, logw_, lengths):
    loss = torch.sum((logw - logw_) ** 2) / torch.sum(lengths)
    return loss
@@ -66,7 +65,7 @@ class LayerNorm(nn.Module):
    def forward(self, x):
        n_dims = len(x.shape)
        mean = torch.mean(x, 1, keepdim=True)
        variance = torch.mean((x - mean) ** 2, 1, keepdim=True)

        x = (x - mean) * torch.rsqrt(variance + self.eps)
@@ -76,8 +75,7 @@ class LayerNorm(nn.Module):
class ConvReluNorm(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout):
        super(ConvReluNorm, self).__init__()
        self.in_channels = in_channels
        self.hidden_channels = hidden_channels
@@ -88,13 +86,13 @@ class ConvReluNorm(nn.Module):
        self.conv_layers = torch.nn.ModuleList()
        self.norm_layers = torch.nn.ModuleList()
        self.conv_layers.append(torch.nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size // 2))
        self.norm_layers.append(LayerNorm(hidden_channels))
        self.relu_drop = torch.nn.Sequential(torch.nn.ReLU(), torch.nn.Dropout(p_dropout))
        for _ in range(n_layers - 1):
            self.conv_layers.append(
                torch.nn.Conv1d(hidden_channels, hidden_channels, kernel_size, padding=kernel_size // 2)
            )
            self.norm_layers.append(LayerNorm(hidden_channels))
        self.proj = torch.nn.Conv1d(hidden_channels, out_channels, 1)
        self.proj.weight.data.zero_()
@@ -118,11 +116,9 @@ class DurationPredictor(nn.Module):
        self.p_dropout = p_dropout

        self.drop = torch.nn.Dropout(p_dropout)
        self.conv_1 = torch.nn.Conv1d(in_channels, filter_channels, kernel_size, padding=kernel_size // 2)
        self.norm_1 = LayerNorm(filter_channels)
        self.conv_2 = torch.nn.Conv1d(filter_channels, filter_channels, kernel_size, padding=kernel_size // 2)
        self.norm_2 = LayerNorm(filter_channels)
        self.proj = torch.nn.Conv1d(filter_channels, 1, 1)
@@ -140,9 +136,17 @@ class DurationPredictor(nn.Module):
class MultiHeadAttention(nn.Module):
    def __init__(
        self,
        channels,
        out_channels,
        n_heads,
        window_size=None,
        heads_share=True,
        p_dropout=0.0,
        proximal_bias=False,
        proximal_init=False,
    ):
        super(MultiHeadAttention, self).__init__()
        assert channels % n_heads == 0
@@ -162,10 +166,12 @@ class MultiHeadAttention(nn.Module):
        if window_size is not None:
            n_heads_rel = 1 if heads_share else n_heads
            rel_stddev = self.k_channels**-0.5
            self.emb_rel_k = torch.nn.Parameter(
                torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev
            )
            self.emb_rel_v = torch.nn.Parameter(
                torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev
            )
        self.conv_o = torch.nn.Conv1d(channels, out_channels, 1)
        self.drop = torch.nn.Dropout(p_dropout)
@@ -175,12 +181,12 @@ class MultiHeadAttention(nn.Module):
            self.conv_k.weight.data.copy_(self.conv_q.weight.data)
            self.conv_k.bias.data.copy_(self.conv_q.bias.data)
        torch.nn.init.xavier_uniform_(self.conv_v.weight)

    def forward(self, x, c, attn_mask=None):
        q = self.conv_q(x)
        k = self.conv_k(c)
        v = self.conv_v(c)

        x, self.attn = self.attention(q, k, v, mask=attn_mask)

        x = self.conv_o(x)
@@ -202,8 +208,7 @@ class MultiHeadAttention(nn.Module):
            scores = scores + scores_local
        if self.proximal_bias:
            assert t_s == t_t, "Proximal bias is only available for self-attention."
            scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e4)
        p_attn = torch.nn.functional.softmax(scores, dim=-1)
@@ -212,8 +217,7 @@ class MultiHeadAttention(nn.Module):
        if self.window_size is not None:
            relative_weights = self._absolute_position_to_relative_position(p_attn)
            value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s)
            output = output + self._matmul_with_relative_values(relative_weights, value_relative_embeddings)
        output = output.transpose(2, 3).contiguous().view(b, d, t_t)
        return output, p_attn
@@ -231,28 +235,27 @@ class MultiHeadAttention(nn.Module):
        slice_end_position = slice_start_position + 2 * length - 1
        if pad_length > 0:
            padded_relative_embeddings = torch.nn.functional.pad(
                relative_embeddings, convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]])
            )
        else:
            padded_relative_embeddings = relative_embeddings
        used_relative_embeddings = padded_relative_embeddings[:, slice_start_position:slice_end_position]
        return used_relative_embeddings

    def _relative_position_to_absolute_position(self, x):
        batch, heads, length, _ = x.size()
        x = torch.nn.functional.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]]))
        x_flat = x.view([batch, heads, length * 2 * length])
        x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]))
        x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[:, :, :length, length - 1 :]
        return x_final

    def _absolute_position_to_relative_position(self, x):
        batch, heads, length, _ = x.size()
        x = torch.nn.functional.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]))
        x_flat = x.view([batch, heads, length**2 + length * (length - 1)])
        x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
        x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:]
        return x_final

    def _attention_bias_proximal(self, length):
@@ -262,8 +265,7 @@ class MultiHeadAttention(nn.Module):
class FFN(nn.Module):
    def __init__(self, in_channels, out_channels, filter_channels, kernel_size, p_dropout=0.0):
        super(FFN, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
@@ -271,10 +273,8 @@ class FFN(nn.Module):
        self.kernel_size = kernel_size
        self.p_dropout = p_dropout

        self.conv_1 = torch.nn.Conv1d(in_channels, filter_channels, kernel_size, padding=kernel_size // 2)
        self.conv_2 = torch.nn.Conv1d(filter_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.drop = torch.nn.Dropout(p_dropout)

    def forward(self, x, x_mask):
@@ -286,8 +286,17 @@ class FFN(nn.Module):
class Encoder(nn.Module):
    def __init__(
        self,
        hidden_channels,
        filter_channels,
        n_heads,
        n_layers,
        kernel_size=1,
        p_dropout=0.0,
        window_size=None,
        **kwargs,
    ):
        super(Encoder, self).__init__()
        self.hidden_channels = hidden_channels
        self.filter_channels = filter_channels
@@ -303,11 +312,15 @@ class Encoder(nn.Module):
        self.ffn_layers = torch.nn.ModuleList()
        self.norm_layers_2 = torch.nn.ModuleList()
        for _ in range(self.n_layers):
            self.attn_layers.append(
                MultiHeadAttention(
                    hidden_channels, hidden_channels, n_heads, window_size=window_size, p_dropout=p_dropout
                )
            )
            self.norm_layers_1.append(LayerNorm(hidden_channels))
            self.ffn_layers.append(
                FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout)
            )
            self.norm_layers_2.append(LayerNorm(hidden_channels))

    def forward(self, x, x_mask):
@@ -325,9 +338,21 @@ class Encoder(nn.Module):
class TextEncoder(ModelMixin, ConfigMixin):
    def __init__(
        self,
        n_vocab,
        n_feats,
        n_channels,
        filter_channels,
        filter_channels_dp,
        n_heads,
        n_layers,
        kernel_size,
        p_dropout,
        window_size=None,
        spk_emb_dim=64,
        n_spks=1,
    ):
        super(TextEncoder, self).__init__()
        self.register(
@@ -342,10 +367,9 @@ class TextEncoder(ModelMixin, ConfigMixin):
            p_dropout=p_dropout,
            window_size=window_size,
            spk_emb_dim=spk_emb_dim,
            n_spks=n_spks,
        )

        self.n_vocab = n_vocab
        self.n_feats = n_feats
        self.n_channels = n_channels
@@ -362,15 +386,22 @@ class TextEncoder(ModelMixin, ConfigMixin):
        self.emb = torch.nn.Embedding(n_vocab, n_channels)
        torch.nn.init.normal_(self.emb.weight, 0.0, n_channels**-0.5)
        self.prenet = ConvReluNorm(n_channels, n_channels, n_channels, kernel_size=5, n_layers=3, p_dropout=0.5)
        self.encoder = Encoder(
            n_channels + (spk_emb_dim if n_spks > 1 else 0),
            filter_channels,
            n_heads,
            n_layers,
            kernel_size,
            p_dropout,
            window_size=window_size,
        )
        self.proj_m = torch.nn.Conv1d(n_channels + (spk_emb_dim if n_spks > 1 else 0), n_feats, 1)
        self.proj_w = DurationPredictor(
            n_channels + (spk_emb_dim if n_spks > 1 else 0), filter_channels_dp, kernel_size, p_dropout
        )

    def forward(self, x, x_lengths, spk=None):
        x = self.emb(x) * math.sqrt(self.n_channels)
...
# Schedulers
- Schedulers are the algorithms used to run diffusion models, both in inference and in training. They include the noise schedules and define algorithm-specific diffusion steps.
- Schedulers can be used interchangeably between diffusion models in inference to find the preferred trade-off between speed and generation quality.
- Schedulers are implemented in NumPy, but can easily be converted to PyTorch, as sketched in the snippet below.
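The following minimal snippet only illustrates the NumPy-to-PyTorch point above; the variable names are made up for this example and are not part of the library:

```python
import numpy as np
import torch

# Illustration only: a scheduler keeps its noise schedule as NumPy arrays by default
# and can expose the same values as PyTorch tensors when requested.
betas_np = np.linspace(1e-4, 0.02, 1000, dtype=np.float32)  # beta schedule as a NumPy array
betas_pt = torch.from_numpy(betas_np)                       # the same schedule as a PyTorch tensor
```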
## API
- Schedulers should provide one or more `def step(...)` functions that are called iteratively to unroll the diffusion loop during the forward pass (see the sketch after this list).
- Schedulers should be framework-agnostic, but provide simple functionality to convert the scheduler into a specific framework, such as PyTorch, with a `set_format(...)` method.
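To make the `step(...)` idea concrete, here is a minimal, self-contained sketch of a DDPM-style denoising loop. The `ToyDDPMScheduler` class and its `step(model_output, timestep, sample)` signature are invented for this illustration and are not the library's API; see the scheduler files linked in the Examples section for the real implementations.

```python
import torch


# A toy DDPM-style scheduler, written from scratch for illustration only.
# It is NOT the diffusers API; see scheduling_ddpm.py for the real class.
class ToyDDPMScheduler:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)  # noise schedule
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)

    def step(self, model_output, timestep, sample):
        # One reverse-diffusion step: compute x_{t-1} from x_t and the predicted noise
        # (posterior mean from Ho et al. 2020, using sigma_t^2 = beta_t).
        alpha_t = self.alphas[timestep]
        alpha_bar_t = self.alphas_cumprod[timestep]
        mean = (sample - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * model_output) / alpha_t**0.5
        if timestep > 0:
            mean = mean + self.betas[timestep] ** 0.5 * torch.randn_like(sample)
        return mean


scheduler = ToyDDPMScheduler()
sample = torch.randn(1, 3, 32, 32)  # start from pure Gaussian noise

# Unroll the diffusion loop: each `step(...)` call produces the previous, less noisy sample.
for t in reversed(range(scheduler.num_timesteps)):
    model_output = torch.zeros_like(sample)  # stand-in for the model's predicted noise, e.g. model(sample, t)
    sample = scheduler.step(model_output, t, sample)
```

In practice the zero tensor would be replaced by the noise predicted by a trained model, and swapping in a different scheduler (for example DDIM or PNDM) with fewer steps is what trades generation quality against speed.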
## Examples
- The DDPM scheduler was proposed in [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) and can be found in [scheduling_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddpm.py). An example of how to use this scheduler can be found in [pipeline_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddpm.py).
- The DDIM scheduler was proposed in [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) and can be found in [scheduling_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddim.py). An example of how to use this scheduler can be found in [pipeline_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddim.py).
- The PNDM scheduler was proposed in [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) and can be found in [scheduling_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py). An example of how to use this scheduler can be found in [pipeline_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_pndm.py).