Unverified Commit 988369a0 authored by Suraj Patil, committed by GitHub

Merge branch 'main' into grad-tts

parents 5a3467e6 bed32182
<p align="center">
<br>
<img src="docs/source/imgs/diffusers_library.jpg" width="400"/>
<br>
</p>
<p align="center">
<a href="https://github.com/huggingface/diffusers/blob/main/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/diffusers.svg?color=blue">
</a>
<a href="https://github.com/huggingface/diffusers/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg">
</a>
<a href="CODE_OF_CONDUCT.md">
<img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg">
</a>
</p>
🤗 Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves
as a modular toolbox for inference and training of diffusion models.
More precisely, 🤗 Diffusers offers:
- State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code (see [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines)).
- Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference (see [src/diffusers/schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers)).
- Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system (see [src/diffusers/models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models)).
- Training examples to show how to train the most popular diffusion models (see [examples](https://github.com/huggingface/diffusers/tree/main/examples)).
## Definitions

**Models**: Neural network that models $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ (see image below) and is trained end-to-end to *denoise* a noisy input to an image.
*Examples*: UNet, Conditioned UNet, 3D UNet, Transformer UNet

![model_diff_1_50](https://user-images.githubusercontent.com/23423619/171610307-dab0cd8b-75da-4d4e-9f5a-5922072e2bb5.png)
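For intuition, here is a generic sketch of what such a model learns, written in standard DDPM notation (Ho et al., 2020) and not tied to any specific model class in this library: the network is a noise predictor $\epsilon_\theta$ trained on noised images.

```latex
% Forward noising used to build training pairs (alpha/beta schedule as defined under "Schedulers"):
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% Simple denoising objective the model is trained with:
L_{\text{simple}}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}\,
    \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2
```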
**Schedulers**: Algorithm class for both **inference** and **training**.
The class provides functionality to compute the previous image according to an alpha/beta schedule as well as to predict noise for training.
*Examples*: [DDPM](https://arxiv.org/abs/2006.11239), [DDIM](https://arxiv.org/abs/2010.02502), [PNDM](https://arxiv.org/abs/2202.09778), [DEIS](https://arxiv.org/abs/2204.13902)

![sampling](https://user-images.githubusercontent.com/23423619/171608981-3ad05953-a684-4c82-89f8-62a459147a07.png)
![training](https://user-images.githubusercontent.com/23423619/171608964-b3260cce-e6b4-4841-959d-7d8ba4b8d1b2.png)
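Concretely, a scheduler owns the noise schedule and turns the model's noise prediction into the previous image. A generic sketch of the quantities involved (standard DDPM notation, independent of any particular scheduler class in the library):

```latex
% Noise schedule:
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

% One reverse (sampling) step x_t -> x_{t-1}, given the model's noise prediction \epsilon_\theta(x_t, t):
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
          + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),\ \ \sigma_t^2 = \beta_t
```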
**Diffusion Pipeline**: End-to-end pipeline that includes multiple diffusion models, possible text encoders, ...
*Examples*: GLIDE, Latent Diffusion, Imagen, DALL-E 2

![imagen](https://user-images.githubusercontent.com/23423619/171609001-c3f2c1c9-f597-4a16-9843-749bf3f9431c.png)
## Philosophy
- Readability and clarity are preferred over highly optimized code. Strong emphasis is placed on providing readable, intuitive and elementary code design. *E.g.*, the provided [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) are separated from the provided [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and provide well-commented code that can be read alongside the original paper.
- Diffusers is **modality independent** and focuses on providing pretrained models and tools to build systems that generate **continuous outputs**, *e.g.* vision and audio.
- Diffusion models and schedulers are provided as concise, elementary building blocks, whereas diffusion pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box, should stay as close as possible to their original implementation, and can include components of other libraries, such as text encoders. Examples of diffusion pipelines are [Glide](https://github.com/openai/glide-text2im) and [Latent Diffusion](https://github.com/CompVis/latent-diffusion).
## Quickstart
### Installation
```
pip install diffusers  # should install diffusers 0.0.4
```
### 1. `diffusers` as a toolbox for schedulers and models
`diffusers` is more modularized than `transformers`. The idea is that researchers and engineers can easily use only parts of the library for their own use cases.
It could become a central place for all kinds of models, schedulers, training utils and processors that one can mix and match for one's own use case.
Both models and schedulers should be loadable and saveable from the Hub.
For more examples see [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers) and [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models).
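To illustrate the mix-and-match idea, a minimal sketch of loading and saving a single building block (the checkpoint name comes from the DDPM example below; `save_pretrained` on models and loading from a local directory are assumed to mirror the usual Hugging Face API):

```python
from diffusers import DDPMScheduler, UNetModel

# load a pretrained denoising model from the Hub and pair it with any scheduler
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church")
noise_scheduler = DDPMScheduler(timesteps=1000)

# save the model locally so it can be reloaded later
# (assumes ModelMixin exposes `save_pretrained`, analogous to pipelines)
unet.save_pretrained("./my-ddpm-unet")
unet = UNetModel.from_pretrained("./my-ddpm-unet")
```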
#### **Example for [DDPM](https://arxiv.org/abs/2006.11239):**
```python
import torch
import tqdm
import PIL.Image
import numpy as np
from diffusers import DDPMScheduler, UNetModel

generator = torch.manual_seed(0)
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load model and scheduler
noise_scheduler = DDPMScheduler(timesteps=1000)
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_prediction_steps = len(noise_scheduler)
for t in tqdm.tqdm(reversed(range(num_prediction_steps)), total=num_prediction_steps):
    # predict noise residual
    with torch.no_grad():
        residual = unet(image, t)

    # predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t)

    # optionally sample variance
    variance = 0
    if t > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * noise

    # set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 4. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])
image_pil.save("test.png")
```

#### **Example for [DDIM](https://arxiv.org/abs/2010.02502):**
```python
# setup (imports, generator, torch_device) as in the DDPM example above,
# but with a DDIM noise scheduler paired with the model
unet = UNetModel.from_pretrained("fusing/ddpm-celeba-hq").to(torch_device)

# 2. Sample gaussian noise
image = torch.randn(
    (1, unet.in_channels, unet.resolution, unet.resolution),
    generator=generator,
)
image = image.to(torch_device)

# 3. Denoise
num_inference_steps = 50
eta = 0.0  # <- deterministic sampling
for t in tqdm.tqdm(reversed(range(num_inference_steps)), total=num_inference_steps):
    # 1. predict noise residual
    orig_t = noise_scheduler.get_orig_t(t, num_inference_steps)
    with torch.inference_mode():
        residual = unet(image, orig_t)

    # 2. predict previous mean of image x_t-1
    pred_prev_image = noise_scheduler.step(residual, image, t, num_inference_steps, eta)

    # 3. optionally sample variance
    variance = 0
    if eta > 0:
        noise = torch.randn(image.shape, generator=generator).to(image.device)
        variance = noise_scheduler.get_variance(t).sqrt() * eta * noise

    # 4. set current image to prev_image: x_t -> x_t-1
    image = pred_prev_image + variance

# 5. process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])
image_pil.save("test.png")
```
### 2. `diffusers` as a collection of popular Diffusion systems (GLIDE, Dalle, ...)
The `DiffusionPipeline` class allows you to easily run a diffusion model in inference.
For more examples see [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines).
#### **Example image generation with PNDM**
```python
from diffusers import PNDM, UNetModel, PNDMScheduler
import PIL.Image
import numpy as np
import torch

model_id = "fusing/ddim-celeba-hq"

# load model and scheduler
model = UNetModel.from_pretrained(model_id)
scheduler = PNDMScheduler()
pndm = PNDM(unet=model, noise_scheduler=scheduler)

# run pipeline in inference (sample random noise and denoise)
with torch.no_grad():
    image = pndm()

# process image to PIL
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) / 2
image_processed = torch.clamp(image_processed, 0.0, 1.0)
image_processed = image_processed * 255
image_processed = image_processed.numpy().astype(np.uint8)
image_pil = PIL.Image.fromarray(image_processed[0])
image_pil.save("test.png")
```
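Full pipelines can also be loaded directly from the Hub. A minimal sketch based on the `DiffusionPipeline.from_pretrained` loader (the checkpoint name is only illustrative):

```python
from diffusers import DiffusionPipeline
import PIL.Image
import numpy as np

# load an end-to-end pipeline (model + scheduler) from the Hub
ddpm = DiffusionPipeline.from_pretrained("fusing/ddpm-lsun-bedroom")

# run it in inference: sample random noise and denoise it
image = ddpm()

# convert the [-1, 1] tensor output to a PIL image
image_processed = image.cpu().permute(0, 2, 3, 1)
image_processed = (image_processed + 1.0) * 127.5
image_processed = image_processed.numpy().astype(np.uint8)
PIL.Image.fromarray(image_processed[0]).save("ddpm_sample.png")
```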
#### **Text to speech with BDDM**
_Follow the instructions [here](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/) to load the Tacotron 2 model._
```python
import torch

# ... load Tacotron 2, generate a mel spectrogram, and run the BDDM vocoder pipeline
# on it to obtain `audio` (the middle of this example is omitted here) ...

sampling_rate = 22050
wavwrite("generated_audio.wav", sampling_rate, audio.squeeze().cpu().numpy())
```
## TODO
- Create common API for models [ ]
- Add tests for models [ ]
- Adapt schedulers for training [ ]
- Write google colab for training [ ]
- Write docs / Think about how to structure docs [ ]
- Add tests to circle ci [ ]
- Add [Diffusion LM models](https://arxiv.org/pdf/2205.14217.pdf) [ ]
- Add more vision models [ ]
- Add more speech models [ ]
- Add RL model [ ]
## Training examples
### Flowers DDPM
The command to train a DDPM UNet model on the Oxford Flowers dataset:
```bash
python -m torch.distributed.launch \
--nproc_per_node 4 \
train_ddpm.py \
--dataset="huggan/flowers-102-categories" \
--resolution=64 \
--output_path="flowers-ddpm" \
--batch_size=16 \
--num_epochs=100 \
--gradient_accumulation_steps=1 \
--lr=1e-4 \
--warmup_steps=500 \
--mixed_precision=no
```
A full training run takes 2 hours on 4 x V100 GPUs.
<img src="https://user-images.githubusercontent.com/26864830/173855866-5628989f-856b-4725-a944-d6c09490b2df.png" width="500" />
### Pokemon DDPM
The command to train a DDPM UNet model on the Pokemon dataset:
```bash
python -m torch.distributed.launch \
--nproc_per_node 4 \
train_ddpm.py \
--dataset="huggan/pokemon" \
--resolution=64 \
--output_path="pokemon-ddpm" \
--batch_size=16 \
--num_epochs=100 \
--gradient_accumulation_steps=1 \
--lr=1e-4 \
--warmup_steps=500 \
--mixed_precision=no
```
A full training run takes 2 hours on 4 x V100 GPUs.
<img src="https://user-images.githubusercontent.com/26864830/173856733-4f117f8c-97bd-4f51-8002-56b488c96df9.png" width="500" />
import argparse
import os

import torch
import torch.nn.functional as F

import PIL.Image
from accelerate import Accelerator
from datasets import load_dataset
from diffusers import DDPM, DDPMScheduler, UNetModel
from torchvision.transforms import (
    CenterCrop,
    Compose,
    InterpolationMode,
    Lambda,
    RandomHorizontalFlip,
    Resize,
    ToTensor,
@@ -31,44 +31,40 @@ def main(args):
        dropout=0.0,
        num_res_blocks=2,
        resamp_with_conv=True,
        resolution=args.resolution,
    )
    noise_scheduler = DDPMScheduler(timesteps=1000)
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

    augmentations = Compose(
        [
            Resize(args.resolution, interpolation=InterpolationMode.BILINEAR),
            CenterCrop(args.resolution),
            RandomHorizontalFlip(),
            ToTensor(),
            Lambda(lambda x: x * 2 - 1),
        ]
    )
    dataset = load_dataset(args.dataset, split="train")

    def transforms(examples):
        images = [augmentations(image.convert("RGB")) for image in examples["image"]]
        return {"input": images}

    dataset.set_transform(transforms)
    train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=True)

    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=(len(train_dataloader) * args.num_epochs) // args.gradient_accumulation_steps,
    )

    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )

    for epoch in range(args.num_epochs):
        model.train()
        with tqdm(total=len(train_dataloader), unit="ba") as pbar:
            pbar.set_description(f"Epoch {epoch}")
@@ -84,14 +80,15 @@ def main(args):
                    noise_samples[idx] = noise
                    noisy_images[idx] = noise_scheduler.forward_step(clean_images[idx], noise, timesteps[idx])

                if step % args.gradient_accumulation_steps != 0:
                    with accelerator.no_sync(model):
                        output = model(noisy_images, timesteps)
                        # predict the noise residual
                        loss = F.mse_loss(output, noise_samples)
                        accelerator.backward(loss)
                else:
                    output = model(noisy_images, timesteps)
                    # predict the noise residual
                    loss = F.mse_loss(output, noise_samples)
                    accelerator.backward(loss)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
@@ -103,13 +100,18 @@ def main(args):
                    optimizer.step()

        # Generate a sample image for visual inspection
        torch.distributed.barrier()
        if args.local_rank in [-1, 0]:
            model.eval()
            with torch.no_grad():
                if isinstance(model, torch.nn.parallel.DistributedDataParallel):
                    pipeline = DDPM(unet=model.module, noise_scheduler=noise_scheduler)
                else:
                    pipeline = DDPM(unet=model, noise_scheduler=noise_scheduler)
                pipeline.save_pretrained(args.output_path)

                generator = torch.manual_seed(0)
                # run pipeline in inference (sample random noise and denoise)
                image = pipeline(generator=generator)
@@ -120,22 +122,33 @@ def main(args):
                image_pil = PIL.Image.fromarray(image_processed[0])

                # save image
                test_dir = os.path.join(args.output_path, "test_samples")
                os.makedirs(test_dir, exist_ok=True)
                image_pil.save(f"{test_dir}/{epoch}.png")

        torch.distributed.barrier()
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Simple example of a training script.")
    parser.add_argument("--local_rank", type=int)
    parser.add_argument("--dataset", type=str, default="huggan/flowers-102-categories")
    parser.add_argument("--resolution", type=int, default=64)
    parser.add_argument("--output_path", type=str, default="ddpm-model")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--num_epochs", type=int, default=100)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default="no",
        choices=["no", "fp16", "bf16"],
        help=(
            "Whether to use mixed precision. Choose "
            "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
            "and an Nvidia Ampere GPU."
        ),
    )

    args = parser.parse_args()
......
@@ -87,7 +87,6 @@ _deps = [
    "regex!=2019.12.17",
    "requests",
    "torch>=1.4",
]

# this is a lookup table with items like:
@@ -172,13 +171,12 @@ install_requires = [
    deps["regex"],
    deps["requests"],
    deps["torch"],
    deps["Pillow"],
]

setup(
    name="diffusers",
    version="0.0.4",
    description="Diffusers",
    long_description=open("README.md", "r", encoding="utf-8").read(),
    long_description_content_type="text/markdown",
......
@@ -2,14 +2,14 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

__version__ = "0.0.4"

from .modeling_utils import ModelMixin
from .models.unet import UNetModel
from .models.unet_glide import GLIDESuperResUNetModel, GLIDETextToImageUNetModel, GLIDEUNetModel
from .models.unet_grad_tts import UNetGradTTSModel
from .models.unet_ldm import UNetLDMModel
from .pipeline_utils import DiffusionPipeline
from .pipelines import BDDM, DDIM, DDPM, GLIDE, PNDM, GradTTS, LatentDiffusion
from .schedulers import DDIMScheduler, DDPMScheduler, GradTTSScheduler, PNDMScheduler, SchedulerMixin
from .schedulers.classifier_free_guidance import ClassifierFreeGuidanceScheduler
@@ -226,7 +226,7 @@ class ConfigMixin:
        return json.loads(text)

    def __repr__(self):
        return f"{self.__class__.__name__} {self.to_json_string()}"

    @property
    def config(self) -> Dict[str, Any]:
......
@@ -13,5 +13,4 @@ deps = {
    "regex": "regex!=2019.12.17",
    "requests": "requests",
    "torch": "torch>=1.4",
}
# Models
- Models: Neural network that models $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ (see image below) and is trained end-to-end to denoise a noisy input to an image. Examples: UNet, Conditioned UNet, 3D UNet, Transformer UNet
## API
TODO(Suraj, Patrick)
## Examples
TODO(Suraj, Patrick)
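A minimal usage sketch of a model as a noise predictor (mirroring the DDPM quickstart in the top-level README; the checkpoint name is only illustrative):

```python
import torch
from diffusers import UNetModel

unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church")

# a batch of pure-noise "images" at the model's native resolution
x_t = torch.randn(1, unet.in_channels, unet.resolution, unet.resolution)
t = 999  # diffusion timestep

# the model predicts the noise residual for the given timestep
with torch.no_grad():
    residual = unet(x_t, t)
```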
@@ -17,6 +17,6 @@
# limitations under the License.

from .unet import UNetModel
from .unet_glide import GLIDESuperResUNetModel, GLIDETextToImageUNetModel, GLIDEUNetModel
from .unet_grad_tts import UNetGradTTSModel
from .unet_ldm import UNetLDMModel
@@ -26,7 +26,6 @@ from torch.optim import Adam
from torch.utils import data

from PIL import Image
from tqdm import tqdm

from ..configuration_utils import ConfigMixin
@@ -331,171 +330,3 @@ class UNetModel(ModelMixin, ConfigMixin):
        h = nonlinearity(h)
        h = self.conv_out(h)
        return h
# dataset classes
class Dataset(data.Dataset):
def __init__(self, folder, image_size, exts=["jpg", "jpeg", "png"]):
super().__init__()
self.folder = folder
self.image_size = image_size
self.paths = [p for ext in exts for p in Path(f"{folder}").glob(f"**/*.{ext}")]
self.transform = transforms.Compose(
[
transforms.Resize(image_size),
transforms.RandomHorizontalFlip(),
transforms.CenterCrop(image_size),
transforms.ToTensor(),
]
)
def __len__(self):
return len(self.paths)
def __getitem__(self, index):
path = self.paths[index]
img = Image.open(path)
return self.transform(img)
# trainer class
class EMA:
def __init__(self, beta):
super().__init__()
self.beta = beta
def update_model_average(self, ma_model, current_model):
for current_params, ma_params in zip(current_model.parameters(), ma_model.parameters()):
old_weight, up_weight = ma_params.data, current_params.data
ma_params.data = self.update_average(old_weight, up_weight)
def update_average(self, old, new):
if old is None:
return new
return old * self.beta + (1 - self.beta) * new
def cycle(dl):
while True:
for data_dl in dl:
yield data_dl
def num_to_groups(num, divisor):
groups = num // divisor
remainder = num % divisor
arr = [divisor] * groups
if remainder > 0:
arr.append(remainder)
return arr
class Trainer(object):
def __init__(
self,
diffusion_model,
folder,
*,
ema_decay=0.995,
image_size=128,
train_batch_size=32,
train_lr=1e-4,
train_num_steps=100000,
gradient_accumulate_every=2,
amp=False,
step_start_ema=2000,
update_ema_every=10,
save_and_sample_every=1000,
results_folder="./results",
):
super().__init__()
self.model = diffusion_model
self.ema = EMA(ema_decay)
self.ema_model = copy.deepcopy(self.model)
self.update_ema_every = update_ema_every
self.step_start_ema = step_start_ema
self.save_and_sample_every = save_and_sample_every
self.batch_size = train_batch_size
self.image_size = diffusion_model.image_size
self.gradient_accumulate_every = gradient_accumulate_every
self.train_num_steps = train_num_steps
self.ds = Dataset(folder, image_size)
self.dl = cycle(data.DataLoader(self.ds, batch_size=train_batch_size, shuffle=True, pin_memory=True))
self.opt = Adam(diffusion_model.parameters(), lr=train_lr)
self.step = 0
self.amp = amp
self.scaler = GradScaler(enabled=amp)
self.results_folder = Path(results_folder)
self.results_folder.mkdir(exist_ok=True)
self.reset_parameters()
def reset_parameters(self):
self.ema_model.load_state_dict(self.model.state_dict())
def step_ema(self):
if self.step < self.step_start_ema:
self.reset_parameters()
return
self.ema.update_model_average(self.ema_model, self.model)
def save(self, milestone):
data = {
"step": self.step,
"model": self.model.state_dict(),
"ema": self.ema_model.state_dict(),
"scaler": self.scaler.state_dict(),
}
torch.save(data, str(self.results_folder / f"model-{milestone}.pt"))
def load(self, milestone):
data = torch.load(str(self.results_folder / f"model-{milestone}.pt"))
self.step = data["step"]
self.model.load_state_dict(data["model"])
self.ema_model.load_state_dict(data["ema"])
self.scaler.load_state_dict(data["scaler"])
def train(self):
with tqdm(initial=self.step, total=self.train_num_steps) as pbar:
while self.step < self.train_num_steps:
for i in range(self.gradient_accumulate_every):
data = next(self.dl).cuda()
with autocast(enabled=self.amp):
loss = self.model(data)
self.scaler.scale(loss / self.gradient_accumulate_every).backward()
pbar.set_description(f"loss: {loss.item():.4f}")
self.scaler.step(self.opt)
self.scaler.update()
self.opt.zero_grad()
if self.step % self.update_ema_every == 0:
self.step_ema()
if self.step != 0 and self.step % self.save_and_sample_every == 0:
self.ema_model.eval()
milestone = self.step // self.save_and_sample_every
batches = num_to_groups(36, self.batch_size)
all_images_list = list(map(lambda n: self.ema_model.sample(batch_size=n), batches))
all_images = torch.cat(all_images_list, dim=0)
utils.save_image(all_images, str(self.results_folder / f"sample-{milestone}.png"), nrow=6)
self.save(milestone)
self.step += 1
pbar.update(1)
print("training complete")
@@ -2,6 +2,7 @@ import math
import torch

try:
    from einops import rearrange
except:
@@ -11,6 +12,7 @@ except:
from ..configuration_utils import ConfigMixin
from ..modeling_utils import ModelMixin


class Mish(torch.nn.Module):
    def forward(self, x):
        return x * torch.tanh(torch.nn.functional.softplus(x))
@@ -47,9 +49,9 @@ class Rezero(torch.nn.Module):
class Block(torch.nn.Module):
    def __init__(self, dim, dim_out, groups=8):
        super(Block, self).__init__()
        self.block = torch.nn.Sequential(
            torch.nn.Conv2d(dim, dim_out, 3, padding=1), torch.nn.GroupNorm(groups, dim_out), Mish()
        )

    def forward(self, x, mask):
        output = self.block(x * mask)
@@ -59,8 +61,7 @@ class Block(torch.nn.Module):
class ResnetBlock(torch.nn.Module):
    def __init__(self, dim, dim_out, time_emb_dim, groups=8):
        super(ResnetBlock, self).__init__()
        self.mlp = torch.nn.Sequential(Mish(), torch.nn.Linear(time_emb_dim, dim_out))

        self.block1 = Block(dim, dim_out, groups=groups)
        self.block2 = Block(dim_out, dim_out, groups=groups)
@@ -83,18 +84,16 @@ class LinearAttention(torch.nn.Module):
        self.heads = heads
        hidden_dim = dim_head * heads
        self.to_qkv = torch.nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
        self.to_out = torch.nn.Conv2d(hidden_dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        qkv = self.to_qkv(x)
        q, k, v = rearrange(qkv, "b (qkv heads c) h w -> qkv b heads c (h w)", heads=self.heads, qkv=3)
        k = k.softmax(dim=-1)
        context = torch.einsum("bhdn,bhen->bhde", k, v)
        out = torch.einsum("bhde,bhdn->bhen", context, q)
        out = rearrange(out, "b heads c (h w) -> b (heads c) h w", heads=self.heads, h=h, w=w)
        return self.to_out(out)
@@ -124,16 +123,7 @@ class SinusoidalPosEmb(torch.nn.Module):
class UNetGradTTSModel(ModelMixin, ConfigMixin):
    def __init__(self, dim, dim_mults=(1, 2, 4), groups=8, n_spks=None, spk_emb_dim=64, n_feats=80, pe_scale=1000):
        super(UNetGradTTSModel, self).__init__()

        self.register(
@@ -143,23 +133,23 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
            n_spks=n_spks,
            spk_emb_dim=spk_emb_dim,
            n_feats=n_feats,
            pe_scale=pe_scale,
        )

        self.dim = dim
        self.dim_mults = dim_mults
        self.groups = groups
        self.n_spks = n_spks if not isinstance(n_spks, type(None)) else 1
        self.spk_emb_dim = spk_emb_dim
        self.pe_scale = pe_scale

        if n_spks > 1:
            self.spk_emb = torch.nn.Embedding(n_spks, spk_emb_dim)
            self.spk_mlp = torch.nn.Sequential(torch.nn.Linear(spk_emb_dim, spk_emb_dim * 4), Mish(),
                                               torch.nn.Linear(spk_emb_dim * 4, n_feats))
        self.time_pos_emb = SinusoidalPosEmb(dim)
        self.mlp = torch.nn.Sequential(torch.nn.Linear(dim, dim * 4), Mish(), torch.nn.Linear(dim * 4, dim))

        dims = [2 + (1 if n_spks > 1 else 0), *map(lambda m: dim * m, dim_mults)]
        in_out = list(zip(dims[:-1], dims[1:]))
@@ -169,11 +159,16 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        for ind, (dim_in, dim_out) in enumerate(in_out):
            is_last = ind >= (num_resolutions - 1)
            self.downs.append(
                torch.nn.ModuleList(
                    [
                        ResnetBlock(dim_in, dim_out, time_emb_dim=dim),
                        ResnetBlock(dim_out, dim_out, time_emb_dim=dim),
                        Residual(Rezero(LinearAttention(dim_out))),
                        Downsample(dim_out) if not is_last else torch.nn.Identity(),
                    ]
                )
            )

        mid_dim = dims[-1]
        self.mid_block1 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim)
@@ -181,11 +176,16 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        self.mid_block2 = ResnetBlock(mid_dim, mid_dim, time_emb_dim=dim)

        for ind, (dim_in, dim_out) in enumerate(reversed(in_out[1:])):
            self.ups.append(
                torch.nn.ModuleList(
                    [
                        ResnetBlock(dim_out * 2, dim_in, time_emb_dim=dim),
                        ResnetBlock(dim_in, dim_in, time_emb_dim=dim),
                        Residual(Rezero(LinearAttention(dim_in))),
                        Upsample(dim_in),
                    ]
                )
            )

        self.final_block = Block(dim, dim)
        self.final_conv = torch.nn.Conv2d(dim, 1, 1)
@@ -196,7 +196,7 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        if not isinstance(spk, type(None)):
            s = self.spk_mlp(spk)

        t = self.time_pos_emb(t, scale=self.pe_scale)
        t = self.mlp(t)
@@ -235,4 +235,4 @@ class UNetGradTTSModel(ModelMixin, ConfigMixin):
        x = self.final_block(x, mask)
        output = self.final_conv(x * mask)
        return (output * mask).squeeze(1)
@@ -57,14 +57,14 @@ class DiffusionPipeline(ConfigMixin):
    def register_modules(self, **kwargs):
        # import it here to avoid circular import
        from diffusers import pipelines

        for name, module in kwargs.items():
            # check if the module is a pipeline module
            is_pipeline_module = hasattr(pipelines, module.__module__.split(".")[-1])
            # retrieve library
            library = module.__module__.split(".")[0]
            # if library is not in LOADABLE_CLASSES, then it is a custom module.
            # Or if it's a pipeline module, then the module is inside the pipeline
            # so we set the library to module name.
@@ -160,10 +160,10 @@ class DiffusionPipeline(ConfigMixin):
        init_dict, _ = pipeline_class.extract_init_dict(config_dict, **kwargs)

        init_kwargs = {}

        # import it here to avoid circular import
        from diffusers import pipelines

        # 4. Load each module in the pipeline
        for name, (library_name, class_name) in init_dict.items():
            is_pipeline_module = hasattr(pipelines, library_name)
......
# Pipelines
- Pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box
- Pipelines should stay as close as possible to their original implementation
- Pipelines can include components of other libraries, such as text encoders.
## API
TODO(Patrick, Anton, Suraj)
## Examples
- DDPM for unconditional image generation in [pipeline_ddpm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddpm.py).
- DDIM for unconditional image generation in [pipeline_ddim](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddim.py).
- PNDM for unconditional image generation in [pipeline_pndm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_pndm.py).
- Latent diffusion for text to image generation / conditional image generation in [pipeline_latent_diffusion](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_latent_diffusion.py).
- Glide for text to image generation / conditional image generation in [pipeline_glide](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_glide.py).
- BDDM for spectrogram-to-sound vocoding in [pipeline_bddm](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_bddm.py).
- Grad-TTS for text to audio generation / conditional audio generation in [pipeline_grad_tts](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_grad_tts.py).
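
All of the pipelines above follow the same call pattern. A minimal sketch for unconditional image generation with DDPM, built explicitly from its components as in the training example (checkpoint name illustrative):

```python
from diffusers import DDPM, DDPMScheduler, UNetModel

# build the pipeline from its two building blocks
unet = UNetModel.from_pretrained("fusing/ddpm-lsun-church")
noise_scheduler = DDPMScheduler(timesteps=1000)
ddpm = DDPM(unet=unet, noise_scheduler=noise_scheduler)

# unconditional generation: sample random noise and denoise it
image = ddpm()
```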
from .pipeline_bddm import BDDM
from .pipeline_ddim import DDIM
from .pipeline_ddpm import DDPM
from .pipeline_grad_tts import GradTTS

try:
    from .pipeline_glide import GLIDE
except (NameError, ImportError):

    class GLIDE:
        pass

from .pipeline_latent_diffusion import LatentDiffusion
from .pipeline_pndm import PNDM
@@ -283,7 +283,7 @@ class BDDM(DiffusionPipeline):
        torch_device = "cuda" if torch.cuda.is_available() else "cpu"
        self.diffwave.to(torch_device)

        mel_spectrogram = mel_spectrogram.to(torch_device)
        audio_length = mel_spectrogram.size(-1) * 256
        audio_size = (1, 1, audio_length)
......
@@ -24,15 +24,22 @@ import torch.utils.checkpoint
from torch import nn

import tqdm

try:
    from transformers import CLIPConfig, CLIPModel, CLIPTextConfig, CLIPVisionConfig, GPT2Tokenizer
    from transformers.activations import ACT2FN
    from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
    from transformers.modeling_utils import PreTrainedModel
    from transformers.utils import ModelOutput, add_start_docstrings_to_model_forward, replace_return_docstrings
except ImportError:
    print("Transformers is not installed")

from ..models import GLIDESuperResUNetModel, GLIDETextToImageUNetModel
from ..pipeline_utils import DiffusionPipeline
from ..schedulers import ClassifierFreeGuidanceScheduler, DDIMScheduler
from ..utils import logging


#####################
@@ -832,9 +839,7 @@ class GLIDE(DiffusionPipeline):
        # 1. Sample gaussian noise
        batch_size = 2  # second image is empty for classifier-free guidance
        image = torch.randn((batch_size, self.text_unet.in_channels, 64, 64), generator=generator).to(torch_device)

        # 2. Encode tokens
        # an empty input is needed to guide the model away from it
......
@@ -43,14 +43,13 @@ def generate_path(duration, mask):
    cum_duration_flat = cum_duration.view(b * t_x)
    path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
    path = path.view(b, t_x, t_y)
    path = path - torch.nn.functional.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
    path = path * mask
    return path


def duration_loss(logw, logw_, lengths):
    loss = torch.sum((logw - logw_) ** 2) / torch.sum(lengths)
    return loss
@@ -66,7 +65,7 @@ class LayerNorm(nn.Module):
    def forward(self, x):
        n_dims = len(x.shape)
        mean = torch.mean(x, 1, keepdim=True)
        variance = torch.mean((x - mean) ** 2, 1, keepdim=True)

        x = (x - mean) * torch.rsqrt(variance + self.eps)
@@ -76,8 +75,7 @@ class LayerNorm(nn.Module):
class ConvReluNorm(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout):
        super(ConvReluNorm, self).__init__()
        self.in_channels = in_channels
        self.hidden_channels = hidden_channels
@@ -88,13 +86,13 @@ class ConvReluNorm(nn.Module):
        self.conv_layers = torch.nn.ModuleList()
        self.norm_layers = torch.nn.ModuleList()
        self.conv_layers.append(torch.nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size // 2))
        self.norm_layers.append(LayerNorm(hidden_channels))
        self.relu_drop = torch.nn.Sequential(torch.nn.ReLU(), torch.nn.Dropout(p_dropout))
        for _ in range(n_layers - 1):
            self.conv_layers.append(
                torch.nn.Conv1d(hidden_channels, hidden_channels, kernel_size, padding=kernel_size // 2)
            )
            self.norm_layers.append(LayerNorm(hidden_channels))
        self.proj = torch.nn.Conv1d(hidden_channels, out_channels, 1)
        self.proj.weight.data.zero_()
@@ -118,11 +116,9 @@ class DurationPredictor(nn.Module):
        self.p_dropout = p_dropout

        self.drop = torch.nn.Dropout(p_dropout)
        self.conv_1 = torch.nn.Conv1d(in_channels, filter_channels, kernel_size, padding=kernel_size // 2)
        self.norm_1 = LayerNorm(filter_channels)
        self.conv_2 = torch.nn.Conv1d(filter_channels, filter_channels, kernel_size, padding=kernel_size // 2)
        self.norm_2 = LayerNorm(filter_channels)
        self.proj = torch.nn.Conv1d(filter_channels, 1, 1)
@@ -140,9 +136,17 @@ class DurationPredictor(nn.Module):
class MultiHeadAttention(nn.Module):
    def __init__(
        self,
        channels,
        out_channels,
        n_heads,
        window_size=None,
        heads_share=True,
        p_dropout=0.0,
        proximal_bias=False,
        proximal_init=False,
    ):
        super(MultiHeadAttention, self).__init__()
        assert channels % n_heads == 0
@@ -162,10 +166,12 @@ class MultiHeadAttention(nn.Module):
        if window_size is not None:
            n_heads_rel = 1 if heads_share else n_heads
            rel_stddev = self.k_channels**-0.5
            self.emb_rel_k = torch.nn.Parameter(
                torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev
            )
            self.emb_rel_v = torch.nn.Parameter(
                torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev
            )
        self.conv_o = torch.nn.Conv1d(channels, out_channels, 1)
        self.drop = torch.nn.Dropout(p_dropout)
@@ -175,12 +181,12 @@ class MultiHeadAttention(nn.Module):
            self.conv_k.weight.data.copy_(self.conv_q.weight.data)
            self.conv_k.bias.data.copy_(self.conv_q.bias.data)
        torch.nn.init.xavier_uniform_(self.conv_v.weight)

    def forward(self, x, c, attn_mask=None):
        q = self.conv_q(x)
        k = self.conv_k(c)
        v = self.conv_v(c)

        x, self.attn = self.attention(q, k, v, mask=attn_mask)

        x = self.conv_o(x)
@@ -202,8 +208,7 @@ class MultiHeadAttention(nn.Module):
            scores = scores + scores_local
        if self.proximal_bias:
            assert t_s == t_t, "Proximal bias is only available for self-attention."
            scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e4)
        p_attn = torch.nn.functional.softmax(scores, dim=-1)
@@ -212,8 +217,7 @@ class MultiHeadAttention(nn.Module):
        if self.window_size is not None:
            relative_weights = self._absolute_position_to_relative_position(p_attn)
            value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s)
            output = output + self._matmul_with_relative_values(relative_weights, value_relative_embeddings)
        output = output.transpose(2, 3).contiguous().view(b, d, t_t)
        return output, p_attn
@@ -231,28 +235,27 @@ class MultiHeadAttention(nn.Module):
        slice_end_position = slice_start_position + 2 * length - 1
        if pad_length > 0:
            padded_relative_embeddings = torch.nn.functional.pad(
                relative_embeddings, convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]])
            )
        else:
            padded_relative_embeddings = relative_embeddings
        used_relative_embeddings = padded_relative_embeddings[:, slice_start_position:slice_end_position]
        return used_relative_embeddings

    def _relative_position_to_absolute_position(self, x):
        batch, heads, length, _ = x.size()
        x = torch.nn.functional.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]]))
        x_flat = x.view([batch, heads, length * 2 * length])
        x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]))
        x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[:, :, :length, length - 1 :]
        return x_final

    def _absolute_position_to_relative_position(self, x):
        batch, heads, length, _ = x.size()
        x = torch.nn.functional.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]))
        x_flat = x.view([batch, heads, length**2 + length * (length - 1)])
        x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
        x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:]
        return x_final

    def _attention_bias_proximal(self, length):
@@ -262,8 +265,7 @@ class MultiHeadAttention(nn.Module):
class FFN(nn.Module):
    def __init__(self, in_channels, out_channels, filter_channels, kernel_size, p_dropout=0.0):
        super(FFN, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
@@ -271,10 +273,8 @@ class FFN(nn.Module):
        self.kernel_size = kernel_size
        self.p_dropout = p_dropout

        self.conv_1 = torch.nn.Conv1d(in_channels, filter_channels, kernel_size, padding=kernel_size // 2)
        self.conv_2 = torch.nn.Conv1d(filter_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.drop = torch.nn.Dropout(p_dropout)

    def forward(self, x, x_mask):
@@ -286,8 +286,17 @@ class FFN(nn.Module):
class Encoder(nn.Module):
    def __init__(
        self,
        hidden_channels,
        filter_channels,
        n_heads,
        n_layers,
        kernel_size=1,
        p_dropout=0.0,
        window_size=None,
        **kwargs,
    ):
        super(Encoder, self).__init__()
        self.hidden_channels = hidden_channels
        self.filter_channels = filter_channels
@@ -303,11 +312,15 @@ class Encoder(nn.Module):
        self.ffn_layers = torch.nn.ModuleList()
        self.norm_layers_2 = torch.nn.ModuleList()
        for _ in range(self.n_layers):
            self.attn_layers.append(
                MultiHeadAttention(
                    hidden_channels, hidden_channels, n_heads, window_size=window_size, p_dropout=p_dropout
                )
            )
            self.norm_layers_1.append(LayerNorm(hidden_channels))
            self.ffn_layers.append(
                FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout)
            )
            self.norm_layers_2.append(LayerNorm(hidden_channels))

    def forward(self, x, x_mask):
@@ -325,9 +338,21 @@ class Encoder(nn.Module):
class TextEncoder(ModelMixin, ConfigMixin):
    def __init__(
        self,
        n_vocab,
        n_feats,
        n_channels,
        filter_channels,
        filter_channels_dp,
        n_heads,
        n_layers,
        kernel_size,
        p_dropout,
        window_size=None,
        spk_emb_dim=64,
        n_spks=1,
    ):
        super(TextEncoder, self).__init__()
        self.register(
@@ -342,10 +367,9 @@ class TextEncoder(ModelMixin, ConfigMixin):
            p_dropout=p_dropout,
            window_size=window_size,
            spk_emb_dim=spk_emb_dim,
            n_spks=n_spks,
        )

        self.n_vocab = n_vocab
        self.n_feats = n_feats
        self.n_channels = n_channels
@@ -362,15 +386,22 @@ class TextEncoder(ModelMixin, ConfigMixin):
        self.emb = torch.nn.Embedding(n_vocab, n_channels)
        torch.nn.init.normal_(self.emb.weight, 0.0, n_channels**-0.5)
        self.prenet = ConvReluNorm(n_channels, n_channels, n_channels, kernel_size=5, n_layers=3, p_dropout=0.5)
        self.encoder = Encoder(
            n_channels + (spk_emb_dim if n_spks > 1 else 0),
            filter_channels,
            n_heads,
            n_layers,
            kernel_size,
            p_dropout,
            window_size=window_size,
        )
        self.proj_m = torch.nn.Conv1d(n_channels + (spk_emb_dim if n_spks > 1 else 0), n_feats, 1)
        self.proj_w = DurationPredictor(
            n_channels + (spk_emb_dim if n_spks > 1 else 0), filter_channels_dp, kernel_size, p_dropout
        )

    def forward(self, x, x_lengths, spk=None):
        x = self.emb(x) * math.sqrt(self.n_channels)
...
# Schedulers
- Schedulers are the algorithms used to run diffusion models, both in inference and in training. They include the noise schedules and define algorithm-specific diffusion steps.
- Schedulers can be used interchangeably between diffusion models in inference to find the preferred trade-off between speed and generation quality.
- Schedulers are implemented in NumPy, but can easily be converted to PyTorch, as sketched in the snippet below.
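The following minimal snippet only illustrates the NumPy-to-PyTorch point above; the variable names are made up for this example and are not part of the library:

```python
import numpy as np
import torch

# Illustration only: a scheduler keeps its noise schedule as NumPy arrays by default
# and can expose the same values as PyTorch tensors when requested.
betas_np = np.linspace(1e-4, 0.02, 1000, dtype=np.float32)  # beta schedule as a NumPy array
betas_pt = torch.from_numpy(betas_np)                       # the same schedule as a PyTorch tensor
```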
## API
- Schedulers should provide one or more `def step(...)` functions that are called iteratively to unroll the diffusion loop during the forward pass (see the sketch after this list).
- Schedulers should be framework-agnostic, but provide simple functionality to convert the scheduler into a specific framework, such as PyTorch, with a `set_format(...)` method.
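To make the `step(...)` idea concrete, here is a minimal, self-contained sketch of a DDPM-style denoising loop. The `ToyDDPMScheduler` class and its `step(model_output, timestep, sample)` signature are invented for this illustration and are not the library's API; see the scheduler files linked in the Examples section for the real implementations.

```python
import torch


# A toy DDPM-style scheduler, written from scratch for illustration only.
# It is NOT the diffusers API; see scheduling_ddpm.py for the real class.
class ToyDDPMScheduler:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)  # noise schedule
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)

    def step(self, model_output, timestep, sample):
        # One reverse-diffusion step: compute x_{t-1} from x_t and the predicted noise
        # (posterior mean from Ho et al. 2020, using sigma_t^2 = beta_t).
        alpha_t = self.alphas[timestep]
        alpha_bar_t = self.alphas_cumprod[timestep]
        mean = (sample - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * model_output) / alpha_t**0.5
        if timestep > 0:
            mean = mean + self.betas[timestep] ** 0.5 * torch.randn_like(sample)
        return mean


scheduler = ToyDDPMScheduler()
sample = torch.randn(1, 3, 32, 32)  # start from pure Gaussian noise

# Unroll the diffusion loop: each `step(...)` call produces the previous, less noisy sample.
for t in reversed(range(scheduler.num_timesteps)):
    model_output = torch.zeros_like(sample)  # stand-in for the model's predicted noise, e.g. model(sample, t)
    sample = scheduler.step(model_output, t, sample)
```

In practice the zero tensor would be replaced by the noise predicted by a trained model, and swapping in a different scheduler (for example DDIM or PNDM) with fewer steps is what trades generation quality against speed.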
## Examples
- The DDPM scheduler was proposed in [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) and can be found in [scheduling_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddpm.py). An example of how to use this scheduler can be found in [pipeline_ddpm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddpm.py).
- The DDIM scheduler was proposed in [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) and can be found in [scheduling_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddim.py). An example of how to use this scheduler can be found in [pipeline_ddim.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_ddim.py).
- The PNDM scheduler was proposed in [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) and can be found in [scheduling_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py). An example of how to use this scheduler can be found in [pipeline_pndm.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_pndm.py).