Unverified commit 45f6d52b authored by YiYi Xu, committed by GitHub

Add Shap-E (#3742)



* refactor prior_transformer

adding conversion script

add pipeline

add step_index from pipeline, + remove permute

add zero pad token

remove copy from statement for betas_for_alpha_bar function

* add

* add

* update conversion script for renderer model

* refactor camera a little bit

* clean up

* style

* fix copies

* Update src/diffusers/schedulers/scheduling_heun_discrete.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/shap_e/pipeline_shap_e.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/shap_e/pipeline_shap_e.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* alpha_transform_type

* remove step_index argument

* remove get_sigmas_karras

* remove _yiyi_sigma_to_t

* move the rescale prompt_embeds from prior_transformer to pipeline

* replace baddbmm with einsum to match original repo

* Revert "replace baddbmm with einsum to match origial repo"

This reverts commit 3f6b435d65dad3e5514cad2f5dd9e4419ca78e0b.

* add step_index to scale_model_input

* Revert "move the rescale prompt_embeds from prior_transformer to pipeline"

This reverts commit 5b5a8e6be918fefd114a2945ed89d8e8fa8be21b.

* move rescale from prior_transformer to pipeline

* correct step_index in scale_model_input

* remove print lines

* refactor prior - reduce arguments

* make style

* add prior_image

* arg embedding_proj_norm -> norm_embedding_proj

* add pre-norm for proj_embedding

* move rescale prompt from pipeline to _encode_prompt

* add img2img pipeline

* style

* copies

* Update src/diffusers/models/prior_transformer.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py

add arg: encoder_hid_proj
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py

add new config: norm_in_type
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py

add new config: added_emb_type
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py

rename out_dim -> clip_embed_dim
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py

rename config: out_dim -> clip_embed_dim
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/models/prior_transformer.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* finish refactor prior_transformer

* make style

* refactor renderer

* fix

* make style

* refactor img2img

* remove params_proj

* add test

* add upcast_softmax to prior_transformer

* enable num_images_per_prompt, add save_gif utility

* add

* add fast test

* make style

* add slow test

* style

* add test for img2img

* refactor

* enable batching

* style

* refactor scheduler

* update test

* style

* attempt to solve batch related tests timeout

* add doc

* Update src/diffusers/pipelines/shap_e/pipeline_shap_e.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* hardcode rendering related config

* update betas_for_alpha_bar on ddpm_scheduler

* fix copies

* fix

* export_to_gif

* style

* second attempt to speed up batching tests

* add doc page to index

* Remove intermediate clipping

* 3rd attempt to speed up batching tests

* Remove time index

* simplify scheduler

* Fix more

* Fix more

* fix more

* make style

* fix schedulers

* fix some more tests

* finish

* add one more test

* Apply suggestions from code review
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* style

* apply feedback

* style

* fix copies

* add one example

* style

* add example for img2img

* fix doc

* fix more doc strings

* size -> frame_size

* style

* update doc

* style

* fix on doc

* update repo name

* improve the usage example in shap-e img2img

* add usage examples in the shap-e docs.

* consolidate examples.

* minor fix.

* update doc

* Apply suggestions from code review

* Apply suggestions from code review

* remove upcast

* Make sure background is white

* Update src/diffusers/pipelines/shap_e/pipeline_shap_e.py

* Apply suggestions from code review

* Finish

* Apply suggestions from code review

* Update src/diffusers/pipelines/shap_e/pipeline_shap_e.py

* Make style

---------
Co-authored-by: yiyixuxu <yixu310@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
parent 74621567
@@ -226,6 +226,8 @@
    title: Self-Attention Guidance
  - local: api/pipelines/semantic_stable_diffusion
    title: Semantic Guidance
  - local: api/pipelines/shap_e
    title: Shap-E
  - local: api/pipelines/spectrogram_diffusion
    title: Spectrogram Diffusion
- sections:
...
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Shap-E
## Overview
The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://arxiv.org/abs/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai).
The abstract of the paper is the following:
*We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.*
The original codebase can be found [here](https://github.com/openai/shap-e).
## Available Pipelines:
| Pipeline | Tasks |
|---|---|
| [pipeline_shap_e.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e.py) | *Text-to-3D Generation* |
| [pipeline_shap_e_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/shap_e/pipeline_shap_e_img2img.py) | *Image-to-3D Generation* |
## Available checkpoints
* [`openai/shap-e`](https://huggingface.co/openai/shap-e)
* [`openai/shap-e-img2img`](https://huggingface.co/openai/shap-e-img2img)
## Usage Examples
In the following, we will walk you through some examples of how to use the Shap-E pipelines to create 3D objects in GIF format.
### Text-to-3D image generation
We can use [`ShapEPipeline`] to create a 3D object based on a text prompt. In this example, we will make a birthday cupcake for the :firecracker: diffusers library's first birthday. The workflow for the Shap-E text-to-3D pipeline is the same as for the other text-to-image pipelines in diffusers.
```python
import torch
from diffusers import DiffusionPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = ["A firecracker", "A birthday cupcake"]
images = pipe(
    prompt,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
).images
```
The output of [`ShapEPipeline`] is a list of lists of image frames. Each list of frames can be used to create a 3D object. Let's use the `export_to_gif` utility function in diffusers to turn them into GIFs!
```python
from diffusers.utils import export_to_gif
export_to_gif(images[0], "firecracker_3d.gif")
export_to_gif(images[1], "cake_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif)
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif)
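This PR also enables batching and `num_images_per_prompt`, so you can ask for several variations of the same prompt in one call. A quick sketch (reusing the `pipe` loaded above; the output file names are just illustrative):
```python
from diffusers.utils import export_to_gif

# Two variations of one prompt; the output holds
# len(prompts) * num_images_per_prompt lists of frames.
images = pipe(
    "A birthday cupcake",
    num_images_per_prompt=2,
    guidance_scale=15.0,
    num_inference_steps=64,
    frame_size=256,
).images

for i, frames in enumerate(images):
    export_to_gif(frames, f"cupcake_variant_{i}.gif")
```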
### Image-to-Image generation
You can use [`ShapEImg2ImgPipeline`] along with other text-to-image pipelines in diffusers and turn your 2D generation into 3D.
In this example, we will first generate a cheeseburger with the simple prompt "A cheeseburger, white background".
```python
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
prompt = "A cheeseburger, white background"
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
image = t2i_pipe(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
).images[0]
image.save("burger.png")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png)
We will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger.
```python
from PIL import Image
from diffusers.utils import export_to_gif
repo = "openai/shap-e-img2img"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
guidance_scale = 3.0
image = Image.open("burger.png").resize((256, 256))
images = pipe(
    image,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
).images
gif_path = export_to_gif(images[0], "burger_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif)
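If you already have a 2D image, you can also feed it to [`ShapEImg2ImgPipeline`] directly. A sketch using `load_image` with the corgi example image referenced in this PR's docstrings (output file name is just illustrative):
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_gif, load_image

pipe = DiffusionPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = load_image(
    "https://hf.co/datasets/diffusers/docs-images/resolve/main/shap-e/corgi.png"
).convert("RGB")

images = pipe(
    image,
    guidance_scale=3.0,
    num_inference_steps=64,
    frame_size=256,
).images
export_to_gif(images[0], "corgi_3d.gif")
```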
## ShapEPipeline
[[autodoc]] ShapEPipeline
- all
- __call__
## ShapEImg2ImgPipeline
[[autodoc]] ShapEImg2ImgPipeline
- all
- __call__
\ No newline at end of file
This diff is collapsed.
@@ -149,6 +149,8 @@ else:
        LDMTextToImagePipeline,
        PaintByExamplePipeline,
        SemanticStableDiffusionPipeline,
        ShapEImg2ImgPipeline,
        ShapEPipeline,
        StableDiffusionAttendAndExcitePipeline,
        StableDiffusionControlNetImg2ImgPipeline,
        StableDiffusionControlNetInpaintPipeline,
...
@@ -34,14 +34,33 @@ class PriorTransformer(ModelMixin, ConfigMixin):
        num_attention_heads (`int`, *optional*, defaults to 32): The number of heads to use for multi-head attention.
        attention_head_dim (`int`, *optional*, defaults to 64): The number of channels in each head.
        num_layers (`int`, *optional*, defaults to 20): The number of layers of Transformer blocks to use.
        embedding_dim (`int`, *optional*, defaults to 768): The dimension of the model input `hidden_states`
        num_embeddings (`int`, *optional*, defaults to 77):
            The number of embeddings of the model input `hidden_states`
        additional_embeddings (`int`, *optional*, defaults to 4): The number of additional tokens appended to the
            projected `hidden_states`. The actual length of the used `hidden_states` is `num_embeddings +
            additional_embeddings`.
        dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
        time_embed_act_fn (`str`, *optional*, defaults to 'silu'):
            The activation function to use to create timestep embeddings.
        norm_in_type (`str`, *optional*, defaults to `None`): The normalization layer to apply on the hidden states
            before passing them to the Transformer blocks. Set it to `None` if normalization is not needed.
        embedding_proj_norm_type (`str`, *optional*, defaults to `None`):
            The normalization layer to apply on the input `proj_embedding`. Set it to `None` if normalization is not
            needed.
        encoder_hid_proj_type (`str`, *optional*, defaults to `linear`):
            The projection layer to apply on the input `encoder_hidden_states`. Set it to `None` if
            `encoder_hidden_states` is `None`.
        added_emb_type (`str`, *optional*, defaults to `prd`): Additional embeddings to condition the model.
            Choose from `prd` or `None`. If `prd` is chosen, a token indicating the (quantized) dot product between
            the text embedding and the image embedding, as proposed in the unCLIP paper
            (https://arxiv.org/abs/2204.06125), is prepended. If `None`, no additional embeddings are prepended.
        time_embed_dim (`int`, *optional*, defaults to `None`): The dimension of the timestep embeddings.
            If `None`, it will be set to `num_attention_heads * attention_head_dim`.
        embedding_proj_dim (`int`, *optional*, defaults to `None`):
            The dimension of `proj_embedding`. If `None`, it will be set to `embedding_dim`.
        clip_embed_dim (`int`, *optional*, defaults to `None`):
            The dimension of the output. If `None`, it will be set to `embedding_dim`.
    """

    @register_to_config
@@ -54,6 +73,14 @@ class PriorTransformer(ModelMixin, ConfigMixin):
        num_embeddings=77,
        additional_embeddings=4,
        dropout: float = 0.0,
        time_embed_act_fn: str = "silu",
        norm_in_type: Optional[str] = None,  # layer
        embedding_proj_norm_type: Optional[str] = None,  # layer
        encoder_hid_proj_type: Optional[str] = "linear",  # linear
        added_emb_type: Optional[str] = "prd",  # prd
        time_embed_dim: Optional[int] = None,
        embedding_proj_dim: Optional[int] = None,
        clip_embed_dim: Optional[int] = None,
    ):
        super().__init__()
        self.num_attention_heads = num_attention_heads
@@ -61,17 +88,41 @@ class PriorTransformer(ModelMixin, ConfigMixin):
        inner_dim = num_attention_heads * attention_head_dim
        self.additional_embeddings = additional_embeddings

        time_embed_dim = time_embed_dim or inner_dim
        embedding_proj_dim = embedding_proj_dim or embedding_dim
        clip_embed_dim = clip_embed_dim or embedding_dim

        self.time_proj = Timesteps(inner_dim, True, 0)
        self.time_embedding = TimestepEmbedding(inner_dim, time_embed_dim, out_dim=inner_dim, act_fn=time_embed_act_fn)

        self.proj_in = nn.Linear(embedding_dim, inner_dim)

        if embedding_proj_norm_type is None:
            self.embedding_proj_norm = None
        elif embedding_proj_norm_type == "layer":
            self.embedding_proj_norm = nn.LayerNorm(embedding_proj_dim)
        else:
            raise ValueError(f"unsupported embedding_proj_norm_type: {embedding_proj_norm_type}")

        self.embedding_proj = nn.Linear(embedding_proj_dim, inner_dim)

        if encoder_hid_proj_type is None:
            self.encoder_hidden_states_proj = None
        elif encoder_hid_proj_type == "linear":
            self.encoder_hidden_states_proj = nn.Linear(embedding_dim, inner_dim)
        else:
            raise ValueError(f"unsupported encoder_hid_proj_type: {encoder_hid_proj_type}")

        self.positional_embedding = nn.Parameter(torch.zeros(1, num_embeddings + additional_embeddings, inner_dim))

        if added_emb_type == "prd":
            self.prd_embedding = nn.Parameter(torch.zeros(1, 1, inner_dim))
        elif added_emb_type is None:
            self.prd_embedding = None
        else:
            raise ValueError(
                f"`added_emb_type`: {added_emb_type} is not supported. Make sure to choose one of `'prd'` or `None`."
            )

        self.transformer_blocks = nn.ModuleList(
            [
@@ -87,8 +138,16 @@ class PriorTransformer(ModelMixin, ConfigMixin):
            ]
        )

        if norm_in_type == "layer":
            self.norm_in = nn.LayerNorm(inner_dim)
        elif norm_in_type is None:
            self.norm_in = None
        else:
            raise ValueError(f"Unsupported norm_in_type: {norm_in_type}.")

        self.norm_out = nn.LayerNorm(inner_dim)
        self.proj_to_clip_embeddings = nn.Linear(inner_dim, clip_embed_dim)

        causal_attention_mask = torch.full(
            [num_embeddings + additional_embeddings, num_embeddings + additional_embeddings], -10000.0
@@ -97,8 +156,8 @@ class PriorTransformer(ModelMixin, ConfigMixin):
        causal_attention_mask = causal_attention_mask[None, ...]
        self.register_buffer("causal_attention_mask", causal_attention_mask, persistent=False)

        self.clip_mean = nn.Parameter(torch.zeros(1, clip_embed_dim))
        self.clip_std = nn.Parameter(torch.zeros(1, clip_embed_dim))

    @property
    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.attn_processors
@@ -172,7 +231,7 @@ class PriorTransformer(ModelMixin, ConfigMixin):
        hidden_states,
        timestep: Union[torch.Tensor, float, int],
        proj_embedding: torch.FloatTensor,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        attention_mask: Optional[torch.BoolTensor] = None,
        return_dict: bool = True,
    ):
@@ -217,23 +276,61 @@ class PriorTransformer(ModelMixin, ConfigMixin):
        timesteps_projected = timesteps_projected.to(dtype=self.dtype)
        time_embeddings = self.time_embedding(timesteps_projected)

        if self.embedding_proj_norm is not None:
            proj_embedding = self.embedding_proj_norm(proj_embedding)

        proj_embeddings = self.embedding_proj(proj_embedding)
        if self.encoder_hidden_states_proj is not None and encoder_hidden_states is not None:
            encoder_hidden_states = self.encoder_hidden_states_proj(encoder_hidden_states)
        elif self.encoder_hidden_states_proj is not None and encoder_hidden_states is None:
            raise ValueError("`encoder_hidden_states_proj` requires `encoder_hidden_states` to be set")

        hidden_states = self.proj_in(hidden_states)

        positional_embeddings = self.positional_embedding.to(hidden_states.dtype)

        additional_embeds = []
        additional_embeddings_len = 0

        if encoder_hidden_states is not None:
            additional_embeds.append(encoder_hidden_states)
            additional_embeddings_len += encoder_hidden_states.shape[1]

        if len(proj_embeddings.shape) == 2:
            proj_embeddings = proj_embeddings[:, None, :]

        if len(hidden_states.shape) == 2:
            hidden_states = hidden_states[:, None, :]

        additional_embeds = additional_embeds + [
            proj_embeddings,
            time_embeddings[:, None, :],
            hidden_states,
        ]

        if self.prd_embedding is not None:
            prd_embedding = self.prd_embedding.to(hidden_states.dtype).expand(batch_size, -1, -1)
            additional_embeds.append(prd_embedding)

        hidden_states = torch.cat(
            additional_embeds,
            dim=1,
        )

        # Allow positional_embedding to not include the `additional_embeddings` and instead pad it with zeros for these additional tokens
        additional_embeddings_len = additional_embeddings_len + proj_embeddings.shape[1] + 1
        if positional_embeddings.shape[1] < hidden_states.shape[1]:
            positional_embeddings = F.pad(
                positional_embeddings,
                (
                    0,
                    0,
                    additional_embeddings_len,
                    self.prd_embedding.shape[1] if self.prd_embedding is not None else 0,
                ),
                value=0.0,
            )

        hidden_states = hidden_states + positional_embeddings

        if attention_mask is not None:
@@ -242,11 +339,19 @@ class PriorTransformer(ModelMixin, ConfigMixin):
            attention_mask = (attention_mask[:, None, :] + self.causal_attention_mask).to(hidden_states.dtype)
            attention_mask = attention_mask.repeat_interleave(self.config.num_attention_heads, dim=0)

        if self.norm_in is not None:
            hidden_states = self.norm_in(hidden_states)

        for block in self.transformer_blocks:
            hidden_states = block(hidden_states, attention_mask=attention_mask)

        hidden_states = self.norm_out(hidden_states)

        if self.prd_embedding is not None:
            hidden_states = hidden_states[:, -1]
        else:
            hidden_states = hidden_states[:, additional_embeddings_len:]

        predicted_image_embedding = self.proj_to_clip_embeddings(hidden_states)

        if not return_dict:
...
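To make the new configuration surface concrete, here is a small sketch exercising the options added above: no `prd` token, a LayerNorm before the Transformer blocks, a pre-norm on the projected conditioning embedding, and no `encoder_hidden_states`. The sizes are toy values, not the actual Shap-E checkpoint configuration, and it assumes a diffusers build that includes this change:
```python
import torch
from diffusers.models import PriorTransformer

prior = PriorTransformer(
    num_attention_heads=2,
    attention_head_dim=8,
    num_layers=2,
    embedding_dim=16,
    num_embeddings=4,
    additional_embeddings=0,
    added_emb_type=None,               # no prd token: one prediction per input token
    norm_in_type="layer",              # LayerNorm before the transformer blocks
    embedding_proj_norm_type="layer",  # pre-norm on the projected conditioning embedding
    encoder_hid_proj_type=None,        # no encoder_hidden_states are used
)

sample = torch.randn(1, 4, 16)  # (batch, num_embeddings, embedding_dim)
cond = torch.randn(1, 16)       # conditioning embedding, e.g. a CLIP embedding
out = prior(sample, timestep=1, proj_embedding=cond).predicted_image_embedding
print(out.shape)                # torch.Size([1, 4, 16])
```
With `added_emb_type=None` the model returns one prediction per input token instead of only the last (`prd`) token, which is the behavior the Shap-E prior relies on.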
@@ -77,6 +77,7 @@ else:
    from .latent_diffusion import LDMTextToImagePipeline
    from .paint_by_example import PaintByExamplePipeline
    from .semantic_stable_diffusion import SemanticStableDiffusionPipeline
    from .shap_e import ShapEImg2ImgPipeline, ShapEPipeline
    from .stable_diffusion import (
        CycleDiffusionPipeline,
        StableDiffusionAttendAndExcitePipeline,
...
from ...utils import (
OptionalDependencyNotAvailable,
is_torch_available,
is_transformers_available,
is_transformers_version,
)
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import ShapEPipeline
else:
from .camera import create_pan_cameras
from .pipeline_shap_e import ShapEPipeline
from .pipeline_shap_e_img2img import ShapEImg2ImgPipeline
from .renderer import (
BoundingBoxVolume,
ImportanceRaySampler,
MLPNeRFModelOutput,
MLPNeRSTFModel,
ShapEParamsProjModel,
ShapERenderer,
StratifiedRaySampler,
VoidNeRFModel,
)
# Copyright 2023 Open AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import Tuple
import numpy as np
import torch
@dataclass
class DifferentiableProjectiveCamera:
"""
Implements a batch, differentiable, standard pinhole camera
"""
origin: torch.Tensor # [batch_size x 3]
x: torch.Tensor # [batch_size x 3]
y: torch.Tensor # [batch_size x 3]
z: torch.Tensor # [batch_size x 3]
width: int
height: int
x_fov: float
y_fov: float
shape: Tuple[int]
def __post_init__(self):
assert self.x.shape[0] == self.y.shape[0] == self.z.shape[0] == self.origin.shape[0]
assert self.x.shape[1] == self.y.shape[1] == self.z.shape[1] == self.origin.shape[1] == 3
assert len(self.x.shape) == len(self.y.shape) == len(self.z.shape) == len(self.origin.shape) == 2
def resolution(self):
return torch.from_numpy(np.array([self.width, self.height], dtype=np.float32))
def fov(self):
return torch.from_numpy(np.array([self.x_fov, self.y_fov], dtype=np.float32))
def get_image_coords(self) -> torch.Tensor:
"""
:return: coords of shape (width * height, 2)
"""
pixel_indices = torch.arange(self.height * self.width)
coords = torch.stack(
[
pixel_indices % self.width,
torch.div(pixel_indices, self.width, rounding_mode="trunc"),
],
axis=1,
)
return coords
@property
def camera_rays(self):
batch_size, *inner_shape = self.shape
inner_batch_size = int(np.prod(inner_shape))
coords = self.get_image_coords()
coords = torch.broadcast_to(coords.unsqueeze(0), [batch_size * inner_batch_size, *coords.shape])
rays = self.get_camera_rays(coords)
rays = rays.view(batch_size, inner_batch_size * self.height * self.width, 2, 3)
return rays
def get_camera_rays(self, coords: torch.Tensor) -> torch.Tensor:
batch_size, *shape, n_coords = coords.shape
assert n_coords == 2
assert batch_size == self.origin.shape[0]
flat = coords.view(batch_size, -1, 2)
res = self.resolution()
fov = self.fov()
fracs = (flat.float() / (res - 1)) * 2 - 1
fracs = fracs * torch.tan(fov / 2)
fracs = fracs.view(batch_size, -1, 2)
directions = (
self.z.view(batch_size, 1, 3)
+ self.x.view(batch_size, 1, 3) * fracs[:, :, :1]
+ self.y.view(batch_size, 1, 3) * fracs[:, :, 1:]
)
directions = directions / directions.norm(dim=-1, keepdim=True)
rays = torch.stack(
[
torch.broadcast_to(self.origin.view(batch_size, 1, 3), [batch_size, directions.shape[1], 3]),
directions,
],
dim=2,
)
return rays.view(batch_size, *shape, 2, 3)
def resize_image(self, width: int, height: int) -> "DifferentiableProjectiveCamera":
"""
Creates a new camera for the resized view assuming the aspect ratio does not change.
"""
assert width * self.height == height * self.width, "The aspect ratio should not change."
return DifferentiableProjectiveCamera(
origin=self.origin,
x=self.x,
y=self.y,
z=self.z,
width=width,
height=height,
x_fov=self.x_fov,
y_fov=self.y_fov,
)
def create_pan_cameras(size: int) -> DifferentiableProjectiveCamera:
origins = []
xs = []
ys = []
zs = []
for theta in np.linspace(0, 2 * np.pi, num=20):
z = np.array([np.sin(theta), np.cos(theta), -0.5])
z /= np.sqrt(np.sum(z**2))
origin = -z * 4
x = np.array([np.cos(theta), -np.sin(theta), 0.0])
y = np.cross(z, x)
origins.append(origin)
xs.append(x)
ys.append(y)
zs.append(z)
return DifferentiableProjectiveCamera(
origin=torch.from_numpy(np.stack(origins, axis=0)).float(),
x=torch.from_numpy(np.stack(xs, axis=0)).float(),
y=torch.from_numpy(np.stack(ys, axis=0)).float(),
z=torch.from_numpy(np.stack(zs, axis=0)).float(),
width=size,
height=size,
x_fov=0.7,
y_fov=0.7,
shape=(1, len(xs)),
)
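The renderer consumes rays produced by these cameras. A minimal usage sketch (assuming the import path exposed by the new `shap_e` module above) to inspect what `create_pan_cameras` returns:
```python
import torch
from diffusers.pipelines.shap_e import create_pan_cameras

# 20 cameras circling the object, one (origin, direction) ray pair per pixel of a 64x64 frame.
camera = create_pan_cameras(64)
rays = camera.camera_rays
print(rays.shape)  # torch.Size([1, 81920, 2, 3]) == (1, 20 * 64 * 64, 2, 3)

origins, directions = rays[..., 0, :], rays[..., 1, :]
print(torch.linalg.norm(directions, dim=-1).mean())  # ray directions are unit length, so ~1.0
```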
# Copyright 2023 Open AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from dataclasses import dataclass
from typing import List, Optional, Union
import numpy as np
import PIL
import torch
from transformers import CLIPTextModelWithProjection, CLIPTokenizer
from ...models import PriorTransformer
from ...pipelines import DiffusionPipeline
from ...schedulers import HeunDiscreteScheduler
from ...utils import (
BaseOutput,
is_accelerate_available,
is_accelerate_version,
logging,
randn_tensor,
replace_example_docstring,
)
from .renderer import ShapERenderer
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_gif
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> repo = "openai/shap-e"
>>> pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)
>>> guidance_scale = 15.0
>>> prompt = "a shark"
>>> images = pipe(
... prompt,
... guidance_scale=guidance_scale,
... num_inference_steps=64,
... frame_size=256,
... ).images
>>> gif_path = export_to_gif(images[0], "shark_3d.gif")
```
"""
@dataclass
class ShapEPipelineOutput(BaseOutput):
"""
Output class for ShapEPipeline.
Args:
images (`List[List[PIL.Image.Image]]` or `List[List[np.ndarray]]`):
A list of lists of rendered image frames for the generated 3D objects.
"""
images: Union[List[List[PIL.Image.Image]], List[List[np.ndarray]]]
class ShapEPipeline(DiffusionPipeline):
"""
Pipeline for generating the latent representation of a 3D asset from a text prompt and rendering it with the NeRF method, using Shap-E.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
Args:
prior ([`PriorTransformer`]):
The canonical unCLIP prior to approximate the image embedding from the text embedding.
text_encoder ([`CLIPTextModelWithProjection`]):
Frozen text-encoder.
tokenizer (`CLIPTokenizer`):
Tokenizer of class
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
scheduler ([`HeunDiscreteScheduler`]):
A scheduler to be used in combination with `prior` to generate image embedding.
renderer ([`ShapERenderer`]):
The Shap-E renderer projects the generated latents into parameters of an MLP that is used to create 3D objects
with the NeRF rendering method.
"""
def __init__(
self,
prior: PriorTransformer,
text_encoder: CLIPTextModelWithProjection,
tokenizer: CLIPTokenizer,
scheduler: HeunDiscreteScheduler,
renderer: ShapERenderer,
):
super().__init__()
self.register_modules(
prior=prior,
text_encoder=text_encoder,
tokenizer=tokenizer,
scheduler=scheduler,
renderer=renderer,
)
# Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
def prepare_latents(self, shape, dtype, device, generator, latents, scheduler):
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
if latents.shape != shape:
raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
latents = latents.to(device)
latents = latents * scheduler.init_noise_sigma
return latents
def enable_sequential_cpu_offload(self, gpu_id=0):
r"""
Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline's
models have their state dicts saved to CPU and then are moved to a `torch.device('meta')` and loaded to GPU only
when their specific submodule has its `forward` method called.
"""
if is_accelerate_available():
from accelerate import cpu_offload
else:
raise ImportError("Please install accelerate via `pip install accelerate`")
device = torch.device(f"cuda:{gpu_id}")
models = [self.text_encoder, self.prior]
for cpu_offloaded_model in models:
if cpu_offloaded_model is not None:
cpu_offload(cpu_offloaded_model, device)
def enable_model_cpu_offload(self, gpu_id=0):
r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward`
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `prior`.
"""
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook
else:
raise ImportError("`enable_model_cpu_offload` requires `accelerate v0.17.0` or higher.")
device = torch.device(f"cuda:{gpu_id}")
if self.device.type != "cpu":
self.to("cpu", silence_dtype_warnings=True)
torch.cuda.empty_cache() # otherwise we don't see the memory savings (but they probably exist)
hook = None
for cpu_offloaded_model in [self.text_encoder, self.prior, self.renderer]:
_, hook = cpu_offload_with_hook(cpu_offloaded_model, device, prev_module_hook=hook)
if getattr(self, "safety_checker", None) is not None:
_, hook = cpu_offload_with_hook(self.safety_checker, device, prev_module_hook=hook)
# We'll offload the last model manually.
self.final_offload_hook = hook
@property
def _execution_device(self):
r"""
Returns the device on which the pipeline's models will be executed. After calling
`pipeline.enable_sequential_cpu_offload()` the execution device can only be inferred from Accelerate's module
hooks.
"""
if self.device != torch.device("meta") or not hasattr(self.text_encoder, "_hf_hook"):
return self.device
for module in self.text_encoder.modules():
if (
hasattr(module, "_hf_hook")
and hasattr(module._hf_hook, "execution_device")
and module._hf_hook.execution_device is not None
):
return torch.device(module._hf_hook.execution_device)
return self.device
def _encode_prompt(
self,
prompt,
device,
num_images_per_prompt,
do_classifier_free_guidance,
):
len(prompt) if isinstance(prompt, list) else 1
# YiYi Notes: set pad_token_id to be 0, not sure why I can't set in the config file
self.tokenizer.pad_token_id = 0
# get prompt text embeddings
text_inputs = self.tokenizer(
prompt,
padding="max_length",
max_length=self.tokenizer.model_max_length,
truncation=True,
return_tensors="pt",
)
text_input_ids = text_inputs.input_ids
untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids
if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(text_input_ids, untruncated_ids):
removed_text = self.tokenizer.batch_decode(untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1])
logger.warning(
"The following part of your input was truncated because CLIP can only handle sequences up to"
f" {self.tokenizer.model_max_length} tokens: {removed_text}"
)
text_encoder_output = self.text_encoder(text_input_ids.to(device))
prompt_embeds = text_encoder_output.text_embeds
prompt_embeds = prompt_embeds.repeat_interleave(num_images_per_prompt, dim=0)
# in Shap-E the prompt_embeds are normalized here and rescaled later
prompt_embeds = prompt_embeds / torch.linalg.norm(prompt_embeds, dim=-1, keepdim=True)
if do_classifier_free_guidance:
negative_prompt_embeds = torch.zeros_like(prompt_embeds)
# For classifier free guidance, we need to do two forward passes.
# Here we concatenate the unconditional and text embeddings into a single batch
# to avoid doing two forward passes
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
# Rescale the features to have unit variance
prompt_embeds = math.sqrt(prompt_embeds.shape[1]) * prompt_embeds
return prompt_embeds
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
prompt: str,
num_images_per_prompt: int = 1,
num_inference_steps: int = 25,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
guidance_scale: float = 4.0,
frame_size: int = 64,
output_type: Optional[str] = "pil", # pil, np, latent
return_dict: bool = True,
):
"""
Function invoked when calling the pipeline for generation.
Args:
prompt (`str` or `List[str]`):
The prompt or prompts to guide the image generation.
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
num_inference_steps (`int`, *optional*, defaults to 25):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
guidance_scale (`float`, *optional*, defaults to 4.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
frame_size (`int`, *optional*, defaults to 64):
The width and height of each image frame of the generated 3D output.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated images. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
(`np.array`) or `"latent"` (the raw 3D latents).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`ShapEPipelineOutput`] instead of a plain tuple.
Examples:
Returns:
[`ShapEPipelineOutput`] or `tuple`
"""
if isinstance(prompt, str):
batch_size = 1
elif isinstance(prompt, list):
batch_size = len(prompt)
else:
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
device = self._execution_device
batch_size = batch_size * num_images_per_prompt
do_classifier_free_guidance = guidance_scale > 1.0
prompt_embeds = self._encode_prompt(prompt, device, num_images_per_prompt, do_classifier_free_guidance)
# prior
self.scheduler.set_timesteps(num_inference_steps, device=device)
timesteps = self.scheduler.timesteps
num_embeddings = self.prior.config.num_embeddings
embedding_dim = self.prior.config.embedding_dim
latents = self.prepare_latents(
(batch_size, num_embeddings * embedding_dim),
prompt_embeds.dtype,
device,
generator,
latents,
self.scheduler,
)
# YiYi notes: for testing only to match ldm, we can directly create a latents with desired shape: batch_size, num_embeddings, embedding_dim
latents = latents.reshape(latents.shape[0], num_embeddings, embedding_dim)
for i, t in enumerate(self.progress_bar(timesteps)):
# expand the latents if we are doing classifier free guidance
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
scaled_model_input = self.scheduler.scale_model_input(latent_model_input, t)
noise_pred = self.prior(
scaled_model_input,
timestep=t,
proj_embedding=prompt_embeds,
).predicted_image_embedding
# remove the variance
noise_pred, _ = noise_pred.split(
scaled_model_input.shape[2], dim=2
) # batch_size, num_embeddings, embedding_dim
if do_classifier_free_guidance:
noise_pred_uncond, noise_pred = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)
latents = self.scheduler.step(
noise_pred,
timestep=t,
sample=latents,
).prev_sample
if output_type == "latent":
return ShapEPipelineOutput(images=latents)
images = []
for i, latent in enumerate(latents):
image = self.renderer.decode(
latent[None, :],
device,
size=frame_size,
ray_batch_size=4096,
n_coarse_samples=64,
n_fine_samples=128,
)
images.append(image)
images = torch.stack(images)
if output_type not in ["np", "pil"]:
raise ValueError(f"Only the output types `pil` and `np` are supported not output_type={output_type}")
images = images.cpu().numpy()
if output_type == "pil":
images = [self.numpy_to_pil(image) for image in images]
# Offload last model to CPU
if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
self.final_offload_hook.offload()
if not return_dict:
return (images,)
return ShapEPipelineOutput(images=images)
# Copyright 2023 Open AI and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import List, Optional, Union
import numpy as np
import PIL
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel
from ...models import PriorTransformer
from ...pipelines import DiffusionPipeline
from ...schedulers import HeunDiscreteScheduler
from ...utils import (
BaseOutput,
is_accelerate_available,
logging,
randn_tensor,
replace_example_docstring,
)
from .renderer import ShapERenderer
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> from PIL import Image
>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_gif, load_image
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> repo = "openai/shap-e-img2img"
>>> pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
>>> pipe = pipe.to(device)
>>> guidance_scale = 3.0
>>> image_url = "https://hf.co/datasets/diffusers/docs-images/resolve/main/shap-e/corgi.png"
>>> image = load_image(image_url).convert("RGB")
>>> images = pipe(
... image,
... guidance_scale=guidance_scale,
... num_inference_steps=64,
... frame_size=256,
... ).images
>>> gif_path = export_to_gif(images[0], "corgi_3d.gif")
```
"""
@dataclass
class ShapEPipelineOutput(BaseOutput):
"""
Output class for [`ShapEImg2ImgPipeline`].
Args:
images (`PIL.Image.Image` or `np.ndarray`):
The rendered image frames for the generated 3D objects.
"""
images: Union[PIL.Image.Image, np.ndarray]
class ShapEImg2ImgPipeline(DiffusionPipeline):
"""
Pipeline for generating the latent representation of a 3D asset from an image and rendering it with the NeRF method, using Shap-E.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
Args:
prior ([`PriorTransformer`]):
The canonical unCLIP prior to approximate the latents from the image embedding.
image_encoder ([`CLIPVisionModel`]):
Frozen image-encoder.
image_processor (`CLIPImageProcessor`):
A [`CLIPImageProcessor`] to process images before they are encoded.
scheduler ([`HeunDiscreteScheduler`]):
A scheduler to be used in combination with the `prior` to generate the latents.
renderer ([`ShapERenderer`]):
The Shap-E renderer projects the generated latents into parameters of an MLP that is used to create 3D
objects with the NeRF rendering method.
"""
def __init__(
self,
prior: PriorTransformer,
image_encoder: CLIPVisionModel,
image_processor: CLIPImageProcessor,
scheduler: HeunDiscreteScheduler,
renderer: ShapERenderer,
):
super().__init__()
self.register_modules(
prior=prior,
image_encoder=image_encoder,
image_processor=image_processor,
scheduler=scheduler,
renderer=renderer,
)
# Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline.prepare_latents
def prepare_latents(self, shape, dtype, device, generator, latents, scheduler):
if latents is None:
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
else:
if latents.shape != shape:
raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
latents = latents.to(device)
latents = latents * scheduler.init_noise_sigma
return latents
def enable_sequential_cpu_offload(self, gpu_id=0):
r"""
Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the pipeline's
models have their state dicts saved to CPU and then are moved to a `torch.device('meta')` and loaded to GPU only
when their specific submodule has its `forward` method called.
"""
if is_accelerate_available():
from accelerate import cpu_offload
else:
raise ImportError("Please install accelerate via `pip install accelerate`")
device = torch.device(f"cuda:{gpu_id}")
models = [self.image_encoder, self.prior]
for cpu_offloaded_model in models:
if cpu_offloaded_model is not None:
cpu_offload(cpu_offloaded_model, device)
@property
def _execution_device(self):
r"""
Returns the device on which the pipeline's models will be executed. After calling
`pipeline.enable_sequential_cpu_offload()` the execution device can only be inferred from Accelerate's module
hooks.
"""
if self.device != torch.device("meta") or not hasattr(self.image_encoder, "_hf_hook"):
return self.device
for module in self.image_encoder.modules():
if (
hasattr(module, "_hf_hook")
and hasattr(module._hf_hook, "execution_device")
and module._hf_hook.execution_device is not None
):
return torch.device(module._hf_hook.execution_device)
return self.device
def _encode_image(
self,
image,
device,
num_images_per_prompt,
do_classifier_free_guidance,
):
if isinstance(image, List) and isinstance(image[0], torch.Tensor):
image = torch.cat(image, axis=0) if image[0].ndim == 4 else torch.stack(image, axis=0)
if not isinstance(image, torch.Tensor):
image = self.image_processor(image, return_tensors="pt").pixel_values[0].unsqueeze(0)
image = image.to(dtype=self.image_encoder.dtype, device=device)
image_embeds = self.image_encoder(image)["last_hidden_state"]
image_embeds = image_embeds[:, 1:, :].contiguous()  # drop the CLS token: batch_size, 256, dim
image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)
if do_classifier_free_guidance:
negative_image_embeds = torch.zeros_like(image_embeds)
# For classifier free guidance, we need to do two forward passes.
# Here we concatenate the unconditional and text embeddings into a single batch
# to avoid doing two forward passes
image_embeds = torch.cat([negative_image_embeds, image_embeds])
return image_embeds
@torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
self,
image: Union[PIL.Image.Image, List[PIL.Image.Image]],
num_images_per_prompt: int = 1,
num_inference_steps: int = 25,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
guidance_scale: float = 4.0,
frame_size: int = 64,
output_type: Optional[str] = "pil", # pil, np, latent
return_dict: bool = True,
):
"""
Function invoked when calling the pipeline for generation.
Args:
image (`PIL.Image.Image`, `List[PIL.Image.Image]`, or `torch.Tensor`):
The image or images to condition the 3D generation on.
num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
num_inference_steps (`int`, *optional*, defaults to 25):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
to make generation deterministic.
latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`.
guidance_scale (`float`, *optional*, defaults to 4.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
`guidance_scale` is defined as `w` of equation 2. of [Imagen
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages the model to generate images that are closely linked to the conditioning `image`,
usually at the expense of lower image quality.
frame_size (`int`, *optional*, defaults to 64):
The width and height of each image frame of the generated 3D output.
output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated images. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
(`np.array`) or `"latent"` (the raw 3D latents).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`ShapEPipelineOutput`] instead of a plain tuple.
Examples:
Returns:
[`ShapEPipelineOutput`] or `tuple`
"""
if isinstance(image, PIL.Image.Image):
batch_size = 1
elif isinstance(image, torch.Tensor):
batch_size = image.shape[0]
elif isinstance(image, list) and isinstance(image[0], (torch.Tensor, PIL.Image.Image)):
batch_size = len(image)
else:
raise ValueError(
f"`image` has to be of type `PIL.Image.Image`, `torch.Tensor`, `List[PIL.Image.Image]` or `List[torch.Tensor]` but is {type(image)}"
)
device = self._execution_device
batch_size = batch_size * num_images_per_prompt
do_classifier_free_guidance = guidance_scale > 1.0
image_embeds = self._encode_image(image, device, num_images_per_prompt, do_classifier_free_guidance)
# prior
self.scheduler.set_timesteps(num_inference_steps, device=device)
timesteps = self.scheduler.timesteps
num_embeddings = self.prior.config.num_embeddings
embedding_dim = self.prior.config.embedding_dim
latents = self.prepare_latents(
(batch_size, num_embeddings * embedding_dim),
image_embeds.dtype,
device,
generator,
latents,
self.scheduler,
)
# YiYi notes: for testing only to match ldm, we can directly create a latents with desired shape: batch_size, num_embeddings, embedding_dim
latents = latents.reshape(latents.shape[0], num_embeddings, embedding_dim)
for i, t in enumerate(self.progress_bar(timesteps)):
# expand the latents if we are doing classifier free guidance
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
scaled_model_input = self.scheduler.scale_model_input(latent_model_input, t)
noise_pred = self.prior(
scaled_model_input,
timestep=t,
proj_embedding=image_embeds,
).predicted_image_embedding
# remove the variance
noise_pred, _ = noise_pred.split(
scaled_model_input.shape[2], dim=2
) # batch_size, num_embeddings, embedding_dim
if do_classifier_free_guidance:
noise_pred_uncond, noise_pred = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred - noise_pred_uncond)
latents = self.scheduler.step(
noise_pred,
timestep=t,
sample=latents,
).prev_sample
if output_type == "latent":
return ShapEPipelineOutput(images=latents)
images = []
for i, latent in enumerate(latents):
image = self.renderer.decode(
latent[None, :],
device,
size=frame_size,
ray_batch_size=4096,
n_coarse_samples=64,
n_fine_samples=128,
)
images.append(image)
images = torch.stack(images)
if output_type not in ["np", "pil"]:
raise ValueError(f"Only the output types `pil` and `np` are supported not output_type={output_type}")
images = images.cpu().numpy()
if output_type == "pil":
images = [self.numpy_to_pil(image) for image in images]
# Offload last model to CPU
if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
self.final_offload_hook.offload()
if not return_dict:
return (images,)
return ShapEPipelineOutput(images=images)
This diff is collapsed.
@@ -47,7 +47,11 @@ class DDIMSchedulerOutput(BaseOutput):

 # Copied from diffusers.schedulers.scheduling_ddpm.betas_for_alpha_bar
-def betas_for_alpha_bar(num_diffusion_timesteps, max_beta=0.999) -> torch.Tensor:
+def betas_for_alpha_bar(
+    num_diffusion_timesteps,
+    max_beta=0.999,
+    alpha_transform_type="cosine",
+):
     """
     Create a beta schedule that discretizes the given alpha_t_bar function, which defines the cumulative product of
     (1-beta) over time from t = [0,1].
@@ -60,19 +64,30 @@ def betas_for_alpha_bar(num_diffusion_timesteps, max_beta=0.999) -> torch.Tensor
         num_diffusion_timesteps (`int`): the number of betas to produce.
         max_beta (`float`): the maximum beta to use; use values lower than 1 to
                      prevent singularities.
+        alpha_transform_type (`str`, *optional*, defaults to `cosine`): the type of noise schedule for alpha_bar.
+            Choose from `cosine` or `exp`

     Returns:
         betas (`np.ndarray`): the betas used by the scheduler to step the model outputs
     """
+    if alpha_transform_type == "cosine":

-    def alpha_bar(time_step):
-        return math.cos((time_step + 0.008) / 1.008 * math.pi / 2) ** 2
+        def alpha_bar_fn(t):
+            return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
+
+    elif alpha_transform_type == "exp":
+
+        def alpha_bar_fn(t):
+            return math.exp(t * -12.0)
+
+    else:
+        raise ValueError(f"Unsupported alpha_transform_type: {alpha_transform_type}")

     betas = []
     for i in range(num_diffusion_timesteps):
         t1 = i / num_diffusion_timesteps
         t2 = (i + 1) / num_diffusion_timesteps
-        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
+        betas.append(min(1 - alpha_bar_fn(t2) / alpha_bar_fn(t1), max_beta))
     return torch.tensor(betas, dtype=torch.float32)
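To make the effect of the new option concrete, here is a small self-contained sketch of the two transforms; the helper names below are local to this example and not part of the diff.

import math

def alpha_bar(t, alpha_transform_type="cosine"):
    # mirrors the two branches added above
    if alpha_transform_type == "cosine":
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
    return math.exp(t * -12.0)  # "exp": alpha_bar decays exponentially in t

def betas(num_steps, max_beta=0.999, alpha_transform_type="cosine"):
    out = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        out.append(min(1 - alpha_bar(t2, alpha_transform_type) / alpha_bar(t1, alpha_transform_type), max_beta))
    return out

# for the same step count the exp schedule adds noise much faster early on,
# e.g. with 5 steps the first beta is ~0.10 (cosine) vs ~0.91 (exp)
print(betas(5)[0], betas(5, alpha_transform_type="exp")[0])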
Two more scheduler files — one with the same `class DDIMSchedulerOutput` context (hunks @@ -46,7 +46,11 @@ and @@ -59,19 +63,30 @@) and one with `class DDIMParallelSchedulerOutput` (hunks @@ -47,7 +47,11 @@ and @@ -60,19 +64,30 @@) — receive the identical `# Copied from diffusers.schedulers.scheduling_ddpm.betas_for_alpha_bar` update shown above: the new `alpha_transform_type="cosine"` argument, the documented `cosine`/`exp` choice, and the `alpha_bar` to `alpha_bar_fn` rename.
The DDPM scheduler itself, where `betas_for_alpha_bar` is defined (context `class DDPMSchedulerOutput`, hunks @@ -44,7 +44,11 @@ and @@ -57,19 +61,30 @@), gets the same change; this is the source definition (no `# Copied from` marker) that the other schedulers mirror.
The file with `class DDPMParallelSchedulerOutput` (hunks @@ -46,7 +46,11 @@ and @@ -59,19 +63,30 @@) receives the identical `# Copied from ... betas_for_alpha_bar` update.
Three further scheduler files (each showing the `from .scheduling_utils import KarrasDiffusionSchedulers, SchedulerMixin, ...` import context; hunks @@ -26,7 +26,11 @@ and @@ -39,19 +43,30 @@ in each) receive the same `# Copied from ... betas_for_alpha_bar` update described above.
The `DPMSolverSDEScheduler` file receives the same `betas_for_alpha_bar` update plus the new index-counter logic:

@@ -13,6 +13,7 @@
 # limitations under the License.

 import math
+from collections import defaultdict
 from typing import List, Optional, Tuple, Union

 import numpy as np
@@ -76,7 +77,11 @@ class BrownianTreeNoiseSampler:
(the `# Copied from diffusers.schedulers.scheduling_ddpm.betas_for_alpha_bar` function is updated exactly as shown above)
@@ -190,10 +206,16 @@ class DPMSolverSDEScheduler(SchedulerMixin, ConfigMixin):
         indices = (schedule_timesteps == timestep).nonzero()

-        if self.state_in_first_order:
-            pos = -1
+        # The sigma index that is taken for the **very** first `step`
+        # is always the second index (or the last index if there is only 1)
+        # This way we can ensure we don't accidentally skip a sigma in
+        # case we start in the middle of the denoising schedule (e.g. for image-to-image)
+        if len(self._index_counter) == 0:
+            pos = 1 if len(indices) > 1 else 0
         else:
-            pos = 0
+            timestep_int = timestep.cpu().item() if torch.is_tensor(timestep) else timestep
+            pos = self._index_counter[timestep_int]

         return indices[pos].item()

     @property
@@ -292,6 +314,10 @@ class DPMSolverSDEScheduler(SchedulerMixin, ConfigMixin):
         self.sample = None
         self.mid_point_sigma = None

+        # for exp beta schedules, such as the one for `pipeline_shap_e.py`
+        # we need an index counter
+        self._index_counter = defaultdict(int)
+
     def _second_order_timesteps(self, sigmas, log_sigmas):
         def sigma_fn(_t):
             return np.exp(-_t)
@@ -373,6 +399,10 @@ class DPMSolverSDEScheduler(SchedulerMixin, ConfigMixin):
         """
         step_index = self.index_for_timestep(timestep)

+        # advance index counter by 1
+        timestep_int = timestep.cpu().item() if torch.is_tensor(timestep) else timestep
+        self._index_counter[timestep_int] += 1
+
         # Create a noise sampler if it hasn't been created yet
         if self.noise_sampler is None:
             min_sigma, max_sigma = self.sigmas[self.sigmas > 0].min(), self.sigmas.max()
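To illustrate why the very first lookup takes the second matching index, a standalone sketch of the new lookup; the variable names here are local to the example, not part of the diff.

# Standalone sketch of the index_for_timestep logic above (example-local names).
from collections import defaultdict

import torch

# second-order schedulers interleave timesteps, so most values appear twice
schedule_timesteps = torch.tensor([801, 601, 601, 401, 401, 201, 201, 1])
index_counter = defaultdict(int)

def index_for_timestep(timestep):
    indices = (schedule_timesteps == timestep).nonzero()
    if len(index_counter) == 0:
        # very first step(): take the second match so no sigma is skipped
        # when denoising starts mid-schedule (e.g. image-to-image)
        pos = 1 if len(indices) > 1 else 0
    else:
        pos = index_counter[int(timestep)]
    return indices[pos].item()

print(index_for_timestep(801))  # -> 0 (single match, counter still empty)
index_counter[801] += 1         # step() advances the counter after every call
print(index_for_timestep(601))  # -> 1 (counter[601] is still 0, first match)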