Unverified Commit 4a343077, authored by Sayak Paul and committed by GitHub

add: utility to format our docs too 📜 (#7314)

* add: utility to format our docs too 📜

* debugging saga

* fix: message

* checking

* should be fixed.

* revert pipeline_fixture

* remove empty line

* make style

* fix: setup.py

* style.
parent 8e963d1c
@@ -511,8 +511,8 @@ class UNet3DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin)
     # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
     def fuse_qkv_projections(self):
         """
-        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query,
-        key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
+        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
+        are fused. For cross-attention modules, key and value projection matrices are fused.
         <Tip warning={true}>
...
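For orientation, `fuse_qkv_projections` (whose docstring is merely rewrapped in the hunks above and below) is a public model method. A minimal usage sketch follows; the checkpoint name, device, and the matching `unfuse_qkv_projections` call are assumptions for illustration and are not part of this diff.

```py
import torch
from diffusers import DiffusionPipeline

# Assumed checkpoint; any pipeline whose UNet exposes fuse_qkv_projections() is used the same way.
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Fuse the query/key/value projection matrices before running inference.
pipe.unet.fuse_qkv_projections()
image = pipe("a photo of an astronaut riding a horse on mars").images[0]

# Restore the original, unfused projections afterwards (e.g. before saving the model).
pipe.unet.unfuse_qkv_projections()
```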
@@ -99,8 +99,8 @@ class I2VGenXLTransformerTemporalEncoder(nn.Module):
 class I2VGenXLUNet(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
     r"""
-    I2VGenXL UNet. It is a conditional 3D UNet model that takes a noisy sample, conditional state, and a timestep
-    and returns a sample-shaped output.
+    I2VGenXL UNet. It is a conditional 3D UNet model that takes a noisy sample, conditional state, and a timestep and
+    returns a sample-shaped output.
     This model inherits from [`ModelMixin`]. Check the superclass documentation for it's generic methods implemented
     for all models (such as downloading or saving).
@@ -477,8 +477,8 @@ class I2VGenXLUNet(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
     # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
     def fuse_qkv_projections(self):
         """
-        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query,
-        key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
+        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
+        are fused. For cross-attention modules, key and value projection matrices are fused.
         <Tip warning={true}>
@@ -533,7 +533,8 @@ class I2VGenXLUNet(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
             timestep (`torch.FloatTensor` or `float` or `int`): The number of timesteps to denoise an input.
             fps (`torch.Tensor`): Frames per second for the video being generated. Used as a "micro-condition".
             image_latents (`torch.FloatTensor`): Image encodings from the VAE.
-            image_embeddings (`torch.FloatTensor`): Projection embeddings of the conditioning image computed with a vision encoder.
+            image_embeddings (`torch.FloatTensor`):
+                Projection embeddings of the conditioning image computed with a vision encoder.
             encoder_hidden_states (`torch.FloatTensor`):
                 The encoder hidden states with shape `(batch, sequence_length, feature_dim)`.
             cross_attention_kwargs (`dict`, *optional*):
...
@@ -709,8 +709,8 @@ class UNetMotionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
     # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
     def fuse_qkv_projections(self):
         """
-        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query,
-        key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
+        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
+        are fused. For cross-attention modules, key and value projection matrices are fused.
         <Tip warning={true}>
...
@@ -31,8 +31,8 @@ class UNetSpatioTemporalConditionOutput(BaseOutput):
 class UNetSpatioTemporalConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
     r"""
-    A conditional Spatio-Temporal UNet model that takes a noisy video frames, conditional state, and a timestep and returns a sample
-    shaped output.
+    A conditional Spatio-Temporal UNet model that takes a noisy video frames, conditional state, and a timestep and
+    returns a sample shaped output.
     This model inherits from [`ModelMixin`]. Check the superclass documentation for it's generic methods implemented
     for all models (such as downloading or saving).
@@ -57,7 +57,8 @@ class UNetSpatioTemporalConditionModel(ModelMixin, ConfigMixin, UNet2DConditionL
             The dimension of the cross attention features.
         transformer_layers_per_block (`int`, `Tuple[int]`, or `Tuple[Tuple]` , *optional*, defaults to 1):
             The number of transformer blocks of type [`~models.attention.BasicTransformerBlock`]. Only relevant for
-            [`~models.unet_3d_blocks.CrossAttnDownBlockSpatioTemporal`], [`~models.unet_3d_blocks.CrossAttnUpBlockSpatioTemporal`],
+            [`~models.unet_3d_blocks.CrossAttnDownBlockSpatioTemporal`],
+            [`~models.unet_3d_blocks.CrossAttnUpBlockSpatioTemporal`],
             [`~models.unet_3d_blocks.UNetMidBlockSpatioTemporal`].
         num_attention_heads (`int`, `Tuple[int]`, defaults to `(5, 10, 10, 20)`):
             The number of attention heads.
@@ -374,12 +375,12 @@ class UNetSpatioTemporalConditionModel(ModelMixin, ConfigMixin, UNet2DConditionL
                 The additional time ids with shape `(batch, num_additional_ids)`. These are encoded with sinusoidal
                 embeddings and added to the time embeddings.
             return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~models.unet_slatio_temporal.UNetSpatioTemporalConditionOutput`] instead of a plain
-                tuple.
+                Whether or not to return a [`~models.unet_slatio_temporal.UNetSpatioTemporalConditionOutput`] instead
+                of a plain tuple.
         Returns:
             [`~models.unet_slatio_temporal.UNetSpatioTemporalConditionOutput`] or `tuple`:
-                If `return_dict` is True, an [`~models.unet_slatio_temporal.UNetSpatioTemporalConditionOutput`] is returned, otherwise
-                a `tuple` is returned where the first element is the sample tensor.
+                If `return_dict` is True, an [`~models.unet_slatio_temporal.UNetSpatioTemporalConditionOutput`] is
+                returned, otherwise a `tuple` is returned where the first element is the sample tensor.
         """
         # 1. time
         timesteps = timestep
...
@@ -186,7 +186,8 @@ class StableCascadeUNet(ModelMixin, ConfigMixin, FromOriginalUNetMixin):
         block_out_channels (Tuple[int], defaults to (2048, 2048)):
             Tuple of output channels for each block.
         num_attention_heads (Tuple[int], defaults to (32, 32)):
-            Number of attention heads in each attention block. Set to -1 to if block types in a layer do not have attention.
+            Number of attention heads in each attention block. Set to -1 to if block types in a layer do not have
+            attention.
         down_num_layers_per_block (Tuple[int], defaults to [8, 24]):
             Number of layers in each down block.
         up_num_layers_per_block (Tuple[int], defaults to [24, 8]):
@@ -197,10 +198,9 @@ class StableCascadeUNet(ModelMixin, ConfigMixin, FromOriginalUNetMixin):
             Number of 1x1 Convolutional layers to repeat in each up block.
         block_types_per_layer (Tuple[Tuple[str]], optional,
             defaults to (
-                ("SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"),
-                ("SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock")
-            ):
-            Block types used in each layer of the up/down blocks.
+                ("SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"), ("SDCascadeResBlock",
+                "SDCascadeTimestepBlock", "SDCascadeAttnBlock")
+            ): Block types used in each layer of the up/down blocks.
         clip_text_in_channels (`int`, *optional*, defaults to `None`):
             Number of input channels for CLIP based text conditioning.
         clip_text_pooled_in_channels (`int`, *optional*, defaults to 1280):
...
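Purely for reference, the defaults spelled out in the `StableCascadeUNet` docstring above can be written down as keyword arguments. The dict below only restates the documented values touched by this hunk; it is not a complete or sufficient constructor configuration.

```py
# Documented defaults from the StableCascadeUNet docstring above (other constructor
# arguments are intentionally omitted, so this dict alone cannot instantiate the model).
stable_cascade_unet_defaults = {
    "block_out_channels": (2048, 2048),
    "num_attention_heads": (32, 32),  # use -1 for layers whose block types have no attention
    "down_num_layers_per_block": (8, 24),
    "up_num_layers_per_block": (24, 8),
    "block_types_per_layer": (
        ("SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"),
        ("SDCascadeResBlock", "SDCascadeTimestepBlock", "SDCascadeAttnBlock"),
    ),
    "clip_text_pooled_in_channels": 1280,
}
```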
@@ -30,9 +30,7 @@ EXAMPLE_DOC_STRING = """
         >>> import torch
         >>> from diffusers import AmusedPipeline
-        >>> pipe = AmusedPipeline.from_pretrained(
-        ...     "amused/amused-512", variant="fp16", torch_dtype=torch.float16
-        ... )
+        >>> pipe = AmusedPipeline.from_pretrained("amused/amused-512", variant="fp16", torch_dtype=torch.float16)
         >>> pipe = pipe.to("cuda")
         >>> prompt = "a photo of an astronaut riding a horse on mars"
@@ -150,10 +148,12 @@ class AmusedPipeline(DiffusionPipeline):
                 A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                 [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
             micro_conditioning_aesthetic_score (`int`, *optional*, defaults to 6):
-                The targeted aesthetic score according to the laion aesthetic classifier. See https://laion.ai/blog/laion-aesthetics/
-                and the micro-conditioning section of https://arxiv.org/abs/2307.01952.
+                The targeted aesthetic score according to the laion aesthetic classifier. See
+                https://laion.ai/blog/laion-aesthetics/ and the micro-conditioning section of
+                https://arxiv.org/abs/2307.01952.
             micro_conditioning_crop_coord (`Tuple[int]`, *optional*, defaults to (0, 0)):
-                The targeted height, width crop coordinates. See the micro-conditioning section of https://arxiv.org/abs/2307.01952.
+                The targeted height, width crop coordinates. See the micro-conditioning section of
+                https://arxiv.org/abs/2307.01952.
             temperature (`Union[int, Tuple[int, int], List[int]]`, *optional*, defaults to (2, 0)):
                 Configures the temperature scheduler on `self.scheduler` see `AmusedScheduler#set_timesteps`.
...
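The micro-conditioning arguments being rewrapped here are ordinary call arguments on the aMUSEd pipelines. A hedged sketch, reusing the checkpoint from the example docstring above and passing the documented defaults explicitly (the prompt and device are illustrative):

```py
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained("amused/amused-512", variant="fp16", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# The keyword values below are simply the defaults documented in the docstring above.
image = pipe(
    "a photo of an astronaut riding a horse on mars",
    micro_conditioning_aesthetic_score=6,  # target score for the LAION aesthetic classifier
    micro_conditioning_crop_coord=(0, 0),  # targeted (height, width) crop coordinates
    temperature=(2, 0),                    # forwarded to AmusedScheduler.set_timesteps
).images[0]
```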
@@ -167,10 +167,12 @@ class AmusedImg2ImgPipeline(DiffusionPipeline):
                 A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                 [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
             micro_conditioning_aesthetic_score (`int`, *optional*, defaults to 6):
-                The targeted aesthetic score according to the laion aesthetic classifier. See https://laion.ai/blog/laion-aesthetics/
-                and the micro-conditioning section of https://arxiv.org/abs/2307.01952.
+                The targeted aesthetic score according to the laion aesthetic classifier. See
+                https://laion.ai/blog/laion-aesthetics/ and the micro-conditioning section of
+                https://arxiv.org/abs/2307.01952.
             micro_conditioning_crop_coord (`Tuple[int]`, *optional*, defaults to (0, 0)):
-                The targeted height, width crop coordinates. See the micro-conditioning section of https://arxiv.org/abs/2307.01952.
+                The targeted height, width crop coordinates. See the micro-conditioning section of
+                https://arxiv.org/abs/2307.01952.
             temperature (`Union[int, Tuple[int, int], List[int]]`, *optional*, defaults to (2, 0)):
                 Configures the temperature scheduler on `self.scheduler` see `AmusedScheduler#set_timesteps`.
...
@@ -191,10 +191,12 @@ class AmusedInpaintPipeline(DiffusionPipeline):
                 A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                 [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
             micro_conditioning_aesthetic_score (`int`, *optional*, defaults to 6):
-                The targeted aesthetic score according to the laion aesthetic classifier. See https://laion.ai/blog/laion-aesthetics/
-                and the micro-conditioning section of https://arxiv.org/abs/2307.01952.
+                The targeted aesthetic score according to the laion aesthetic classifier. See
+                https://laion.ai/blog/laion-aesthetics/ and the micro-conditioning section of
+                https://arxiv.org/abs/2307.01952.
             micro_conditioning_crop_coord (`Tuple[int]`, *optional*, defaults to (0, 0)):
-                The targeted height, width crop coordinates. See the micro-conditioning section of https://arxiv.org/abs/2307.01952.
+                The targeted height, width crop coordinates. See the micro-conditioning section of
+                https://arxiv.org/abs/2307.01952.
             temperature (`Union[int, Tuple[int, int], List[int]]`, *optional*, defaults to (2, 0)):
                 Configures the temperature scheduler on `self.scheduler` see `AmusedScheduler#set_timesteps`.
...
@@ -639,10 +639,10 @@ class AnimateDiffPipeline(
             ip_adapter_image: (`PipelineImageInput`, *optional*):
                 Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
                 The output format of the generated video. Choose between `torch.FloatTensor`, `PIL.Image` or
                 `np.array`.
...
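The `ip_adapter_image` / `ip_adapter_image_embeds` arguments rewrapped above only take effect once an IP-Adapter has been loaded onto the pipeline. A rough sketch under assumed checkpoint names, adapter repo, and a placeholder reference image (none of them come from this diff), letting the pipeline compute the image embeddings itself:

```py
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter
from diffusers.utils import export_to_gif, load_image

# Assumed checkpoints and adapter weights; adjust to your setup.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

ip_image = load_image("ip_adapter_reference.png")  # placeholder reference image

# Passing `ip_adapter_image` lets the pipeline compute the embeddings itself; precomputed
# `ip_adapter_image_embeds` (shaped as documented above) could be passed instead.
output = pipe(prompt="panda playing a guitar, on a boat, in the ocean, high quality", ip_adapter_image=ip_image)
export_to_gif(output.frames[0], "animation.gif")
```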
@@ -52,14 +52,21 @@ EXAMPLE_DOC_STRING = """
         >>> from io import BytesIO
         >>> from PIL import Image
-        >>> adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
-        >>> pipe = AnimateDiffVideoToVideoPipeline.from_pretrained("SG161222/Realistic_Vision_V5.1_noVAE", motion_adapter=adapter).to("cuda")
-        >>> pipe.scheduler = DDIMScheduler(beta_schedule="linear", steps_offset=1, clip_sample=False, timespace_spacing="linspace")
+        >>> adapter = MotionAdapter.from_pretrained(
+        ...     "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
+        ... )
+        >>> pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(
+        ...     "SG161222/Realistic_Vision_V5.1_noVAE", motion_adapter=adapter
+        ... ).to("cuda")
+        >>> pipe.scheduler = DDIMScheduler(
+        ...     beta_schedule="linear", steps_offset=1, clip_sample=False, timespace_spacing="linspace"
+        ... )
         >>> def load_video(file_path: str):
         ...     images = []
-        ...     if file_path.startswith(('http://', 'https://')):
+        ...
+        ...     if file_path.startswith(("http://", "https://")):
         ...         # If the file_path is a URL
         ...         response = requests.get(file_path)
         ...         response.raise_for_status()
@@ -68,15 +75,20 @@ EXAMPLE_DOC_STRING = """
         ...     else:
         ...         # Assuming it's a local file path
         ...         vid = imageio.get_reader(file_path)
+        ...
         ...     for frame in vid:
         ...         pil_image = Image.fromarray(frame)
         ...         images.append(pil_image)
+        ...
         ...     return images
-        >>> video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif")
-        >>> output = pipe(video=video, prompt="panda playing a guitar, on a boat, in the ocean, high quality", strength=0.5)
+        >>> video = load_video(
+        ...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif"
+        ... )
+        >>> output = pipe(
+        ...     video=video, prompt="panda playing a guitar, on a boat, in the ocean, high quality", strength=0.5
+        ... )
         >>> frames = output.frames[0]
         >>> export_to_gif(frames, "animation.gif")
         ```
@@ -135,8 +147,8 @@ def retrieve_timesteps(
         scheduler (`SchedulerMixin`):
             The scheduler to get timesteps from.
         num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used,
-            `timesteps` must be `None`.
+            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
+            must be `None`.
         device (`str` or `torch.device`, *optional*):
             The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
         timesteps (`List[int]`, *optional*):
@@ -799,16 +811,15 @@ class AnimateDiffVideoToVideoPipeline(
             ip_adapter_image: (`PipelineImageInput`, *optional*):
                 Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
                 The output format of the generated video. Choose between `torch.FloatTensor`, `PIL.Image` or
                 `np.array`.
             return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`AnimateDiffPipelineOutput`] instead
-                of a plain tuple.
+                Whether or not to return a [`AnimateDiffPipelineOutput`] instead of a plain tuple.
             cross_attention_kwargs (`dict`, *optional*):
                 A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                 [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
...
@@ -15,7 +15,8 @@ class AnimateDiffPipelineOutput(BaseOutput):
     Args:
         frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
-            List of video outputs - It can be a nested list of length `batch_size,` with each sub-list containing denoised
+            List of video outputs - It can be a nested list of length `batch_size,` with each sub-list containing
+            denoised
             PIL image sequences of length `num_frames.` It can also be a NumPy array or Torch tensor of shape
             `(batch_size, num_frames, channels, height, width)`
     """
...
@@ -701,8 +701,8 @@ class AudioLDM2UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoad
         Returns:
             [`~models.unets.unet_2d_condition.UNet2DConditionOutput`] or `tuple`:
-                If `return_dict` is True, an [`~models.unets.unet_2d_condition.UNet2DConditionOutput`] is returned, otherwise
-                a `tuple` is returned where the first element is the sample tensor.
+                If `return_dict` is True, an [`~models.unets.unet_2d_condition.UNet2DConditionOutput`] is returned,
+                otherwise a `tuple` is returned where the first element is the sample tensor.
         """
         # By default samples have to be AT least a multiple of the overall upsampling factor.
         # The overall upsampling factor is equal to 2 ** (# num of upsampling layers).
...
@@ -107,8 +107,8 @@ def retrieve_timesteps(
         scheduler (`SchedulerMixin`):
             The scheduler to get timesteps from.
         num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used,
-            `timesteps` must be `None`.
+            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
+            must be `None`.
         device (`str` or `torch.device`, *optional*):
             The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
         timesteps (`List[int]`, *optional*):
@@ -922,9 +922,9 @@ class StableDiffusionControlNetPipeline(
                 accepted as an image. The dimensions of the output image defaults to `image`'s dimensions. If height
                 and/or width are passed, `image` is resized accordingly. If multiple ControlNets are specified in
                 `init`, images must be passed as a list such that each element of the list can be correctly batched for
-                input to a single ControlNet. When `prompt` is a list, and if a list of images is passed for a single ControlNet,
-                each will be paired with each prompt in the `prompt` list. This also applies to multiple ControlNets,
-                where a list of image lists can be passed to batch for each prompt and each ControlNet.
+                input to a single ControlNet. When `prompt` is a list, and if a list of images is passed for a single
+                ControlNet, each will be paired with each prompt in the `prompt` list. This also applies to multiple
+                ControlNets, where a list of image lists can be passed to batch for each prompt and each ControlNet.
             height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                 The height in pixels of the generated image.
             width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
@@ -962,10 +962,10 @@ class StableDiffusionControlNetPipeline(
                 not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
             ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
                 The output format of the generated image. Choose between `PIL.Image` or `np.array`.
             return_dict (`bool`, *optional*, defaults to `True`):
...
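The rewrapped passage about multiple ControlNets above describes how conditioning images are batched against prompts. A hedged sketch of that list-of-images convention; the ControlNet repositories, base checkpoint, prompt, and conditioning image paths are assumptions used only for illustration:

```py
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Two ControlNets: the `image` argument then becomes a list with one conditioning image per ControlNet.
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("canny.png")  # placeholder conditioning images
depth_image = load_image("depth.png")

image = pipe("a futuristic city at dusk", image=[canny_image, depth_image]).images[0]
```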
@@ -978,10 +978,10 @@ class StableDiffusionControlNetImg2ImgPipeline(
                 not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
             ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
                 The output format of the generated image. Choose between `PIL.Image` or `np.array`.
             return_dict (`bool`, *optional*, defaults to `True`):
...
@@ -1167,11 +1167,12 @@ class StableDiffusionControlNetInpaintPipeline(
             width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                 The width in pixels of the generated image.
             padding_mask_crop (`int`, *optional*, defaults to `None`):
-                The size of margin in the crop to be applied to the image and masking. If `None`, no crop is applied to image and mask_image. If
-                `padding_mask_crop` is not `None`, it will first find a rectangular region with the same aspect ration of the image and
-                contains all masked area, and then expand that area based on `padding_mask_crop`. The image and mask_image will then be cropped based on
-                the expanded area before resizing to the original image size for inpainting. This is useful when the masked area is small while the image is large
-                and contain information irrelevant for inpainting, such as background.
+                The size of margin in the crop to be applied to the image and masking. If `None`, no crop is applied to
+                image and mask_image. If `padding_mask_crop` is not `None`, it will first find a rectangular region
+                with the same aspect ration of the image and contains all masked area, and then expand that area based
+                on `padding_mask_crop`. The image and mask_image will then be cropped based on the expanded area before
+                resizing to the original image size for inpainting. This is useful when the masked area is small while
+                the image is large and contain information irrelevant for inpainting, such as background.
             strength (`float`, *optional*, defaults to 1.0):
                 Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a
                 starting point and more noise is added the higher the `strength`. The number of denoising steps depends
@@ -1207,10 +1208,10 @@ class StableDiffusionControlNetInpaintPipeline(
                 not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
             ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
                 The output format of the generated image. Choose between `PIL.Image` or `np.array`.
             return_dict (`bool`, *optional*, defaults to `True`):
...
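`padding_mask_crop`, rewrapped above, is passed directly on the inpainting call. A rough sketch under assumed checkpoint names and placeholder input images; only the `padding_mask_crop=32` line is the point being illustrated:

```py
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from diffusers.utils import load_image

# Assumed checkpoints and placeholder inputs.
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

init_image = load_image("room.png")
mask_image = load_image("room_mask.png")
control_image = load_image("room_inpaint_condition.png")

# When the masked region is small relative to the image, cropping around the mask with a
# 32-pixel margin keeps the model focused on the area that actually needs inpainting.
image = pipe(
    "a modern armchair",
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    padding_mask_crop=32,
).images[0]
```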
@@ -1194,11 +1194,12 @@ class StableDiffusionXLControlNetInpaintPipeline(
             width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
                 The width in pixels of the generated image.
             padding_mask_crop (`int`, *optional*, defaults to `None`):
-                The size of margin in the crop to be applied to the image and masking. If `None`, no crop is applied to image and mask_image. If
-                `padding_mask_crop` is not `None`, it will first find a rectangular region with the same aspect ration of the image and
-                contains all masked area, and then expand that area based on `padding_mask_crop`. The image and mask_image will then be cropped based on
-                the expanded area before resizing to the original image size for inpainting. This is useful when the masked area is small while the image is large
-                and contain information irrelevant for inpainting, such as background.
+                The size of margin in the crop to be applied to the image and masking. If `None`, no crop is applied to
+                image and mask_image. If `padding_mask_crop` is not `None`, it will first find a rectangular region
+                with the same aspect ration of the image and contains all masked area, and then expand that area based
+                on `padding_mask_crop`. The image and mask_image will then be cropped based on the expanded area before
+                resizing to the original image size for inpainting. This is useful when the masked area is small while
+                the image is large and contain information irrelevant for inpainting, such as background.
             strength (`float`, *optional*, defaults to 0.9999):
                 Conceptually, indicates how much to transform the masked portion of the reference `image`. Must be
                 between 0 and 1. `image` will be used as a starting point, adding more noise to it the larger the
@@ -1247,10 +1248,10 @@ class StableDiffusionXLControlNetInpaintPipeline(
                 argument.
             ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
                 Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
                 If not provided, pooled text embeddings will be generated from `prompt` input argument.
...
@@ -1039,10 +1039,10 @@ class StableDiffusionXLControlNetPipeline(
                 argument.
             ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
                 The output format of the generated image. Choose between `PIL.Image` or `np.array`.
             return_dict (`bool`, *optional*, defaults to `True`):
...
@@ -1178,10 +1178,10 @@ class StableDiffusionXLControlNetImg2ImgPipeline(
                 input argument.
             ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
             ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
+                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
+                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
+                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
+                provided, embeddings are computed from the `ip_adapter_image` input argument.
             output_type (`str`, *optional*, defaults to `"pil"`):
                 The output format of the generate image. Choose between
                 [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
...
@@ -89,8 +89,8 @@ def retrieve_timesteps(
         scheduler (`SchedulerMixin`):
             The scheduler to get timesteps from.
         num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used,
-            `timesteps` must be `None`.
+            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
+            must be `None`.
         device (`str` or `torch.device`, *optional*):
             The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
         timesteps (`List[int]`, *optional*):
...
@@ -129,8 +129,8 @@ def retrieve_timesteps(
         scheduler (`SchedulerMixin`):
             The scheduler to get timesteps from.
         num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used,
-            `timesteps` must be `None`.
+            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
+            must be `None`.
         device (`str` or `torch.device`, *optional*):
             The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
         timesteps (`List[int]`, *optional*):
...
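Several hunks above touch the shared `retrieve_timesteps` helper that is copied into many pipeline modules. A minimal sketch of how it is typically called; the import path assumes the AnimateDiff video-to-video module whose docstring appears above, and the scheduler and step count are arbitrary choices:

```py
from diffusers import DDIMScheduler
from diffusers.pipelines.animatediff.pipeline_animatediff_video2video import retrieve_timesteps

scheduler = DDIMScheduler()

# Derive the schedule from a step count; `timesteps` must stay `None` in this case.
timesteps, num_inference_steps = retrieve_timesteps(scheduler, num_inference_steps=25, device="cpu")
print(num_inference_steps, timesteps[:3])

# Alternatively, an explicit `timesteps` list can be passed (with `num_inference_steps=None`),
# but only for schedulers whose `set_timesteps` accepts custom timesteps.
```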