Unverified Commit a69754bb authored by Steven Liu, committed by GitHub

[docs] Clean up pipeline apis (#3905)

* start with stable diffusion

* fix

* finish stable diffusion pipelines

* fix path to pipeline output

* fix flax paths

* fix copies

* add up to score sde ve

* finish first pass of pipelines

* fix copies

* second review

* align doc titles

* more review fixes

* final review
parent bcc570b9
@@ -54,15 +54,14 @@ class StableUnCLIPPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
    """
    Pipeline for text-to-image generation using stable unCLIP.

-   This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
-   library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+   This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
+   implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        prior_tokenizer ([`CLIPTokenizer`]):
-           Tokenizer of class
-           [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
+           A [`CLIPTokenizer`].
        prior_text_encoder ([`CLIPTextModelWithProjection`]):
-           Frozen text-encoder.
+           Frozen [`CLIPTextModelWithProjection`] text-encoder.
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        prior_scheduler ([`KarrasDiffusionSchedulers`]):
@@ -72,13 +71,13 @@ class StableUnCLIPPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
            embeddings after the noise has been applied.
        image_noising_scheduler ([`KarrasDiffusionSchedulers`]):
            Noise schedule for adding noise to the predicted image embeddings. The amount of noise to add is determined
-           by `noise_level` in `StableUnCLIPPipeline.__call__`.
-       tokenizer (`CLIPTokenizer`):
-           Tokenizer of class
-           [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
+           by the `noise_level`.
+       tokenizer ([`CLIPTokenizer`]):
+           A [`CLIPTokenizer`].
        text_encoder ([`CLIPTextModel`]):
-           Frozen text-encoder.
-       unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
+           Frozen [`CLIPTextModel`] text-encoder.
+       unet ([`UNet2DConditionModel`]):
+           A [`UNet2DConditionModel`] to denoise the encoded image latents.
        scheduler ([`KarrasDiffusionSchedulers`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
@@ -145,27 +144,25 @@ class StableUnCLIPPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
    def enable_vae_slicing(self):
        r"""
-       Enable sliced VAE decoding.
-       When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
-       steps. This is useful to save some memory and allow larger batch sizes.
+       Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
+       compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
    def disable_vae_slicing(self):
        r"""
-       Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to
+       Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()

    def enable_model_cpu_offload(self, gpu_id=0):
        r"""
-       Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared
-       to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward`
-       method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with
-       `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`.
+       Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
+       time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
+       Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
+       iterative execution of the `unet`.
        """
        if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
            from accelerate import cpu_offload_with_hook
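The memory-saving toggles documented in this hunk are plain method calls on the pipeline. A minimal usage sketch (the checkpoint id is illustrative, and `accelerate` must be installed for CPU offload):

```py
import torch
from diffusers import StableUnCLIPPipeline

# illustrative checkpoint id -- substitute the stable unCLIP weights you actually use
pipe = StableUnCLIPPipeline.from_pretrained("fusing/stable-unclip-2-1-l", torch_dtype=torch.float16)

pipe.enable_vae_slicing()        # decode the batch slice by slice to save memory
pipe.enable_model_cpu_offload()  # keep only the active sub-model on the GPU
```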
@@ -575,8 +572,8 @@ class StableUnCLIPPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
        Add noise to the image embeddings. The amount of noise is controlled by a `noise_level` input. A higher
        `noise_level` increases the variance in the final un-noised images.

-       The noise is applied in two ways
-       1. A noise schedule is applied directly to the embeddings
+       The noise is applied in two ways:
+       1. A noise schedule is applied directly to the embeddings.
        2. A vector of sinusoidal time embeddings is appended to the output.

        In both cases, the amount of noise is controlled by the same `noise_level`.
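A rough sketch of those two steps, assuming a scheduler with a standard `add_noise` method and a sinusoidal timestep embedder are available; the helper name and arguments below are illustrative, not the pipeline's actual implementation:

```py
import torch


def noise_image_embeddings_sketch(image_embeds, noise_level, image_noising_scheduler, time_embedder):
    # illustrative helper only
    noise = torch.randn_like(image_embeds)
    timesteps = torch.full((image_embeds.shape[0],), noise_level, dtype=torch.long, device=image_embeds.device)

    # 1. apply the noise schedule directly to the embeddings
    noisy_embeds = image_noising_scheduler.add_noise(image_embeds, noise, timesteps)

    # 2. append a sinusoidal time embedding of the same `noise_level` to the output
    time_embeds = time_embedder(timesteps)
    return torch.cat((noisy_embeds, time_embeds), dim=-1)
```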
@@ -639,87 +636,76 @@ class StableUnCLIPPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraL
        prior_latents: Optional[torch.FloatTensor] = None,
    ):
        """
-       Function invoked when calling the pipeline for generation.
+       The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`, *optional*):
-               The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-               instead.
-           height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+               The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
+           height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
-           width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+           width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 20):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 10.0):
-               Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-               `guidance_scale` is defined as `w` of equation 2. of [Imagen
-               Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-               1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-               usually at the expense of lower image quality.
+               A higher guidance scale value encourages the model to generate images closely linked to the text
+               `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            negative_prompt (`str` or `List[str]`, *optional*):
-               The prompt or prompts not to guide the image generation. If not defined, one has to pass
-               `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-               less than `1`).
+               The prompt or prompts to guide what to not include in image generation. If not defined, you need to
+               pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            eta (`float`, *optional*, defaults to 0.0):
-               Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
-               [`schedulers.DDIMScheduler`], will be ignored for others.
+               Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
+               to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-               One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-               to make generation deterministic.
+               A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
+               generation deterministic.
            latents (`torch.FloatTensor`, *optional*):
-               Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+               Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-               tensor will ge generated by sampling using the supplied random `generator`.
+               tensor is generated by sampling using the supplied random `generator`.
            prompt_embeds (`torch.FloatTensor`, *optional*):
-               Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-               provided, text embeddings will be generated from `prompt` input argument.
+               Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
+               provided, text embeddings are generated from the `prompt` input argument.
            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-               Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-               weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-               argument.
+               Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
+               not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
            output_type (`str`, *optional*, defaults to `"pil"`):
-               The output format of the generate image. Choose between
-               [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+               The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
-               Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
-               plain tuple.
+               Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            callback (`Callable`, *optional*):
-               A function that will be called every `callback_steps` steps during inference. The function will be
-               called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
+               A function that calls every `callback_steps` steps during inference. The function is called with the
+               following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
            callback_steps (`int`, *optional*, defaults to 1):
-               The frequency at which the `callback` function will be called. If not specified, the callback will be
-               called at every step.
+               The frequency at which the `callback` function is called. If not specified, the callback is called at
+               every step.
            cross_attention_kwargs (`dict`, *optional*):
-               A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-               `self.processor` in
-               [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
+               A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
+               [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
            noise_level (`int`, *optional*, defaults to `0`):
                The amount of noise to add to the image embeddings. A higher `noise_level` increases the variance in
-               the final un-noised images. See `StableUnCLIPPipeline.noise_image_embeddings` for details.
+               the final un-noised images. See [`StableUnCLIPPipeline.noise_image_embeddings`] for more details.
            prior_num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps in the prior denoising process. More denoising steps usually lead to a
                higher quality image at the expense of slower inference.
            prior_guidance_scale (`float`, *optional*, defaults to 4.0):
-               Guidance scale for the prior denoising process as defined in [Classifier-Free Diffusion
-               Guidance](https://arxiv.org/abs/2207.12598). `prior_guidance_scale` is defined as `w` of equation 2. of
-               [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting
-               `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-               the text `prompt`, usually at the expense of lower image quality.
+               A higher guidance scale value encourages the model to generate images closely linked to the text
+               `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            prior_latents (`torch.FloatTensor`, *optional*):
-               Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+               Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                embedding generation in the prior denoising process. Can be used to tweak the same generation with
-               different prompts. If not provided, a latents tensor will ge generated by sampling using the supplied
-               random `generator`.
+               different prompts. If not provided, a latents tensor is generated by sampling using the supplied random
+               `generator`.

        Examples:

        Returns:
-           [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~ pipeline_utils.ImagePipelineOutput`] if `return_dict` is
-           True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
+           [`~pipelines.ImagePipelineOutput`] or `tuple`:
+               [`~pipelines.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning a
+               tuple, the first element is a list with the generated images.
        """
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
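Putting the main `__call__` arguments above together, a minimal usage sketch (the model id is illustrative; substitute the stable unCLIP checkpoint you actually use):

```py
import torch
from diffusers import StableUnCLIPPipeline

pipe = StableUnCLIPPipeline.from_pretrained(
    "fusing/stable-unclip-2-1-l", torch_dtype=torch.float16  # illustrative checkpoint
).to("cuda")

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    num_inference_steps=20,
    guidance_scale=10.0,
    prior_guidance_scale=4.0,
    noise_level=0,
).images[0]
image.save("astronaut.png")
```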
@@ -65,10 +65,10 @@ EXAMPLE_DOC_STRING = """
class StableUnCLIPImg2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
    """
-   Pipeline for text-guided image to image generation using stable unCLIP.
+   Pipeline for text-guided image-to-image generation using stable unCLIP.

-   This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
-   library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+   This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
+   implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        feature_extractor ([`CLIPImageProcessor`]):
@@ -80,13 +80,13 @@ class StableUnCLIPImg2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin
            embeddings after the noise has been applied.
        image_noising_scheduler ([`KarrasDiffusionSchedulers`]):
            Noise schedule for adding noise to the predicted image embeddings. The amount of noise to add is determined
-           by `noise_level` in `StableUnCLIPPipeline.__call__`.
-       tokenizer (`CLIPTokenizer`):
-           Tokenizer of class
-           [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-       text_encoder ([`CLIPTextModel`]):
-           Frozen text-encoder.
-       unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
+           by the `noise_level`.
+       tokenizer (`~transformers.CLIPTokenizer`):
+           A [`~transformers.CLIPTokenizer`].
+       text_encoder ([`~transformers.CLIPTextModel`]):
+           Frozen [`~transformers.CLIPTextModel`] text-encoder.
+       unet ([`UNet2DConditionModel`]):
+           A [`UNet2DConditionModel`] to denoise the encoded image latents.
        scheduler ([`KarrasDiffusionSchedulers`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
@@ -147,27 +147,25 @@ class StableUnCLIPImg2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin
    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
    def enable_vae_slicing(self):
        r"""
-       Enable sliced VAE decoding.
-       When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
-       steps. This is useful to save some memory and allow larger batch sizes.
+       Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
+       compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
    def disable_vae_slicing(self):
        r"""
-       Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to
+       Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()

    def enable_model_cpu_offload(self, gpu_id=0):
        r"""
-       Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared
-       to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward`
-       method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with
-       `enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`.
+       Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
+       time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
+       Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
+       iterative execution of the `unet`.
        """
        if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
            from accelerate import cpu_offload_with_hook
@@ -542,8 +540,8 @@ class StableUnCLIPImg2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin
        Add noise to the image embeddings. The amount of noise is controlled by a `noise_level` input. A higher
        `noise_level` increases the variance in the final un-noised images.

-       The noise is applied in two ways
-       1. A noise schedule is applied directly to the embeddings
+       The noise is applied in two ways:
+       1. A noise schedule is applied directly to the embeddings.
        2. A vector of sinusoidal time embeddings is appended to the output.

        In both cases, the amount of noise is controlled by the same `noise_level`.
@@ -603,82 +601,73 @@ class StableUnCLIPImg2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin
        image_embeds: Optional[torch.FloatTensor] = None,
    ):
        r"""
-       Function invoked when calling the pipeline for generation.
+       The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide the image generation. If not defined, either `prompt_embeds` will be
                used or prompt is initialized to `""`.
            image (`torch.FloatTensor` or `PIL.Image.Image`):
-               `Image`, or tensor representing an image batch. The image will be encoded to its CLIP embedding which
-               the unet will be conditioned on. Note that the image is _not_ encoded by the vae and then used as the
-               latents in the denoising process such as in the standard stable diffusion text guided image variation
-               process.
-           height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+               `Image` or tensor representing an image batch. The image is encoded to its CLIP embedding which the
+               `unet` is conditioned on. The image is _not_ encoded by the `vae` and then used as the latents in the
+               denoising process like it is in the standard Stable Diffusion text-guided image variation process.
+           height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
-           width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+           width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 20):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 10.0):
-               Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-               `guidance_scale` is defined as `w` of equation 2. of [Imagen
-               Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-               1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-               usually at the expense of lower image quality.
+               A higher guidance scale value encourages the model to generate images closely linked to the text
+               `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            negative_prompt (`str` or `List[str]`, *optional*):
-               The prompt or prompts not to guide the image generation. If not defined, one has to pass
-               `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-               less than `1`).
+               The prompt or prompts to guide what to not include in image generation. If not defined, you need to
+               pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            eta (`float`, *optional*, defaults to 0.0):
-               Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
-               [`schedulers.DDIMScheduler`], will be ignored for others.
+               Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
+               to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-               One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-               to make generation deterministic.
+               A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
+               generation deterministic.
            latents (`torch.FloatTensor`, *optional*):
-               Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+               Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-               tensor will ge generated by sampling using the supplied random `generator`.
+               tensor is generated by sampling using the supplied random `generator`.
            prompt_embeds (`torch.FloatTensor`, *optional*):
-               Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-               provided, text embeddings will be generated from `prompt` input argument.
+               Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
+               provided, text embeddings are generated from the `prompt` input argument.
            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-               Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-               weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-               argument.
+               Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
+               not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
            output_type (`str`, *optional*, defaults to `"pil"`):
-               The output format of the generate image. Choose between
-               [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+               The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
-               Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
-               plain tuple.
+               Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
            callback (`Callable`, *optional*):
-               A function that will be called every `callback_steps` steps during inference. The function will be
-               called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
+               A function that calls every `callback_steps` steps during inference. The function is called with the
+               following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
            callback_steps (`int`, *optional*, defaults to 1):
-               The frequency at which the `callback` function will be called. If not specified, the callback will be
-               called at every step.
+               The frequency at which the `callback` function is called. If not specified, the callback is called at
+               every step.
            cross_attention_kwargs (`dict`, *optional*):
-               A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-               `self.processor` in
-               [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
+               A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
+               [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
            noise_level (`int`, *optional*, defaults to `0`):
                The amount of noise to add to the image embeddings. A higher `noise_level` increases the variance in
-               the final un-noised images. See `StableUnCLIPPipeline.noise_image_embeddings` for details.
+               the final un-noised images. See [`StableUnCLIPPipeline.noise_image_embeddings`] for more details.
            image_embeds (`torch.FloatTensor`, *optional*):
-               Pre-generated CLIP embeddings to condition the unet on. Note that these are not latents to be used in
-               the denoising process. If you want to provide pre-generated latents, pass them to `__call__` as
-               `latents`.
+               Pre-generated CLIP embeddings to condition the `unet` on. These latents are not used in the denoising
+               process. If you want to provide pre-generated latents, pass them to `__call__` as `latents`.

        Examples:

        Returns:
-           [`~pipelines.ImagePipelineOutput`] or `tuple`: [`~ pipeline_utils.ImagePipelineOutput`] if `return_dict` is
-           True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
+           [`~pipelines.ImagePipelineOutput`] or `tuple`:
+               [`~pipelines.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When returning a
+               tuple, the first element is a list with the generated images.
        """
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
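A similar usage sketch for the image-to-image variant; the checkpoint name and image path are placeholders:

```py
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16  # illustrative checkpoint
).to("cuda")

init_image = load_image("path/to/image.png")  # replace with your own image (local path or URL)
image = pipe(image=init_image, prompt="a fantasy landscape", noise_level=0).images[0]
image.save("variation.png")
```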
@@ -21,32 +21,29 @@ logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
class StableDiffusionPipelineSafe(DiffusionPipeline):
    r"""
-   Pipeline for text-to-image generation using Safe Latent Diffusion.
-   The implementation is based on the [`StableDiffusionPipeline`]
-   This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
-   library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+   Pipeline based on the [`StableDiffusionPipeline`] for text-to-image generation using Safe Latent Diffusion.
+
+   This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
+   implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        vae ([`AutoencoderKL`]):
-           Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
-       text_encoder ([`CLIPTextModel`]):
-           Frozen text-encoder. Stable Diffusion uses the text portion of
-           [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
-           the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
-       tokenizer (`CLIPTokenizer`):
-           Tokenizer of class
-           [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-       unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
+           Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
+       text_encoder ([`~transformers.CLIPTextModel`]):
+           Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
+       tokenizer ([`~transformers.CLIPTokenizer`]):
+           A `CLIPTokenizer` to tokenize text.
+       unet ([`UNet2DConditionModel`]):
+           A `UNet2DConditionModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
        safety_checker ([`StableDiffusionSafetyChecker`]):
            Classification module that estimates whether generated images could be considered offensive or harmful.
-           Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details.
-       feature_extractor ([`CLIPImageProcessor`]):
-           Model that extracts features from generated images to be used as inputs for the `safety_checker`.
+           Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
+           about a model's potential harms.
+       feature_extractor ([`~transformers.CLIPImageProcessor`]):
+           A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
    """

    _optional_components = ["safety_checker", "feature_extractor"]
@@ -489,78 +486,82 @@ class StableDiffusionPipelineSafe(DiffusionPipeline):
        sld_mom_beta: Optional[float] = 0.4,
    ):
        r"""
-       Function invoked when calling the pipeline for generation.
+       The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`):
-               The prompt or prompts to guide the image generation.
-           height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+               The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
+           height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
-           width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
+           width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 7.5):
-               Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-               `guidance_scale` is defined as `w` of equation 2. of [Imagen
-               Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-               1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-               usually at the expense of lower image quality.
+               A higher guidance scale value encourages the model to generate images closely linked to the text
+               `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            negative_prompt (`str` or `List[str]`, *optional*):
-               The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
-               if `guidance_scale` is less than `1`).
+               The prompt or prompts to guide what to not include in image generation. If not defined, you need to
+               pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            eta (`float`, *optional*, defaults to 0.0):
-               Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
-               [`schedulers.DDIMScheduler`], will be ignored for others.
-           generator (`torch.Generator`, *optional*):
-               One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-               to make generation deterministic.
+               Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
+               to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
+           generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
+               A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
+               generation deterministic.
            latents (`torch.FloatTensor`, *optional*):
-               Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+               Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-               tensor will ge generated by sampling using the supplied random `generator`.
+               tensor is generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"pil"`):
-               The output format of the generate image. Choose between
-               [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+               The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
                plain tuple.
            callback (`Callable`, *optional*):
-               A function that will be called every `callback_steps` steps during inference. The function will be
-               called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
+               A function that calls every `callback_steps` steps during inference. The function is called with the
+               following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
            callback_steps (`int`, *optional*, defaults to 1):
-               The frequency at which the `callback` function will be called. If not specified, the callback will be
-               called at every step.
+               The frequency at which the `callback` function is called. If not specified, the callback is called at
+               every step.
            sld_guidance_scale (`float`, *optional*, defaults to 1000):
-               Safe latent guidance as defined in [Safe Latent Diffusion](https://arxiv.org/abs/2211.05105).
-               `sld_guidance_scale` is defined as sS of Eq. 6. If set to be less than 1, safety guidance will be
-               disabled.
+               If `sld_guidance_scale < 1`, safety guidance is disabled.
            sld_warmup_steps (`int`, *optional*, defaults to 10):
-               Number of warmup steps for safety guidance. SLD will only be applied for diffusion steps greater than
-               `sld_warmup_steps`. `sld_warmup_steps` is defined as `delta` of [Safe Latent
-               Diffusion](https://arxiv.org/abs/2211.05105).
+               Number of warmup steps for safety guidance. SLD is only applied for diffusion steps greater than
+               `sld_warmup_steps`.
            sld_threshold (`float`, *optional*, defaults to 0.01):
-               Threshold that separates the hyperplane between appropriate and inappropriate images. `sld_threshold`
-               is defined as `lamda` of Eq. 5 in [Safe Latent Diffusion](https://arxiv.org/abs/2211.05105).
+               Threshold that separates the hyperplane between appropriate and inappropriate images.
            sld_momentum_scale (`float`, *optional*, defaults to 0.3):
-               Scale of the SLD momentum to be added to the safety guidance at each diffusion step. If set to 0.0
-               momentum will be disabled. Momentum is already built up during warmup, i.e. for diffusion steps smaller
-               than `sld_warmup_steps`. `sld_momentum_scale` is defined as `sm` of Eq. 7 in [Safe Latent
-               Diffusion](https://arxiv.org/abs/2211.05105).
+               Scale of the SLD momentum to be added to the safety guidance at each diffusion step. If set to 0.0,
+               momentum is disabled. Momentum is built up during warmup for diffusion steps smaller than
+               `sld_warmup_steps`.
            sld_mom_beta (`float`, *optional*, defaults to 0.4):
                Defines how safety guidance momentum builds up. `sld_mom_beta` indicates how much of the previous
-               momentum will be kept. Momentum is already built up during warmup, i.e. for diffusion steps smaller
-               than `sld_warmup_steps`. `sld_mom_beta` is defined as `beta m` of Eq. 8 in [Safe Latent
-               Diffusion](https://arxiv.org/abs/2211.05105).
+               momentum is kept. Momentum is built up during warmup for diffusion steps smaller than
+               `sld_warmup_steps`.

        Returns:
            [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
-               [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple.
-               When returning a tuple, the first element is a list with the generated images, and the second element is a
-               list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
-               (nsfw) content, according to the `safety_checker`.
+               If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
+               otherwise a `tuple` is returned where the first element is a list with the generated images and the
+               second element is a list of `bool`s indicating whether the corresponding generated image contains
+               "not-safe-for-work" (nsfw) content.

+       Examples:
+
+       ```py
+       import torch
+       from diffusers import StableDiffusionPipelineSafe
+       from diffusers.pipelines.stable_diffusion_safe import SafetyConfig
+
+       pipeline = StableDiffusionPipelineSafe.from_pretrained(
+           "AIML-TUDA/stable-diffusion-safe", torch_dtype=torch.float16
+       )
+       prompt = "the four horsewomen of the apocalypse, painting by tom of finland, gaston bussiere, craig mullins, j. c. leyendecker"
+       image = pipeline(prompt=prompt, **SafetyConfig.MEDIUM).images[0]
+       ```
        """
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
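The `sld_*` arguments documented above can also be passed explicitly rather than unpacking a preset config; a sketch using the documented default values:

```py
import torch
from diffusers import StableDiffusionPipelineSafe

pipeline = StableDiffusionPipelineSafe.from_pretrained(
    "AIML-TUDA/stable-diffusion-safe", torch_dtype=torch.float16
).to("cuda")

# values mirror the documented defaults; raise sld_guidance_scale for stronger safety guidance
image = pipeline(
    prompt="portrait of a knight in ornate armor",
    sld_guidance_scale=1000,
    sld_warmup_steps=10,
    sld_threshold=0.01,
    sld_momentum_scale=0.3,
    sld_mom_beta=0.4,
).images[0]
```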
@@ -146,17 +146,15 @@ class StableDiffusionXLPipeline(DiffusionPipeline, FromSingleFileMixin, LoraLoad
    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
    def enable_vae_slicing(self):
        r"""
-       Enable sliced VAE decoding.
-       When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
-       steps. This is useful to save some memory and allow larger batch sizes.
+       Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
+       compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
    def disable_vae_slicing(self):
        r"""
-       Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to
+       Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()
@@ -164,17 +162,16 @@ class StableDiffusionXLPipeline(DiffusionPipeline, FromSingleFileMixin, LoraLoad
    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
    def enable_vae_tiling(self):
        r"""
-       Enable tiled VAE decoding.
-       When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in
-       several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
+       Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
+       compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
+       processing larger images.
        """
        self.vae.enable_tiling()

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
    def disable_vae_tiling(self):
        r"""
-       Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to
+       Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_tiling()
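Both options are plain method calls on the pipeline; a minimal sketch:

```py
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

pipe.enable_vae_slicing()   # decode the batch slice by slice to save memory
pipe.enable_vae_tiling()    # decode/encode large images tile by tile

# ... run the pipeline ...

pipe.disable_vae_slicing()  # back to single-step decoding
pipe.disable_vae_tiling()
```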
@@ -652,8 +649,8 @@ class StableDiffusionXLPipeline(DiffusionPipeline, FromSingleFileMixin, LoraLoad
                The output format of the generated image. Choose between
                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
-               Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] instead of a
-               plain tuple.
+               Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
+               of a plain tuple.
            callback (`Callable`, *optional*):
                A function that will be called every `callback_steps` steps during inference. The function will be
                called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.

@@ -679,11 +676,9 @@ class StableDiffusionXLPipeline(DiffusionPipeline, FromSingleFileMixin, LoraLoad
        Examples:

        Returns:
-           [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] or `tuple`:
-           [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise a
-           `tuple. When returning a tuple, the first element is a list with the generated images, and the second
-           element is a list of `bool`s denoting whether the corresponding generated image likely represents
-           "not-safe-for-work" (nsfw) content, according to the `safety_checker`.
+           [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] or `tuple`:
+               [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise
+               a `tuple`. When returning a tuple, the first element is a list with the generated images.
        """
        # 0. Default height and width to unet
        height = height or self.default_sample_size * self.vae_scale_factor
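A short end-to-end sketch for the pipeline whose output docs are shown above:

```py
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

output = pipe(prompt="an astronaut riding a green horse", return_dict=True)
output.images[0].save("sdxl.png")
```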
@@ -153,17 +153,15 @@ class StableDiffusionXLImg2ImgPipeline(DiffusionPipeline, FromSingleFileMixin, L
    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
    def enable_vae_slicing(self):
        r"""
-       Enable sliced VAE decoding.
-       When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
-       steps. This is useful to save some memory and allow larger batch sizes.
+       Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
+       compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
    def disable_vae_slicing(self):
        r"""
-       Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to
+       Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()

@@ -171,17 +169,16 @@ class StableDiffusionXLImg2ImgPipeline(DiffusionPipeline, FromSingleFileMixin, L
    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
    def enable_vae_tiling(self):
        r"""
-       Enable tiled VAE decoding.
-       When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in
-       several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
+       Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
+       compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
+       processing larger images.
        """
        self.vae.enable_tiling()

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
    def disable_vae_tiling(self):
        r"""
-       Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to
+       Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_tiling()
...@@ -258,17 +258,15 @@ class StableDiffusionXLInpaintPipeline( ...@@ -258,17 +258,15 @@ class StableDiffusionXLInpaintPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -276,17 +274,16 @@ class StableDiffusionXLInpaintPipeline( ...@@ -276,17 +274,16 @@ class StableDiffusionXLInpaintPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self): def enable_vae_tiling(self):
r""" r"""
Enable tiled VAE decoding. Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in processing larger images.
several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
""" """
self.vae.enable_tiling() self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self): def disable_vae_tiling(self):
r""" r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_tiling() self.vae.disable_tiling()
......
...@@ -24,17 +24,13 @@ from ..pipeline_utils import DiffusionPipeline, ImagePipelineOutput ...@@ -24,17 +24,13 @@ from ..pipeline_utils import DiffusionPipeline, ImagePipelineOutput
class KarrasVePipeline(DiffusionPipeline): class KarrasVePipeline(DiffusionPipeline):
r""" r"""
Stochastic sampling from Karras et al. [1] tailored to the Variance-Expanding (VE) models [2]. Use Algorithm 2 and Pipeline for unconditional image generation.
the VE column of Table 1 from [1] for reference.
[1] Karras, Tero, et al. "Elucidating the Design Space of Diffusion-Based Generative Models."
https://arxiv.org/abs/2206.00364 [2] Song, Yang, et al. "Score-based generative modeling through stochastic
differential equations." https://arxiv.org/abs/2011.13456
Parameters: Parameters:
unet ([`UNet2DModel`]): U-Net architecture to denoise the encoded image. unet ([`UNet2DModel`]):
A `UNet2DModel` to denoise the encoded image.
scheduler ([`KarrasVeScheduler`]): scheduler ([`KarrasVeScheduler`]):
Scheduler for the diffusion process to be used in combination with `unet` to denoise the encoded image. A scheduler to be used in combination with `unet` to denoise the encoded image.
""" """
# add type hints for linting # add type hints for linting
...@@ -56,24 +52,28 @@ class KarrasVePipeline(DiffusionPipeline): ...@@ -56,24 +52,28 @@ class KarrasVePipeline(DiffusionPipeline):
**kwargs, **kwargs,
) -> Union[Tuple, ImagePipelineOutput]: ) -> Union[Tuple, ImagePipelineOutput]:
r""" r"""
The call function to the pipeline for generation.
Args: Args:
batch_size (`int`, *optional*, defaults to 1): batch_size (`int`, *optional*, defaults to 1):
The number of images to generate. The number of images to generate.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. Whether or not to return an [`ImagePipelineOutput`] instead of a plain tuple.
Example:
Returns: Returns:
[`~pipelines.ImagePipelineOutput`] or `tuple`: [`~pipelines.utils.ImagePipelineOutput`] if `return_dict` is [`~pipelines.ImagePipelineOutput`] or `tuple`:
True, otherwise a `tuple. When returning a tuple, the first element is a list with the generated images. If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
returned where the first element is a list with the generated images.
""" """
img_size = self.unet.config.sample_size img_size = self.unet.config.sample_size
......
...@@ -203,17 +203,15 @@ class StableDiffusionAdapterPipeline(DiffusionPipeline): ...@@ -203,17 +203,15 @@ class StableDiffusionAdapterPipeline(DiffusionPipeline):
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
......
...@@ -10,13 +10,12 @@ from ...utils import BaseOutput, OptionalDependencyNotAvailable, is_torch_availa ...@@ -10,13 +10,12 @@ from ...utils import BaseOutput, OptionalDependencyNotAvailable, is_torch_availa
@dataclass @dataclass
class TextToVideoSDPipelineOutput(BaseOutput): class TextToVideoSDPipelineOutput(BaseOutput):
""" """
Output class for text to video pipelines. Output class for text-to-video pipelines.
Args: Args:
frames (`List[np.ndarray]` or `torch.FloatTensor`) frames (`List[np.ndarray]` or `torch.FloatTensor`):
List of denoised frames (essentially images) as NumPy arrays of shape `(height, width, num_channels)` or as List of denoised frames (essentially images) as NumPy arrays of shape `(height, width, num_channels)` or as
a `torch` tensor. NumPy array present the denoised images of the diffusion pipeline. The length of the list a `torch` tensor. The length of the list denotes the video length (the number of frames).
denotes the video length i.e., the number of frames.
""" """
frames: Union[List[np.ndarray], torch.FloatTensor] frames: Union[List[np.ndarray], torch.FloatTensor]
......
...@@ -77,18 +77,18 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora ...@@ -77,18 +77,18 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora
r""" r"""
Pipeline for text-to-video generation. Pipeline for text-to-video generation.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`CLIPTextModel`]):
Frozen text-encoder. Same as Stable Diffusion 2. Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
tokenizer (`CLIPTokenizer`): tokenizer (`CLIPTokenizer`):
Tokenizer of class A [`~transformers.CLIPTokenizer`] to tokenize text.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). unet ([`UNet3DConditionModel`]):
unet ([`UNet3DConditionModel`]): Conditional U-Net architecture to denoise the encoded video latents. A [`UNet3DConditionModel`] to denoise the encoded video latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
...@@ -116,17 +116,15 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora ...@@ -116,17 +116,15 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -134,27 +132,26 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora ...@@ -134,27 +132,26 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self): def enable_vae_tiling(self):
r""" r"""
Enable tiled VAE decoding. Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in processing larger images.
several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
""" """
self.vae.enable_tiling() self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self): def disable_vae_tiling(self):
r""" r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_tiling() self.vae.disable_tiling()
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains on the GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
...@@ -466,15 +463,14 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora ...@@ -466,15 +463,14 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide video generation. If not defined, you need to pass `prompt_embeds`.
instead. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated video. The height in pixels of the generated video.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated video. The width in pixels of the generated video.
num_frames (`int`, *optional*, defaults to 16): num_frames (`int`, *optional*, defaults to 16):
The number of video frames that are generated. Defaults to 16 frames which at 8 frames per seconds The number of video frames that are generated. Defaults to 16 frames which at 8 frames per seconds
...@@ -483,55 +479,51 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora ...@@ -483,55 +479,51 @@ class TextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lora
The number of denoising steps. More denoising steps usually lead to a higher quality videos at the The number of denoising steps. More denoising steps usually lead to a higher quality videos at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate videos closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower video quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate videos that are closely linked to the text `prompt`,
usually at the expense of lower video quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the video generation. If not defined, one has to pass The prompt or prompts to guide what to not include in video generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`). num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. Latents should be of shape tensor is generated by sampling using the supplied random `generator`. Latents should be of shape
`(batch_size, num_channel, num_frames, height, width)`. `(batch_size, num_channel, num_frames, height, width)`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"np"`): output_type (`str`, *optional*, defaults to `"np"`):
The output format of the generate video. Choose between `torch.FloatTensor` or `np.array`. The output format of the generated video. Choose between `torch.FloatTensor` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] instead of a Whether or not to return a [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] instead
plain tuple. of a plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] or `tuple`: [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] is
When returning a tuple, the first element is a list with the generated frames. returned, otherwise a `tuple` is returned where the first element is a list with the generated frames.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
...@@ -137,20 +137,20 @@ def preprocess_video(video): ...@@ -137,20 +137,20 @@ def preprocess_video(video):
class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
r""" r"""
Pipeline for text-to-video generation. Pipeline for text-guided video-to-video generation.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations. Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`CLIPTextModel`]):
Frozen text-encoder. Same as Stable Diffusion 2. Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
tokenizer (`CLIPTokenizer`): tokenizer (`CLIPTokenizer`):
Tokenizer of class A [`~transformers.CLIPTokenizer`] to tokenize text.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). unet ([`UNet3DConditionModel`]):
unet ([`UNet3DConditionModel`]): Conditional U-Net architecture to denoise the encoded video latents. A [`UNet3DConditionModel`] to denoise the encoded video latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
...@@ -178,17 +178,15 @@ class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor ...@@ -178,17 +178,15 @@ class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -196,27 +194,26 @@ class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor ...@@ -196,27 +194,26 @@ class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self): def enable_vae_tiling(self):
r""" r"""
Enable tiled VAE decoding. Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in processing larger images.
several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
""" """
self.vae.enable_tiling() self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self): def disable_vae_tiling(self):
r""" r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_tiling() self.vae.disable_tiling()
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains on the GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
...@@ -550,75 +547,67 @@ class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor ...@@ -550,75 +547,67 @@ class VideoToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lor
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide video generation. If not defined, you need to pass `prompt_embeds`.
instead. video (`List[np.ndarray]` or `torch.FloatTensor`):
video: (`List[np.ndarray]` or `torch.FloatTensor`): `video` frames or tensor representing a video batch to be used as the starting point for the process.
`video` frames or tensor representing a video batch, that will be used as the starting point for the Can also accept video latents as `video`; if latents are passed directly, they are not encoded again.
process. Can also accpet video latents as `image`, if passing latents directly, it will not be encoded
again.
strength (`float`, *optional*, defaults to 0.8): strength (`float`, *optional*, defaults to 0.8):
Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` Indicates the extent to transform the reference `video`. Must be between 0 and 1. `video` is used as a
will be used as a starting point, adding more noise to it the larger the `strength`. The number of starting point, adding more noise to it the larger the `strength`. The number of denoising steps
denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will depends on the amount of noise initially added. When `strength` is 1, added noise is maximum and the
be maximum and the denoising process will run for the full number of iterations specified in denoising process runs for the full number of iterations specified in `num_inference_steps`. A value of
`num_inference_steps`. A value of 1, therefore, essentially ignores `image`. 1 essentially ignores `video`.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality videos at the The number of denoising steps. More denoising steps usually lead to a higher quality videos at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate videos closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower video quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate videos that are closely linked to the text `prompt`,
usually at the expense of lower video quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the video generation. If not defined, one has to pass The prompt or prompts to guide what to not include in video generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`).
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. Latents should be of shape tensor is generated by sampling using the supplied random `generator`. Latents should be of shape
`(batch_size, num_channel, num_frames, height, width)`. `(batch_size, num_channel, num_frames, height, width)`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"np"`): output_type (`str`, *optional*, defaults to `"np"`):
The output format of the generate video. Choose between `torch.FloatTensor` or `np.array`. The output format of the generated video. Choose between `torch.FloatTensor` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] instead of a Whether or not to return a [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] instead
plain tuple. of a plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] or `tuple`: [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.TextToVideoSDPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] is
When returning a tuple, the first element is a list with the generated frames. returned, otherwise a `tuple` is returned where the first element is a list with the generated frames.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
num_images_per_prompt = 1 num_images_per_prompt = 1
......
...@@ -172,6 +172,17 @@ class CrossFrameAttnProcessor2_0: ...@@ -172,6 +172,17 @@ class CrossFrameAttnProcessor2_0:
@dataclass @dataclass
class TextToVideoPipelineOutput(BaseOutput): class TextToVideoPipelineOutput(BaseOutput):
r"""
Output class for zero-shot text-to-video pipeline.
Args:
images (`List[PIL.Image.Image]` or `np.ndarray`):
List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
num_channels)`.
nsfw_content_detected (`List[bool]`):
List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or
`None` if safety checking could not be performed.
"""
images: Union[List[PIL.Image.Image], np.ndarray] images: Union[List[PIL.Image.Image], np.ndarray]
nsfw_content_detected: Optional[List[bool]] nsfw_content_detected: Optional[List[bool]]
...@@ -264,28 +275,27 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline): ...@@ -264,28 +275,27 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline):
r""" r"""
Pipeline for zero-shot text-to-video generation using Stable Diffusion. Pipeline for zero-shot text-to-video generation using Stable Diffusion.
This model inherits from [`StableDiffusionPipeline`]. Check the superclass documentation for the generic methods This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
tokenizer (`CLIPTokenizer`): tokenizer (`CLIPTokenizer`):
Tokenizer of class A [`~transformers.CLIPTokenizer`] to tokenize text.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). unet ([`UNet2DConditionModel`]):
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. A [`UNet2DConditionModel`] to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
about a model's potential harms.
feature_extractor ([`CLIPImageProcessor`]): feature_extractor ([`CLIPImageProcessor`]):
Model that extracts features from generated images to be used as inputs for the `safety_checker`. A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`.
""" """
def __init__( def __init__(
...@@ -311,16 +321,22 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline): ...@@ -311,16 +321,22 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline):
def forward_loop(self, x_t0, t0, t1, generator): def forward_loop(self, x_t0, t0, t1, generator):
""" """
Perform ddpm forward process from time t0 to t1. This is the same as adding noise with corresponding variance. Perform DDPM forward process from time t0 to t1. This is the same as adding noise with corresponding variance.
Args: Args:
x_t0: latent code at time t0 x_t0:
t0: t0 Latent code at time t0.
t1: t1 t0:
generator: torch.Generator object Timestep at t0.
t1:
Timestep at t1.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
generation deterministic.
Returns: Returns:
x_t1: forward process applied to x_t0 from time t0 to t1. x_t1:
Forward process applied to x_t0 from time t0 to t1.
""" """
eps = torch.randn(x_t0.size(), generator=generator, dtype=x_t0.dtype, device=x_t0.device) eps = torch.randn(x_t0.size(), generator=generator, dtype=x_t0.dtype, device=x_t0.device)
alpha_vec = torch.prod(self.scheduler.alphas[t0:t1]) alpha_vec = torch.prod(self.scheduler.alphas[t0:t1])
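The two lines above sample `eps` and accumulate the cumulative alpha over `[t0, t1)`; the jump itself is folded in this diff, but it reduces to the usual closed-form DDPM perturbation, sketched here (this may differ in detail from the hidden code):
```py
# Closed-form DDPM forward jump from t0 to t1 (sketch, not necessarily the exact folded code):
#   x_t1 = sqrt(prod(alphas[t0:t1])) * x_t0 + sqrt(1 - prod(alphas[t0:t1])) * eps
x_t1 = torch.sqrt(alpha_vec) * x_t0 + torch.sqrt(1 - alpha_vec) * eps
return x_t1
```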
...@@ -340,30 +356,35 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline): ...@@ -340,30 +356,35 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline):
cross_attention_kwargs=None, cross_attention_kwargs=None,
): ):
""" """
Perform backward process given list of time steps Perform backward process given list of time steps.
Args: Args:
latents: Latents at time timesteps[0]. latents:
timesteps: time steps, along which to perform backward process. Latents at time timesteps[0].
prompt_embeds: Pre-generated text embeddings timesteps:
Time steps along which to perform backward process.
prompt_embeds:
Pre-generated text embeddings.
guidance_scale: guidance_scale:
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
extra_step_kwargs: extra_step_kwargs. extra_step_kwargs:
cross_attention_kwargs: cross_attention_kwargs. Extra keyword arguments to pass to the scheduler `step` call.
num_warmup_steps: number of warmup steps. cross_attention_kwargs:
A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
[`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
num_warmup_steps:
Number of warmup steps.
Returns: Returns:
latents: latents of backward process output at time timesteps[-1] latents:
Latents of backward process output at time timesteps[-1].
""" """
do_classifier_free_guidance = guidance_scale > 1.0 do_classifier_free_guidance = guidance_scale > 1.0
num_steps = (len(timesteps) - num_warmup_steps) // self.scheduler.order num_steps = (len(timesteps) - num_warmup_steps) // self.scheduler.order
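The loop body itself is folded in this diff; as a hedged sketch, one iteration follows the standard classifier-free-guidance denoising step used throughout the library (written here as it would appear inside the method; details may differ from the hidden code):
```py
# Sketch of one denoising iteration inside backward_loop.
for t in timesteps:
    # Duplicate the latents when doing classifier-free guidance.
    latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

    # Predict the noise residual.
    noise_pred = self.unet(
        latent_model_input,
        t,
        encoder_hidden_states=prompt_embeds,
        cross_attention_kwargs=cross_attention_kwargs,
    ).sample

    if do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Step the scheduler: x_t -> x_{t-1}.
    latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
```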
...@@ -421,53 +442,50 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline): ...@@ -421,53 +442,50 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline):
frame_ids: Optional[List[int]] = None, frame_ids: Optional[List[int]] = None,
): ):
""" """
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. video_length (`int`, *optional*, defaults to 8):
video_length (`int`, *optional*, defaults to 8): The number of generated video frames The number of generated video frames.
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in video generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`).
num_videos_per_prompt (`int`, *optional*, defaults to 1): num_videos_per_prompt (`int`, *optional*, defaults to 1):
The number of videos to generate per prompt. The number of videos to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"numpy"`): output_type (`str`, *optional*, defaults to `"numpy"`):
The output format of the generated image. Choose between `"latent"` and `"numpy"`. The output format of the generated video. Choose between `"latent"` and `"numpy"`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a
plain tuple. [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput`] instead of
a plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
motion_field_strength_x (`float`, *optional*, defaults to 12): motion_field_strength_x (`float`, *optional*, defaults to 12):
Strength of motion in generated video along x-axis. See the [paper](https://arxiv.org/abs/2303.13439), Strength of motion in generated video along x-axis. See the [paper](https://arxiv.org/abs/2303.13439),
Sect. 3.3.1. Sect. 3.3.1.
...@@ -485,10 +503,10 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline): ...@@ -485,10 +503,10 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline):
chunk-by-chunk. chunk-by-chunk.
Returns: Returns:
[`~pipelines.text_to_video_synthesis.TextToVideoPipelineOutput`]: [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput`]:
The output contains a ndarray of the generated images, when output_type != 'latent', otherwise a latent The output contains a `ndarray` of the generated video, when `output_type` != `"latent"`, otherwise a
codes of generated image, and a list of `bool`s denoting whether the corresponding generated image latent code of generated videos and a list of `bool`s indicating whether the corresponding generated
likely represents "not-safe-for-work" (nsfw) content, according to the `safety_checker`. video contains "not-safe-for-work" (nsfw) content.
""" """
assert video_length > 0 assert video_length > 0
if frame_ids is None: if frame_ids is None:
......
...@@ -33,33 +33,32 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -33,33 +33,32 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class UnCLIPPipeline(DiffusionPipeline): class UnCLIPPipeline(DiffusionPipeline):
""" """
Pipeline for text-to-image generation using unCLIP Pipeline for text-to-image generation using unCLIP.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
text_encoder ([`CLIPTextModelWithProjection`]): text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
Frozen text-encoder. Frozen text-encoder.
tokenizer (`CLIPTokenizer`): tokenizer ([`~transformers.CLIPTokenizer`]):
Tokenizer of class A `CLIPTokenizer` to tokenize text.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
prior ([`PriorTransformer`]): prior ([`PriorTransformer`]):
The canonincal unCLIP prior to approximate the image embedding from the text embedding. The canonical unCLIP prior to approximate the image embedding from the text embedding.
text_proj ([`UnCLIPTextProjModel`]): text_proj ([`UnCLIPTextProjModel`]):
Utility class to prepare and combine the embeddings before they are passed to the decoder. Utility class to prepare and combine the embeddings before they are passed to the decoder.
decoder ([`UNet2DConditionModel`]): decoder ([`UNet2DConditionModel`]):
The decoder to invert the image embedding into an image. The decoder to invert the image embedding into an image.
super_res_first ([`UNet2DModel`]): super_res_first ([`UNet2DModel`]):
Super resolution unet. Used in all but the last step of the super resolution diffusion process. Super resolution UNet. Used in all but the last step of the super resolution diffusion process.
super_res_last ([`UNet2DModel`]): super_res_last ([`UNet2DModel`]):
Super resolution unet. Used in the last step of the super resolution diffusion process. Super resolution UNet. Used in the last step of the super resolution diffusion process.
prior_scheduler ([`UnCLIPScheduler`]): prior_scheduler ([`UnCLIPScheduler`]):
Scheduler used in the prior denoising process. Just a modified DDPMScheduler. Scheduler used in the prior denoising process (a modified [`DDPMScheduler`]).
decoder_scheduler ([`UnCLIPScheduler`]): decoder_scheduler ([`UnCLIPScheduler`]):
Scheduler used in the decoder denoising process. Just a modified DDPMScheduler. Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]).
super_res_scheduler ([`UnCLIPScheduler`]): super_res_scheduler ([`UnCLIPScheduler`]):
Scheduler used in the super resolution denoising process. Just a modified DDPMScheduler. Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]).
""" """
...@@ -227,12 +226,12 @@ class UnCLIPPipeline(DiffusionPipeline): ...@@ -227,12 +226,12 @@ class UnCLIPPipeline(DiffusionPipeline):
return_dict: bool = True, return_dict: bool = True,
): ):
""" """
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`): prompt (`str` or `List[str]`):
The prompt or prompts to guide the image generation. This can only be left undefined if The prompt or prompts to guide image generation. This can only be left undefined if `text_model_output`
`text_model_output` and `text_attention_mask` is passed. and `text_attention_mask` are passed.
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
prior_num_inference_steps (`int`, *optional*, defaults to 25): prior_num_inference_steps (`int`, *optional*, defaults to 25):
...@@ -245,8 +244,8 @@ class UnCLIPPipeline(DiffusionPipeline): ...@@ -245,8 +244,8 @@ class UnCLIPPipeline(DiffusionPipeline):
The number of denoising steps for super resolution. More denoising steps usually lead to a higher The number of denoising steps for super resolution. More denoising steps usually lead to a higher
quality image at the expense of slower inference. quality image at the expense of slower inference.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
prior_latents (`torch.FloatTensor` of shape (batch size, embeddings dimension), *optional*): prior_latents (`torch.FloatTensor` of shape (batch size, embeddings dimension), *optional*):
Pre-generated noisy latents to be used as inputs for the prior. Pre-generated noisy latents to be used as inputs for the prior.
decoder_latents (`torch.FloatTensor` of shape (batch size, channels, height, width), *optional*): decoder_latents (`torch.FloatTensor` of shape (batch size, channels, height, width), *optional*):
...@@ -254,29 +253,27 @@ class UnCLIPPipeline(DiffusionPipeline): ...@@ -254,29 +253,27 @@ class UnCLIPPipeline(DiffusionPipeline):
super_res_latents (`torch.FloatTensor` of shape (batch size, channels, super res height, super res width), *optional*): super_res_latents (`torch.FloatTensor` of shape (batch size, channels, super res height, super res width), *optional*):
Pre-generated noisy latents to be used as inputs for the decoder. Pre-generated noisy latents to be used as inputs for the decoder.
prior_guidance_scale (`float`, *optional*, defaults to 4.0): prior_guidance_scale (`float`, *optional*, defaults to 4.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
decoder_guidance_scale (`float`, *optional*, defaults to 4.0): decoder_guidance_scale (`float`, *optional*, defaults to 4.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
text_model_output (`CLIPTextModelOutput`, *optional*): text_model_output (`CLIPTextModelOutput`, *optional*):
Pre-defined CLIPTextModel outputs that can be derived from the text encoder. Pre-defined text outputs Pre-defined [`CLIPTextModel`] outputs that can be derived from the text encoder. Pre-defined text
can be passed for tasks like text embedding interpolations. Make sure to also pass outputs can be passed for tasks like text embedding interpolations. Make sure to also pass
`text_attention_mask` in this case. `prompt` can the be left to `None`. `text_attention_mask` in this case. `prompt` can then be left `None`.
text_attention_mask (`torch.Tensor`, *optional*): text_attention_mask (`torch.Tensor`, *optional*):
Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention
masks are necessary when passing `text_model_output`. masks are necessary when passing `text_model_output`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
Returns:
[`~pipelines.ImagePipelineOutput`] or `tuple`:
If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
returned where the first element is a list with the generated images.
""" """
if prompt is not None: if prompt is not None:
if isinstance(prompt, str): if isinstance(prompt, str):
......
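A hedged sketch of how the per-stage arguments and the `generator` documented above interact, assuming `pipe` is the `UnCLIPPipeline` from the earlier sketch; the step counts and guidance scales are illustrative, not recommendations.

```py
import torch

# Seeded generator for reproducible sampling across the prior, decoder, and super-res stages.
generator = torch.Generator(device="cuda").manual_seed(0)
output = pipe(
    "an oil painting of a lighthouse at dawn",
    num_images_per_prompt=2,
    prior_num_inference_steps=25,
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
    prior_guidance_scale=4.0,
    decoder_guidance_scale=8.0,
    generator=generator,
    return_dict=False,
)
images = output[0]  # with return_dict=False, the first tuple element is the list of generated images
```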
...@@ -37,36 +37,32 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -37,36 +37,32 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class UnCLIPImageVariationPipeline(DiffusionPipeline): class UnCLIPImageVariationPipeline(DiffusionPipeline):
""" """
Pipeline to generate variations from an input image using unCLIP Pipeline to generate image variations from an input image using unCLIP.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
text_encoder ([`CLIPTextModelWithProjection`]): text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
Frozen text-encoder. Frozen text-encoder.
tokenizer (`CLIPTokenizer`): tokenizer ([`~transformers.CLIPTokenizer`]):
Tokenizer of class A `CLIPTokenizer` to tokenize text.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). feature_extractor ([`~transformers.CLIPImageProcessor`]):
feature_extractor ([`CLIPImageProcessor`]):
Model that extracts features from generated images to be used as inputs for the `image_encoder`. Model that extracts features from generated images to be used as inputs for the `image_encoder`.
image_encoder ([`CLIPVisionModelWithProjection`]): image_encoder ([`~transformers.CLIPVisionModelWithProjection`]):
Frozen CLIP image-encoder. unCLIP Image Variation uses the vision portion of Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModelWithProjection),
specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
text_proj ([`UnCLIPTextProjModel`]): text_proj ([`UnCLIPTextProjModel`]):
Utility class to prepare and combine the embeddings before they are passed to the decoder. Utility class to prepare and combine the embeddings before they are passed to the decoder.
decoder ([`UNet2DConditionModel`]): decoder ([`UNet2DConditionModel`]):
The decoder to invert the image embedding into an image. The decoder to invert the image embedding into an image.
super_res_first ([`UNet2DModel`]): super_res_first ([`UNet2DModel`]):
Super resolution unet. Used in all but the last step of the super resolution diffusion process. Super resolution UNet. Used in all but the last step of the super resolution diffusion process.
super_res_last ([`UNet2DModel`]): super_res_last ([`UNet2DModel`]):
Super resolution unet. Used in the last step of the super resolution diffusion process. Super resolution UNet. Used in the last step of the super resolution diffusion process.
decoder_scheduler ([`UnCLIPScheduler`]): decoder_scheduler ([`UnCLIPScheduler`]):
Scheduler used in the decoder denoising process. Just a modified DDPMScheduler. Scheduler used in the decoder denoising process (a modified [`DDPMScheduler`]).
super_res_scheduler ([`UnCLIPScheduler`]): super_res_scheduler ([`UnCLIPScheduler`]):
Scheduler used in the super resolution denoising process. Just a modified DDPMScheduler. Scheduler used in the super resolution denoising process (a modified [`DDPMScheduler`]).
""" """
decoder: UNet2DConditionModel decoder: UNet2DConditionModel
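A minimal image-variation sketch under the assumption that the `kakaobrain/karlo-v1-alpha-image-variations` checkpoint is used; the input URL is hypothetical.

```py
import torch

from diffusers import UnCLIPImageVariationPipeline
from diffusers.utils import load_image

# Assumed checkpoint and device.
pipe = UnCLIPImageVariationPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

init_image = load_image("https://example.com/cat.png")  # hypothetical URL; any RGB image works
variations = pipe(image=init_image, num_images_per_prompt=2).images
variations[0].save("variation.png")
```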
...@@ -214,14 +210,14 @@ class UnCLIPImageVariationPipeline(DiffusionPipeline): ...@@ -214,14 +210,14 @@ class UnCLIPImageVariationPipeline(DiffusionPipeline):
return_dict: bool = True, return_dict: bool = True,
): ):
""" """
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`): image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`):
The image or images to guide the image generation. If you provide a tensor, it needs to comply with the `Image` or tensor representing an image batch to be used as the starting point. If you provide a
configuration of tensor, it needs to be compatible with the [`CLIPImageProcessor`]
[this](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json) [configuration](https://huggingface.co/fusing/karlo-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json).
`CLIPImageProcessor`. Can be left to `None` only when `image_embeddings` are passed. Can be left as `None` only when `image_embeddings` are passed.
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
decoder_num_inference_steps (`int`, *optional*, defaults to 25): decoder_num_inference_steps (`int`, *optional*, defaults to 25):
...@@ -231,26 +227,27 @@ class UnCLIPImageVariationPipeline(DiffusionPipeline): ...@@ -231,26 +227,27 @@ class UnCLIPImageVariationPipeline(DiffusionPipeline):
The number of denoising steps for super resolution. More denoising steps usually lead to a higher The number of denoising steps for super resolution. More denoising steps usually lead to a higher
quality image at the expense of slower inference. quality image at the expense of slower inference.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
decoder_latents (`torch.FloatTensor` of shape (batch size, channels, height, width), *optional*): decoder_latents (`torch.FloatTensor` of shape (batch size, channels, height, width), *optional*):
Pre-generated noisy latents to be used as inputs for the decoder. Pre-generated noisy latents to be used as inputs for the decoder.
super_res_latents (`torch.FloatTensor` of shape (batch size, channels, super res height, super res width), *optional*): super_res_latents (`torch.FloatTensor` of shape (batch size, channels, super res height, super res width), *optional*):
Pre-generated noisy latents to be used as inputs for the decoder. Pre-generated noisy latents to be used as inputs for the decoder.
decoder_guidance_scale (`float`, *optional*, defaults to 4.0): decoder_guidance_scale (`float`, *optional*, defaults to 4.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
image_embeddings (`torch.Tensor`, *optional*): image_embeddings (`torch.Tensor`, *optional*):
Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings
can be passed for tasks like image interpolations. `image` can the be left to `None`. can be passed for tasks like image interpolations. `image` can be left as `None`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
Returns:
[`~pipelines.ImagePipelineOutput`] or `tuple`:
If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
returned where the first element is a list with the generated images.
""" """
if image is not None: if image is not None:
if isinstance(image, PIL.Image.Image): if isinstance(image, PIL.Image.Image):
......
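The `image_embeddings` argument documented above enables interpolation-style workflows. The sketch below is one possible way to do this, assuming `pipe` is the image-variation pipeline from the earlier sketch and `image_a`/`image_b` are PIL images; the helper name and blend weights are purely illustrative.

```py
import torch

# Assumes `pipe` is the UnCLIPImageVariationPipeline loaded above and that
# `image_a` and `image_b` are PIL images; `clip_embed` is a hypothetical helper.
def clip_embed(image):
    pixel_values = pipe.feature_extractor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device=pipe.device, dtype=pipe.image_encoder.dtype)
    with torch.no_grad():
        return pipe.image_encoder(pixel_values).image_embeds

emb_a, emb_b = clip_embed(image_a), clip_embed(image_b)
# Simple linear blend of the two CLIP image embeddings (spherical interpolation is another option).
blended = 0.5 * emb_a + 0.5 * emb_b

interpolated = pipe(image=None, image_embeddings=blended).images[0]
```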
...@@ -81,43 +81,33 @@ class ImageTextPipelineOutput(BaseOutput): ...@@ -81,43 +81,33 @@ class ImageTextPipelineOutput(BaseOutput):
class UniDiffuserPipeline(DiffusionPipeline): class UniDiffuserPipeline(DiffusionPipeline):
r""" r"""
Pipeline for a bimodal image-text [UniDiffuser](https://arxiv.org/pdf/2303.06555.pdf) model, which supports Pipeline for a bimodal image-text model which supports unconditional text and image generation, text-conditioned
unconditional text and image generation, text-conditioned image generation, image-conditioned text generation, and image generation, image-conditioned text generation, and joint image-text generation.
joint image-text generation.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. This Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. This
is part of the UniDiffuser image representation, along with the CLIP vision encoding. is part of the UniDiffuser image representation along with the CLIP vision encoding.
text_encoder ([`CLIPTextModel`]): text_encoder ([`CLIPTextModel`]):
Frozen text-encoder. Similar to Stable Diffusion, UniDiffuser uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel) to encode text
prompts.
image_encoder ([`CLIPVisionModel`]): image_encoder ([`CLIPVisionModel`]):
UniDiffuser uses the vision portion of A [`~transformers.CLIPVisionModel`] to encode images as part of its image representation along with the VAE
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModel) to encode latent representation.
images as part of its image representation, along with the VAE latent representation.
image_processor ([`CLIPImageProcessor`]): image_processor ([`CLIPImageProcessor`]):
CLIP image processor of class [`~transformers.CLIPImageProcessor`] to preprocess an image before CLIP encoding it with `image_encoder`.
[CLIPImageProcessor](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPImageProcessor),
used to preprocess the image before CLIP encoding it with `image_encoder`.
clip_tokenizer ([`CLIPTokenizer`]): clip_tokenizer ([`CLIPTokenizer`]):
Tokenizer of class A [`~transformers.CLIPTokenizer`] to tokenize the prompt before encoding it with `text_encoder`.
[CLIPTokenizer](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTokenizer) which
is used to tokenizer a prompt before encoding it with `text_encoder`.
text_decoder ([`UniDiffuserTextDecoder`]): text_decoder ([`UniDiffuserTextDecoder`]):
Frozen text decoder. This is a GPT-style model which is used to generate text from the UniDiffuser Frozen text decoder. This is a GPT-style model which is used to generate text from the UniDiffuser
embedding. embedding.
text_tokenizer ([`GPT2Tokenizer`]): text_tokenizer ([`GPT2Tokenizer`]):
Tokenizer of class A [`~transformers.GPT2Tokenizer`] to decode text for text generation; used along with the `text_decoder`.
[GPT2Tokenizer](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Tokenizer) which
is used along with the `text_decoder` to decode text for text generation.
unet ([`UniDiffuserModel`]): unet ([`UniDiffuserModel`]):
UniDiffuser uses a [U-ViT](https://github.com/baofff/U-ViT) model architecture, which is similar to a A [U-ViT](https://github.com/baofff/U-ViT) model with UNet-style skip connections between transformer
[`Transformer2DModel`] with U-Net-style skip connections between transformer layers. layers to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image and/or text latents. The A scheduler to be used in combination with `unet` to denoise the encoded image and/or text latents. The
original UniDiffuser paper uses the [`DPMSolverMultistepScheduler`] scheduler. original UniDiffuser paper uses the [`DPMSolverMultistepScheduler`] scheduler.
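A minimal loading and text-to-image sketch for this pipeline, assuming the `thu-ml/unidiffuser-v1` checkpoint referenced in the call docstring and a CUDA device; the prompt and step counts are illustrative.

```py
import torch

from diffusers import UniDiffuserPipeline

# Assumed device and dtype; the checkpoint is the UniDiffuser-v1 weights referenced below.
pipe = UniDiffuserPipeline.from_pretrained("thu-ml/unidiffuser-v1", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Passing only a prompt selects text-conditioned image generation (text2img) mode.
sample = pipe(prompt="an elephant under the sea", num_inference_steps=20, guidance_scale=8.0)
sample.images[0].save("unidiffuser_text2img.png")
```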
...@@ -1062,14 +1052,14 @@ class UniDiffuserPipeline(DiffusionPipeline): ...@@ -1062,14 +1052,14 @@ class UniDiffuserPipeline(DiffusionPipeline):
callback_steps: int = 1, callback_steps: int = 1,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. Required for text-conditioned image generation (`text2img`) mode. Required for text-conditioned image generation (`text2img`) mode.
image (`torch.FloatTensor` or `PIL.Image.Image`, *optional*): image (`torch.FloatTensor` or `PIL.Image.Image`, *optional*):
`Image`, or tensor representing an image batch. Required for image-conditioned text generation `Image` or tensor representing an image batch. Required for image-conditioned text generation
(`img2text`) mode. (`img2text`) mode.
height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
...@@ -1077,78 +1067,74 @@ class UniDiffuserPipeline(DiffusionPipeline): ...@@ -1077,78 +1067,74 @@ class UniDiffuserPipeline(DiffusionPipeline):
The width in pixels of the generated image. The width in pixels of the generated image.
data_type (`int`, *optional*, defaults to 1): data_type (`int`, *optional*, defaults to 1):
The data type (either 0 or 1). Only used if you are loading a checkpoint which supports a data type The data type (either 0 or 1). Only used if you are loading a checkpoint which supports a data type
embedding; this is added for compatibility with the UniDiffuser-v1 checkpoint. embedding; this is added for compatibility with the
[UniDiffuser-v1](https://huggingface.co/thu-ml/unidiffuser-v1) checkpoint.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 8.0): guidance_scale (`float`, *optional*, defaults to 8.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality. Note that the original [UniDiffuser
paper](https://arxiv.org/pdf/2303.06555.pdf) uses a different definition of the guidance scale `w'`,
which satisfies `w = w' + 1`.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`). Used in
less than `1`). Used in text-conditioned image generation (`text2img`) mode. text-conditioned image generation (`text2img`) mode.
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. Used in `text2img` (text-conditioned image generation) and The number of images to generate per prompt. Used in `text2img` (text-conditioned image generation) and
`img` mode. If the mode is joint and both `num_images_per_prompt` and `num_prompts_per_image` are `img` mode. If the mode is joint and both `num_images_per_prompt` and `num_prompts_per_image` are
supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples will be generated. supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples are generated.
num_prompts_per_image (`int`, *optional*, defaults to 1): num_prompts_per_image (`int`, *optional*, defaults to 1):
The number of prompts to generate per image. Used in `img2text` (image-conditioned text generation) and The number of prompts to generate per image. Used in `img2text` (image-conditioned text generation) and
`text` mode. If the mode is joint and both `num_images_per_prompt` and `num_prompts_per_image` are `text` mode. If the mode is joint and both `num_images_per_prompt` and `num_prompts_per_image` are
supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples will be generated. supplied, `min(num_images_per_prompt, num_prompts_per_image)` samples are generated.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for joint Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for joint
image-text generation. Can be used to tweak the same generation with different prompts. If not image-text generation. Can be used to tweak the same generation with different prompts. If not
provided, a latents tensor will be generated by sampling using the supplied random `generator`. Note provided, a latents tensor is generated by sampling using the supplied random `generator`. This is
that this is assumed to be a full set of VAE, CLIP, and text latents, if supplied, this will override assumed to be a full set of VAE, CLIP, and text latents and, if supplied, overrides the value of
the value of `prompt_latents`, `vae_latents`, and `clip_latents`. `prompt_latents`, `vae_latents`, and `clip_latents`.
prompt_latents (`torch.FloatTensor`, *optional*): prompt_latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for text Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for text
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
vae_latents (`torch.FloatTensor`, *optional*): vae_latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
clip_latents (`torch.FloatTensor`, *optional*): clip_latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will be generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. Used in text-conditioned provided, text embeddings are generated from the `prompt` input argument. Used in text-conditioned
image generation (`text2img`) mode. image generation (`text2img`) mode.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. Used
argument. Used in text-conditioned image generation (`text2img`) mode. in text-conditioned image generation (`text2img`) mode.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.unidiffuser.ImageTextPipelineOutput`] instead of a plain tuple. Whether or not to return a [`~pipelines.ImageTextPipelineOutput`] instead of a plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Returns: Returns:
[`~pipelines.unidiffuser.ImageTextPipelineOutput`] or `tuple`: [`~pipelines.unidiffuser.ImageTextPipelineOutput`] or `tuple`:
[`pipelines.unidiffuser.ImageTextPipelineOutput`] if `return_dict` is True, otherwise a `tuple`. When If `return_dict` is `True`, [`~pipelines.unidiffuser.ImageTextPipelineOutput`] is returned, otherwise a
returning a tuple, the first element is a list with the generated images, and the second element is a list `tuple` is returned where the first element is a list with the generated images and the second element
of generated texts. is a list of generated texts.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
......
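Building on the sketch above, joint image-text generation can be exercised by passing neither `prompt` nor `image`; this is a hedged example assuming the mode is inferred from the (absent) inputs.

```py
# Assumes `pipe` is the UniDiffuserPipeline loaded above. With neither `prompt` nor `image`
# supplied, the pipeline samples both modalities (joint mode).
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]  # generated image
text = sample.text[0]     # generated caption
```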
...@@ -21,28 +21,27 @@ class VersatileDiffusionPipeline(DiffusionPipeline): ...@@ -21,28 +21,27 @@ class VersatileDiffusionPipeline(DiffusionPipeline):
r""" r"""
Pipeline for text-to-image generation using Stable Diffusion. Pipeline for text-to-image generation using Versatile Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionMegaSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
tokenizer: CLIPTokenizer tokenizer: CLIPTokenizer
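A minimal loading sketch for the wrapper pipeline whose `image_variation`, `text_to_image`, and dual-guided methods are documented below; the `shi-labs/versatile-diffusion` checkpoint and CUDA device are assumptions.

```py
import torch

from diffusers import VersatileDiffusionPipeline

# Assumed checkpoint and device.
pipe = VersatileDiffusionPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
```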
...@@ -98,51 +97,47 @@ class VersatileDiffusionPipeline(DiffusionPipeline): ...@@ -98,51 +97,47 @@ class VersatileDiffusionPipeline(DiffusionPipeline):
callback_steps: int = 1, callback_steps: int = 1,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
image (`PIL.Image.Image`, `List[PIL.Image.Image]` or `torch.Tensor`): image (`PIL.Image.Image`, `List[PIL.Image.Image]` or `torch.Tensor`):
The image prompt or prompts to guide the image generation. The image prompt or prompts to guide the image generation.
height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored The prompt or prompts to guide what to not include in image generation. If not defined, you need to
if `guidance_scale` is less than `1`). pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
...@@ -171,10 +166,10 @@ class VersatileDiffusionPipeline(DiffusionPipeline): ...@@ -171,10 +166,10 @@ class VersatileDiffusionPipeline(DiffusionPipeline):
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
expected_components = inspect.signature(VersatileDiffusionImageVariationPipeline.__init__).parameters.keys() expected_components = inspect.signature(VersatileDiffusionImageVariationPipeline.__init__).parameters.keys()
components = {name: component for name, component in self.components.items() if name in expected_components} components = {name: component for name, component in self.components.items() if name in expected_components}
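A short usage sketch for the method documented above, assuming `pipe` is the `VersatileDiffusionPipeline` from the earlier sketch; the input URL is hypothetical.

```py
from diffusers.utils import load_image

init_image = load_image("https://example.com/input.jpg")  # hypothetical URL
result = pipe.image_variation(init_image, num_inference_steps=50, guidance_scale=7.5)
result.images[0].save("variation.png")
```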
...@@ -214,51 +209,47 @@ class VersatileDiffusionPipeline(DiffusionPipeline): ...@@ -214,51 +209,47 @@ class VersatileDiffusionPipeline(DiffusionPipeline):
callback_steps: int = 1, callback_steps: int = 1,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`): prompt (`str` or `List[str]`):
The prompt or prompts to guide the image generation. The prompt or prompts to guide image generation.
height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored The prompt or prompts to guide what to not include in image generation. If not defined, you need to
if `guidance_scale` is less than `1`). pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
...@@ -278,10 +269,10 @@ class VersatileDiffusionPipeline(DiffusionPipeline): ...@@ -278,10 +269,10 @@ class VersatileDiffusionPipeline(DiffusionPipeline):
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
expected_components = inspect.signature(VersatileDiffusionTextToImagePipeline.__init__).parameters.keys() expected_components = inspect.signature(VersatileDiffusionTextToImagePipeline.__init__).parameters.keys()
components = {name: component for name, component in self.components.items() if name in expected_components} components = {name: component for name, component in self.components.items() if name in expected_components}
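A sketch of the method above that also exercises the documented `callback`/`callback_steps` and `generator` arguments, assuming `pipe` from the earlier sketch; the prompt and logging behaviour are illustrative.

```py
import torch

def log_progress(step: int, timestep: int, latents: torch.FloatTensor):
    # Matches the documented callback signature; here it only prints progress.
    print(f"step {step}, timestep {timestep}, latents shape {tuple(latents.shape)}")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe.text_to_image(
    "an astronaut riding a horse on mars",
    guidance_scale=7.5,
    generator=generator,
    callback=log_progress,
    callback_steps=10,
).images[0]
image.save("astronaut.png")
```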
...@@ -327,51 +318,47 @@ class VersatileDiffusionPipeline(DiffusionPipeline): ...@@ -327,51 +318,47 @@ class VersatileDiffusionPipeline(DiffusionPipeline):
callback_steps: int = 1, callback_steps: int = 1,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`): prompt (`str` or `List[str]`):
The prompt or prompts to guide the image generation. The prompt or prompts to guide image generation.
height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored The prompt or prompts to guide what to not include in image generation. If not defined, you need to
if `guidance_scale` is less than `1`). pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function called every `callback_steps` steps during inference. The function is called with the
following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
...@@ -404,9 +391,9 @@ class VersatileDiffusionPipeline(DiffusionPipeline): ...@@ -404,9 +391,9 @@ class VersatileDiffusionPipeline(DiffusionPipeline):
``` ```
Returns: Returns:
[`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: [`~pipelines.ImagePipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple. When If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
returning a tuple, the first element is a list with the generated images. returned where the first element is a list with the generated images.
""" """
expected_components = inspect.signature(VersatileDiffusionDualGuidedPipeline.__init__).parameters.keys() expected_components = inspect.signature(VersatileDiffusionDualGuidedPipeline.__init__).parameters.keys()
......
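The `guidance_scale` argument documented above follows the usual classifier-free guidance recipe: the model is run with and without the text condition and the two noise predictions are blended. A minimal standalone sketch (the tensors are random stand-ins, not actual UNet outputs):

```py
import torch

# Stand-ins for the unconditional and text-conditioned UNet noise predictions.
noise_pred_uncond = torch.randn(1, 4, 64, 64)
noise_pred_text = torch.randn(1, 4, 64, 64)

# guidance_scale > 1 extrapolates toward the text-conditioned prediction,
# trading some image quality for closer adherence to the prompt.
guidance_scale = 7.5
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```

With `guidance_scale <= 1` there is nothing to extrapolate, which is why the docstrings describe guidance as enabled only when `guidance_scale > 1` and pipelines typically skip the unconditional pass in that case.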
...@@ -40,18 +40,20 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -40,18 +40,20 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline): class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline):
r""" r"""
Pipeline for image-text dual-guided generation using Versatile Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Parameters: Parameters:
vqvae ([`VQModel`]): vqvae ([`VQModel`]):
Vector-quantized (VQ) Model to encode and decode images to and from latent representations. Vector-quantized (VQ) model to encode and decode images to and from latent representations.
bert ([`LDMBertModel`]): bert ([`LDMBertModel`]):
Text-encoder model based on [BERT](https://huggingface.co/docs/transformers/model_doc/bert) architecture. Text-encoder model based on [`~transformers.BERT`].
tokenizer (`transformers.BertTokenizer`): tokenizer ([`~transformers.BertTokenizer`]):
A `BertTokenizer` to tokenize text.
unet ([`UNet2DConditionModel`]):
A `UNet2DConditionModel` to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
...@@ -395,51 +397,46 @@ class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline): ...@@ -395,51 +397,46 @@ class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline):
**kwargs, **kwargs,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`): prompt (`str` or `List[str]`):
The prompt or prompts to guide the image generation. The prompt or prompts to guide image generation.
height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
A higher guidance scale value encourages the model to generate images closely linked to the text
`prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored The prompt or prompts to guide what to not include in image generation. If not defined, you need to
if `guidance_scale` is less than `1`). pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between `PIL.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function called every `callback_steps` steps during inference. The function is called with the
following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
...@@ -473,9 +470,9 @@ class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline): ...@@ -473,9 +470,9 @@ class VersatileDiffusionDualGuidedPipeline(DiffusionPipeline):
``` ```
Returns: Returns:
[`~pipelines.stable_diffusion.ImagePipelineOutput`] or `tuple`: [`~pipelines.ImagePipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.ImagePipelineOutput`] if `return_dict` is True, otherwise a `tuple. When If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
returning a tuple, the first element is a list with the generated images. returned where the first element is a list with the generated images.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.image_unet.config.sample_size * self.vae_scale_factor height = height or self.image_unet.config.sample_size * self.vae_scale_factor
......
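A minimal usage sketch for the dual-guided pipeline above, wiring up the documented `callback`/`callback_steps` arguments. The checkpoint id, the image URL, and the assumption that the call takes both a text `prompt` and a conditioning `image` are illustrative, not taken from this diff:

```py
import torch
from diffusers import VersatileDiffusionDualGuidedPipeline
from diffusers.utils import load_image

# Assumed checkpoint id; substitute the one you actually use.
pipe = VersatileDiffusionDualGuidedPipeline.from_pretrained("shi-labs/versatile-diffusion")
pipe = pipe.to("cuda")

def log_progress(step: int, timestep: int, latents: torch.FloatTensor):
    # Matches the documented callback signature; invoked every `callback_steps` steps.
    print(f"step {step} (timestep {timestep}), latents {tuple(latents.shape)}")

image = load_image("https://example.com/reference.png")  # placeholder URL
result = pipe(
    prompt="a painting in the style of the reference image",
    image=image,  # assumed argument name for the image prompt
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=torch.Generator(device="cuda").manual_seed(0),
    callback=log_progress,
    callback_steps=10,
)
result.images[0].save("dual_guided.png")
```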
...@@ -34,18 +34,20 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -34,18 +34,20 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class VersatileDiffusionImageVariationPipeline(DiffusionPipeline): class VersatileDiffusionImageVariationPipeline(DiffusionPipeline):
r""" r"""
Pipeline for image variation using Versatile Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Parameters: Parameters:
vqvae ([`VQModel`]): vqvae ([`VQModel`]):
Vector-quantized (VQ) Model to encode and decode images to and from latent representations. Vector-quantized (VQ) model to encode and decode images to and from latent representations.
bert ([`LDMBertModel`]): bert ([`LDMBertModel`]):
Text-encoder model based on [BERT](https://huggingface.co/docs/transformers/model_doc/bert) architecture. Text-encoder model based on [`~transformers.BERT`].
tokenizer (`transformers.BertTokenizer`): tokenizer ([`~transformers.BertTokenizer`]):
A `BertTokenizer` to tokenize text.
unet ([`UNet2DConditionModel`]):
A `UNet2DConditionModel` to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
...@@ -247,51 +249,47 @@ class VersatileDiffusionImageVariationPipeline(DiffusionPipeline): ...@@ -247,51 +249,47 @@ class VersatileDiffusionImageVariationPipeline(DiffusionPipeline):
**kwargs, **kwargs,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
image (`PIL.Image.Image`, `List[PIL.Image.Image]` or `torch.Tensor`): image (`PIL.Image.Image`, `List[PIL.Image.Image]` or `torch.Tensor`):
The image prompt or prompts to guide the image generation. The image prompt or prompts to guide the image generation.
height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
A higher guidance scale value encourages the model to generate images closely linked to the text
`prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored The prompt or prompts to guide what to not include in image generation. If not defined, you need to
if `guidance_scale` is less than `1`). pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between `PIL.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function called every `callback_steps` steps during inference. The function is called with the
following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
...@@ -320,10 +318,8 @@ class VersatileDiffusionImageVariationPipeline(DiffusionPipeline): ...@@ -320,10 +318,8 @@ class VersatileDiffusionImageVariationPipeline(DiffusionPipeline):
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
otherwise a `tuple` is returned where the first element is a list with the generated images.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.image_unet.config.sample_size * self.vae_scale_factor height = height or self.image_unet.config.sample_size * self.vae_scale_factor
......
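For reference, a sketch of how the image-variation pipeline documented above might be called; the checkpoint id and the image URL are placeholders rather than values from this diff:

```py
import torch
from diffusers import VersatileDiffusionImageVariationPipeline
from diffusers.utils import load_image

pipe = VersatileDiffusionImageVariationPipeline.from_pretrained("shi-labs/versatile-diffusion")  # assumed checkpoint
pipe = pipe.to("cuda")

image = load_image("https://example.com/input.png")  # placeholder input image
generator = torch.Generator(device="cuda").manual_seed(0)  # fixed seed for reproducibility

output = pipe(image=image, num_inference_steps=50, guidance_scale=7.5, generator=generator)
output.images[0].save("variation.png")
```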
...@@ -33,18 +33,20 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -33,18 +33,20 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class VersatileDiffusionTextToImagePipeline(DiffusionPipeline): class VersatileDiffusionTextToImagePipeline(DiffusionPipeline):
r""" r"""
Pipeline for text-to-image generation using Versatile Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Parameters: Parameters:
vqvae ([`VQModel`]): vqvae ([`VQModel`]):
Vector-quantized (VQ) Model to encode and decode images to and from latent representations. Vector-quantized (VQ) model to encode and decode images to and from latent representations.
bert ([`LDMBertModel`]): bert ([`LDMBertModel`]):
Text-encoder model based on [BERT](https://huggingface.co/docs/transformers/model_doc/bert) architecture. Text-encoder model based on [`~transformers.BERT`].
tokenizer (`transformers.BertTokenizer`): tokenizer ([`~transformers.BertTokenizer`]):
A `BertTokenizer` to tokenize text.
unet ([`UNet2DConditionModel`]):
A `UNet2DConditionModel` to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
...@@ -329,51 +331,47 @@ class VersatileDiffusionTextToImagePipeline(DiffusionPipeline): ...@@ -329,51 +331,47 @@ class VersatileDiffusionTextToImagePipeline(DiffusionPipeline):
**kwargs, **kwargs,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`): prompt (`str` or `List[str]`):
The prompt or prompts to guide the image generation. The prompt or prompts to guide image generation.
height (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.image_unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.image_unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
A higher guidance scale value encourages the model to generate images closely linked to the text
`prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored The prompt or prompts to guide what to not include in image generation. If not defined, you need to
if `guidance_scale` is less than `1`). pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between `PIL.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function called every `callback_steps` steps during inference. The function is called with the
following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
...@@ -394,10 +392,8 @@ class VersatileDiffusionTextToImagePipeline(DiffusionPipeline): ...@@ -394,10 +392,8 @@ class VersatileDiffusionTextToImagePipeline(DiffusionPipeline):
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
otherwise a `tuple` is returned where the first element is a list with the generated images.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.image_unet.config.sample_size * self.vae_scale_factor height = height or self.image_unet.config.sample_size * self.vae_scale_factor
......
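The `latents` argument documented above can be pre-generated once and reused to tweak the same generation with different prompts. A sketch, assuming a checkpoint with a 4-channel 64x64 latent space (512x512 output with a VAE scale factor of 8):

```py
import torch
from diffusers import VersatileDiffusionTextToImagePipeline

pipe = VersatileDiffusionTextToImagePipeline.from_pretrained("shi-labs/versatile-diffusion")  # assumed checkpoint
pipe = pipe.to("cuda")

# Pre-generate the starting noise so that only the prompt changes between runs.
generator = torch.Generator(device="cuda").manual_seed(0)
latents = torch.randn(
    (1, pipe.image_unet.config.in_channels, 64, 64),  # assumed latent shape
    generator=generator,
    device="cuda",
)

image_a = pipe("a red sports car", latents=latents).images[0]
image_b = pipe("a blue sports car", negative_prompt="blurry", latents=latents).images[0]
image_a.save("red.png")
image_b.save("blue.png")
```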
...@@ -51,24 +51,21 @@ class LearnedClassifierFreeSamplingEmbeddings(ModelMixin, ConfigMixin): ...@@ -51,24 +51,21 @@ class LearnedClassifierFreeSamplingEmbeddings(ModelMixin, ConfigMixin):
class VQDiffusionPipeline(DiffusionPipeline): class VQDiffusionPipeline(DiffusionPipeline):
r""" r"""
Pipeline for text-to-image generation using VQ Diffusion Pipeline for text-to-image generation using VQ Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vqvae ([`VQModel`]): vqvae ([`VQModel`]):
Vector Quantized Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent Vector Quantized Variational Auto-Encoder (VAE) model to encode and decode images to and from latent
representations. representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder ([clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)).
tokenizer ([`~transformers.CLIPTokenizer`]):
A `CLIPTokenizer` to tokenize text.
transformer ([`Transformer2DModel`]): transformer ([`Transformer2DModel`]):
Conditional transformer to denoise the encoded image latents. A conditional `Transformer2DModel` to denoise the encoded image latents.
scheduler ([`VQDiffusionScheduler`]): scheduler ([`VQDiffusionScheduler`]):
A scheduler to be used in combination with `transformer` to denoise the encoded image latents. A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
""" """
...@@ -179,20 +176,17 @@ class VQDiffusionPipeline(DiffusionPipeline): ...@@ -179,20 +176,17 @@ class VQDiffusionPipeline(DiffusionPipeline):
callback_steps: int = 1, callback_steps: int = 1,
) -> Union[ImagePipelineOutput, Tuple]: ) -> Union[ImagePipelineOutput, Tuple]:
""" """
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`): prompt (`str` or `List[str]`):
The prompt or prompts to guide the image generation. The prompt or prompts to guide image generation.
num_inference_steps (`int`, *optional*, defaults to 100): num_inference_steps (`int`, *optional*, defaults to 100):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
A higher guidance scale value encourages the model to generate images closely linked to the text
`prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
truncation_rate (`float`, *optional*, defaults to 1.0 (equivalent to no truncation)): truncation_rate (`float`, *optional*, defaults to 1.0 (equivalent to no truncation)):
Used to "truncate" the predicted classes for x_0 such that the cumulative probability for a pixel is at Used to "truncate" the predicted classes for x_0 such that the cumulative probability for a pixel is at
most `truncation_rate`. The lowest probabilities that would increase the cumulative probability above most `truncation_rate`. The lowest probabilities that would increase the cumulative probability above
...@@ -200,27 +194,27 @@ class VQDiffusionPipeline(DiffusionPipeline): ...@@ -200,27 +194,27 @@ class VQDiffusionPipeline(DiffusionPipeline):
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor` of shape (batch), *optional*): latents (`torch.FloatTensor` of shape (batch), *optional*):
Pre-generated noisy latents to be used as inputs for image generation. Must be valid embedding indices.
Can be used to tweak the same generation with different prompts. If not provided, a latents tensor of
completely masked latent pixels is generated.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generated image. Choose between `PIL.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple. Whether or not to return a [`~pipelines.ImagePipelineOutput`] instead of a plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function called every `callback_steps` steps during inference. The function is called with the
following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Returns: Returns:
[`~pipelines.ImagePipelineOutput`] or `tuple`:
If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
returned where the first element is a list with the generated images.
""" """
if isinstance(prompt, str): if isinstance(prompt, str):
batch_size = 1 batch_size = 1
...@@ -309,8 +303,9 @@ class VQDiffusionPipeline(DiffusionPipeline): ...@@ -309,8 +303,9 @@ class VQDiffusionPipeline(DiffusionPipeline):
def truncate(self, log_p_x_0: torch.FloatTensor, truncation_rate: float) -> torch.FloatTensor: def truncate(self, log_p_x_0: torch.FloatTensor, truncation_rate: float) -> torch.FloatTensor:
""" """
Truncates `log_p_x_0` such that for each column vector, the total cumulative probability is at most
`truncation_rate`. The lowest probabilities that would increase the cumulative probability above
`truncation_rate` are set to zero.
""" """
sorted_log_p_x_0, indices = torch.sort(log_p_x_0, 1, descending=True) sorted_log_p_x_0, indices = torch.sort(log_p_x_0, 1, descending=True)
sorted_p_x_0 = torch.exp(sorted_log_p_x_0) sorted_p_x_0 = torch.exp(sorted_log_p_x_0)
......
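To make the truncation performed by `truncate` concrete, here is a small standalone illustration of the idea (1-D only, and not the library's exact implementation): sort the class probabilities in descending order, keep classes while the cumulative probability stays within `truncation_rate`, and zero out the rest.

```py
import torch

def truncate_probs(probs: torch.Tensor, truncation_rate: float) -> torch.Tensor:
    # Sort class probabilities from most to least likely.
    sorted_probs, indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    # Drop the lowest probabilities that would push the cumulative probability
    # above `truncation_rate`, but always keep the most likely class.
    keep_sorted = cumulative <= truncation_rate
    keep_sorted[0] = True
    mask = torch.zeros_like(probs)
    mask[indices[keep_sorted]] = 1.0
    return probs * mask

probs = torch.tensor([0.05, 0.60, 0.25, 0.10])
print(truncate_probs(probs, truncation_rate=0.9))
# tensor([0.0000, 0.6000, 0.2500, 0.0000])
```

A real implementation works on per-pixel log-probabilities (as the `torch.sort`/`torch.exp` lines above suggest) and likewise needs to guarantee that at least one class survives per pixel.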