Unverified commit a69754bb, authored by Steven Liu and committed by GitHub

[docs] Clean up pipeline apis (#3905)

* start with stable diffusion

* fix

* finish stable diffusion pipelines

* fix path to pipeline output

* fix flax paths

* fix copies

* add up to score sde ve

* finish first pass of pipelines

* fix copies

* second review

* align doc titles

* more review fixes

* final review
parent bcc570b9
@@ -80,31 +80,30 @@ EXAMPLE_DOC_STRING = """
class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline):
    r"""
    Flax-based pipeline for text-to-image generation using Stable Diffusion.

    This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        vae ([`FlaxAutoencoderKL`]):
            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
        text_encoder ([`~transformers.FlaxCLIPTextModel`]):
            Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        unet ([`FlaxUNet2DConditionModel`]):
            A `FlaxUNet2DConditionModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or
            [`FlaxDPMSolverMultistepScheduler`].
        safety_checker ([`FlaxStableDiffusionSafetyChecker`]):
            Classification module that estimates whether generated images could be considered offensive or harmful.
            Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
            about a model's potential harms.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
    """

    def __init__(
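
Editor's note: to make the pmap-based workflow documented above concrete, here is a minimal usage sketch for this Flax pipeline. The checkpoint name, prompt, and seed are illustrative, and the snippet assumes a multi-device JAX setup (for example a TPU VM); it mirrors the pattern from the pipeline's example docstring rather than defining any new API.

```py
import jax
import jax.numpy as jnp
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard

from diffusers import FlaxStableDiffusionPipeline

# Load the pipeline and its parameters (checkpoint name is illustrative).
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", revision="bf16", dtype=jnp.bfloat16
)

# One prompt per device; prepare_inputs tokenizes the prompts into prompt_ids.
num_devices = jax.device_count()
prompt_ids = pipeline.prepare_inputs(["a photo of an astronaut riding a horse"] * num_devices)

# Replicate the parameters and shard the inputs and RNG key across devices.
params = replicate(params)
prng_seed = jax.random.split(jax.random.PRNGKey(0), num_devices)
prompt_ids = shard(prompt_ids)

# jit=True runs the pmap'd generation and safety-scoring functions.
images = pipeline(prompt_ids, params, prng_seed, num_inference_steps=50, jit=True).images

# Collapse the (devices, per-device batch) leading axes and convert to PIL.
images = np.asarray(images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:]))
pil_images = pipeline.numpy_to_pil(images)
```
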
@@ -324,31 +323,35 @@ class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline):
        jit: bool = False,
    ):
        r"""
        The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide image generation.
            height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            latents (`jnp.array`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a
                latents array is generated by sampling using the supplied random `generator`.
            jit (`bool`, defaults to `False`):
                Whether to run `pmap` versions of the generation and safety scoring functions.

                <Tip warning={true}>

                This argument exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a
                future release.

                </Tip>

            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] instead
                of a plain tuple.
@@ -357,10 +360,10 @@ class FlaxStableDiffusionPipeline(FlaxDiffusionPipeline):
        Returns:
            [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] is
                returned, otherwise a `tuple` is returned where the first element is a list with the generated images
                and the second element is a list of `bool`s indicating whether the corresponding generated image
                contains "not-safe-for-work" (nsfw) content.
        """
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
@@ -104,31 +104,30 @@ EXAMPLE_DOC_STRING = """
class FlaxStableDiffusionImg2ImgPipeline(FlaxDiffusionPipeline):
    r"""
    Flax-based pipeline for text-guided image-to-image generation using Stable Diffusion.

    This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        vae ([`FlaxAutoencoderKL`]):
            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
        text_encoder ([`~transformers.FlaxCLIPTextModel`]):
            Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        unet ([`FlaxUNet2DConditionModel`]):
            A `FlaxUNet2DConditionModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or
            [`FlaxDPMSolverMultistepScheduler`].
        safety_checker ([`FlaxStableDiffusionSafetyChecker`]):
            Classification module that estimates whether generated images could be considered offensive or harmful.
            Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
            about a model's potential harms.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
    """

    def __init__(
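
Editor's note: a minimal image-to-image usage sketch for this Flax pipeline, assuming a multi-device JAX setup. The checkpoint/revision, the local input file, and the prompt are illustrative; `prepare_inputs` here is assumed to accept both `prompt` and `image`, matching the pipeline's documented example.

```py
import jax
import jax.numpy as jnp
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from PIL import Image

from diffusers import FlaxStableDiffusionImg2ImgPipeline

pipeline, params = FlaxStableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="flax", dtype=jnp.bfloat16
)

# Illustrative local file; any RGB image resized to a supported resolution works.
init_image = Image.open("sketch-mountains-input.jpg").convert("RGB").resize((768, 512))

num_devices = jax.device_count()
prompt_ids, processed_image = pipeline.prepare_inputs(
    prompt=["A fantasy landscape, trending on artstation"] * num_devices,
    image=[init_image] * num_devices,
)

params = replicate(params)
prng_seed = jax.random.split(jax.random.PRNGKey(0), num_devices)
prompt_ids, processed_image = shard(prompt_ids), shard(processed_image)

# strength=0.75 keeps the overall layout of init_image while allowing substantial changes.
images = pipeline(
    prompt_ids=prompt_ids, image=processed_image, params=params, prng_seed=prng_seed,
    strength=0.75, num_inference_steps=50, height=512, width=768, jit=True,
).images
```
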
@@ -353,52 +352,58 @@ class FlaxStableDiffusionImg2ImgPipeline(FlaxDiffusionPipeline):
        jit: bool = False,
    ):
        r"""
        The call function to the pipeline for generation.

        Args:
            prompt_ids (`jnp.array`):
                The prompt or prompts to guide image generation.
            image (`jnp.array`):
                Array representing an image batch to be used as the starting point.
            params (`Dict` or `FrozenDict`):
                Dictionary containing the model parameters/weights.
            prng_seed (`jax.random.KeyArray` or `jax.Array`):
                Array containing the random number generator key.
            strength (`float`, *optional*, defaults to 0.8):
                Indicates the extent to transform the reference `image`. Must be between 0 and 1. `image` is used as
                a starting point and more noise is added the higher the `strength`. The number of denoising steps
                depends on the amount of noise initially added. When `strength` is 1, added noise is maximum and the
                denoising process runs for the full number of iterations specified in `num_inference_steps`. A value
                of 1 essentially ignores `image`.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference. This parameter is modulated by `strength`.
            height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            noise (`jnp.array`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. The array is generated
                by sampling using the supplied random `generator`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] instead
                of a plain tuple.
            jit (`bool`, defaults to `False`):
                Whether to run `pmap` versions of the generation and safety scoring functions.

                <Tip warning={true}>

                This argument exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a
                future release.

                </Tip>

        Examples:

        Returns:
            [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] is
                returned, otherwise a `tuple` is returned where the first element is a list with the generated images
                and the second element is a list of `bool`s indicating whether the corresponding generated image
                contains "not-safe-for-work" (nsfw) content.
        """
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
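
Editor's note: the interaction between `strength` and `num_inference_steps` described above can be summarized with a small back-of-the-envelope calculation. The helper below is purely illustrative and not part of the pipeline API; it assumes the usual image-to-image behaviour where only the tail of the schedule is applied to the noised input image.

```py
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Roughly how many of the scheduled steps actually run in image-to-image mode.

    The schedule is truncated so that only about `num_inference_steps * strength`
    steps are applied to the noised version of the input image.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)

# strength=0.8 with 50 scheduled steps runs roughly 40 denoising steps,
# while strength=1.0 runs the full 50 and essentially ignores the input image.
print(effective_denoising_steps(50, 0.8))  # 40
print(effective_denoising_steps(50, 1.0))  # 50
```
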
@@ -101,31 +101,36 @@ EXAMPLE_DOC_STRING = """
class FlaxStableDiffusionInpaintPipeline(FlaxDiffusionPipeline):
    r"""
    Flax-based pipeline for text-guided image inpainting using Stable Diffusion.

    <Tip warning={true}>

    🧪 This is an experimental feature!

    </Tip>

    This model inherits from [`FlaxDiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        vae ([`FlaxAutoencoderKL`]):
            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
        text_encoder ([`~transformers.FlaxCLIPTextModel`]):
            Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        unet ([`FlaxUNet2DConditionModel`]):
            A `FlaxUNet2DConditionModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`FlaxDDIMScheduler`], [`FlaxLMSDiscreteScheduler`], [`FlaxPNDMScheduler`], or
            [`FlaxDPMSolverMultistepScheduler`].
        safety_checker ([`FlaxStableDiffusionSafetyChecker`]):
            Classification module that estimates whether generated images could be considered offensive or harmful.
            Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
            about a model's potential harms.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
    """

    def __init__(
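
Editor's note: loading this experimental pipeline follows the same pattern as the other Flax pipelines. The sketch below only covers loading; the checkpoint name is illustrative (it must ship Flax weights), and the exact `prepare_inputs`/`__call__` argument names for masks should be checked against the pipeline signature since this pipeline is still experimental.

```py
import jax.numpy as jnp

from diffusers import FlaxStableDiffusionInpaintPipeline

# Illustrative Flax inpainting checkpoint; any inpainting checkpoint with Flax weights works.
pipeline, params = FlaxStableDiffusionInpaintPipeline.from_pretrained(
    "xvjiarui/stable-diffusion-2-inpainting", dtype=jnp.bfloat16
)

# Prompts, init images, and masks are then tokenized/preprocessed with pipeline.prepare_inputs,
# replicated and sharded across devices, and passed to pipeline(..., jit=True) exactly as in the
# text-to-image and image-to-image sketches above.
```
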
@@ -408,27 +413,31 @@ class FlaxStableDiffusionInpaintPipeline(FlaxDiffusionPipeline):
        Args:
            prompt (`str` or `List[str]`):
                The prompt or prompts to guide image generation.
            height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference. This parameter is modulated by `strength`.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            latents (`jnp.array`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a
                latents array is generated by sampling using the supplied random `generator`.
            jit (`bool`, defaults to `False`):
                Whether to run `pmap` versions of the generation and safety scoring functions.

                <Tip warning={true}>

                This argument exists because `__call__` is not yet end-to-end pmap-able. It will be removed in a
                future release.

                </Tip>

            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] instead
                of a plain tuple.
@@ -437,10 +446,10 @@ class FlaxStableDiffusionInpaintPipeline(FlaxDiffusionPipeline):
        Returns:
            [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput`] is
                returned, otherwise a `tuple` is returned where the first element is a list with the generated images
                and the second element is a list of `bool`s indicating whether the corresponding generated image
                contains "not-safe-for-work" (nsfw) content.
        """
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
@@ -73,36 +73,33 @@ class StableDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lo
    r"""
    Pipeline for text-to-image generation using Stable Diffusion.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    The pipeline also inherits the following loading methods:
        - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
        - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
        - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
        - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files

    Args:
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
        text_encoder ([`~transformers.CLIPTextModel`]):
            Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        unet ([`UNet2DConditionModel`]):
            A `UNet2DConditionModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
        safety_checker ([`StableDiffusionSafetyChecker`]):
            Classification module that estimates whether generated images could be considered offensive or harmful.
            Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
            about a model's potential harms.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
    """

    _optional_components = ["safety_checker", "feature_extractor"]
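
Editor's note: a minimal usage sketch for the PyTorch pipeline documented above. The checkpoint name and prompt are illustrative; the commented-out lines show where the loading mixins listed in the docstring hook in, using hypothetical local file paths.

```py
import torch

from diffusers import StableDiffusionPipeline

# Checkpoint name is illustrative; any Stable Diffusion v1-style checkpoint works.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# The loading mixins documented above hook in here, e.g.:
# pipe.load_lora_weights("path/to/lora")                 # hypothetical local LoRA weights
# pipe.load_textual_inversion("path/to/embedding.bin")   # hypothetical textual inversion embedding

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut.png")
```
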
@@ -198,42 +195,39 @@ class StableDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lo
    def enable_vae_slicing(self):
        r"""
        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    def disable_vae_slicing(self):
        r"""
        Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()

    def enable_vae_tiling(self):
        r"""
        Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
        compute decoding and encoding in several steps. This is useful for saving a large amount of memory and
        allows processing larger images.
        """
        self.vae.enable_tiling()

    def disable_vae_tiling(self):
        r"""
        Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_tiling()
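
Editor's note: a short sketch of when to toggle the slicing and tiling switches above. It assumes `pipe` is an already loaded `StableDiffusionPipeline`; prompts, batch size, and resolutions are illustrative.

```py
# Sliced decoding trades a little speed for lower peak memory on large batches.
pipe.enable_vae_slicing()
images = pipe(["a watercolor lighthouse"] * 8).images
pipe.disable_vae_slicing()

# Tiled decoding helps when generating very large images.
pipe.enable_vae_tiling()
large_image = pipe("an aerial photo of a coastline", height=1024, width=1024).images[0]
pipe.disable_vae_tiling()
```
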
    def enable_model_cpu_offload(self, gpu_id=0):
        r"""
        Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at
        a time to the GPU when its `forward` method is called, and the model remains on the GPU until the next model
        runs. Memory savings are lower than with `enable_sequential_cpu_offload`, but performance is much better due
        to the iterative execution of the `unet`.
        """
        if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
            from accelerate import cpu_offload_with_hook
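
Editor's note: model offloading as documented above is enabled with a single call before inference. A minimal sketch, assuming an illustrative checkpoint and that `accelerate >= 0.17.0` is installed (matching the version check in the method body).

```py
import torch

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

# Offloading manages device placement itself, so the pipeline is not moved to CUDA first.
pipe.enable_model_cpu_offload()
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```
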
@@ -542,78 +536,69 @@ class StableDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, Lo
        guidance_rescale: float = 0.0,
    ):
        r"""
        The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
            height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide what to not include in image generation. If not defined, you need to
                pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            eta (`float`, *optional*, defaults to 0.0):
                Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only
                applies to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
                generation deterministic.
            latents (`torch.FloatTensor`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a
                latents tensor is generated by sampling using the supplied random `generator`.
            prompt_embeds (`torch.FloatTensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
                provided, text embeddings are generated from the `prompt` input argument.
            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting).
                If not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of
                a plain tuple.
            callback (`Callable`, *optional*):
                A function that is called every `callback_steps` steps during inference. The function is called with
                the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
            callback_steps (`int`, *optional*, defaults to 1):
                The frequency at which the `callback` function is called. If not specified, the callback is called
                at every step.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
            guidance_rescale (`float`, *optional*, defaults to 0.0):
                Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
                Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
                using zero terminal SNR.

        Examples:

        Returns:
            [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is
                returned, otherwise a `tuple` is returned where the first element is a list with the generated
                images and the second element is a list of `bool`s indicating whether the corresponding generated
                image contains "not-safe-for-work" (nsfw) content.
        """
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
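
Editor's note: a sketch of a call that exercises the main arguments documented above (negative prompt, deterministic generator, progress callback, guidance rescale). It assumes `pipe` is an already loaded `StableDiffusionPipeline` on the GPU; all values are illustrative.

```py
import torch

generator = torch.Generator(device="cuda").manual_seed(0)

def log_progress(step, timestep, latents):
    # Called every `callback_steps` steps with the current latents.
    print(f"step {step} (timestep {timestep}), latents shape: {tuple(latents.shape)}")

result = pipe(
    prompt="a cozy cabin in a snowy forest, oil painting",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=generator,      # makes generation reproducible
    callback=log_progress,
    callback_steps=5,
    guidance_rescale=0.0,     # raise (e.g. 0.7) for zero-terminal-SNR checkpoints
)
image, nsfw_flags = result.images[0], result.nsfw_content_detected
```
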
@@ -164,30 +164,29 @@ class AttendExciteAttnProcessor:
class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversionLoaderMixin):
    r"""
    Pipeline for text-to-image generation using Stable Diffusion and Attend-and-Excite.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
        text_encoder ([`~transformers.CLIPTextModel`]):
            Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        unet ([`UNet2DConditionModel`]):
            A `UNet2DConditionModel` to denoise the encoded image latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
            [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
        safety_checker ([`StableDiffusionSafetyChecker`]):
            Classification module that estimates whether generated images could be considered offensive or harmful.
            Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
            about a model's potential harms.
        feature_extractor ([`~transformers.CLIPImageProcessor`]):
            A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
    """

    _optional_components = ["safety_checker", "feature_extractor"]
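
Editor's note: a minimal Attend-and-Excite usage sketch. The checkpoint, prompt, and token indices are illustrative; the indices depend on how the tokenizer splits the prompt, which is why the pipeline's `get_indices` helper is printed first.

```py
import torch

from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"

# Shows how the prompt is tokenized so the token indices to excite can be chosen.
print(pipe.get_indices(prompt))

# Excite the tokens for "cat" and "frog" (indices are illustrative).
image = pipe(
    prompt,
    token_indices=[2, 5],
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,
).images[0]
```
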
@@ -236,17 +235,15 @@ class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversion
    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
    def enable_vae_slicing(self):
        r"""
        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
        """
        self.vae.enable_slicing()

    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
    def disable_vae_slicing(self):
        r"""
        Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
        computing decoding in one step.
        """
        self.vae.disable_slicing()
@@ -702,75 +699,66 @@ class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversion
        attn_res: Optional[Tuple[int]] = (16, 16),
    ):
        r"""
        The call function to the pipeline for generation.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
            token_indices (`List[int]`):
                The token indices to alter with attend-and-excite.
            height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts to guide what to not include in image generation. If not defined, you need to
                pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            eta (`float`, *optional*, defaults to 0.0):
                Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only
                applies to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
                generation deterministic.
            latents (`torch.FloatTensor`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
                generation. Can be used to tweak the same generation with different prompts. If not provided, a
                latents tensor is generated by sampling using the supplied random `generator`.
            prompt_embeds (`torch.FloatTensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
                provided, text embeddings are generated from the `prompt` input argument.
            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting).
                If not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of
                a plain tuple.
            callback (`Callable`, *optional*):
                A function that is called every `callback_steps` steps during inference. The function is called with
                the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
            callback_steps (`int`, *optional*, defaults to 1):
                The frequency at which the `callback` function is called. If not specified, the callback is called
                at every step.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
            max_iter_to_alter (`int`, *optional*, defaults to `25`):
                Number of denoising steps to apply attend-and-excite. The first `max_iter_to_alter` denoising steps
                are when attend-and-excite is applied. For example, if `max_iter_to_alter` is `25` and there are a
                total of `30` denoising steps, the first `25` denoising steps apply attend-and-excite and the last
                `5` will not.
            thresholds (`dict`, *optional*, defaults to `{0: 0.05, 10: 0.5, 20: 0.8}`):
                Dictionary defining the iterations and desired thresholds to apply iterative latent refinement in.
            scale_factor (`int`, *optional*, defaults to 20):
                Scale factor to control the step size of each attend-and-excite update.
            attn_res (`tuple`, *optional*, default computed from width and height):
                The 2D resolution of the semantic attention map.
...@@ -778,10 +766,10 @@ class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversion ...@@ -778,10 +766,10 @@ class StableDiffusionAttendAndExcitePipeline(DiffusionPipeline, TextualInversion
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. :type attention_store: object "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
......
...@@ -64,29 +64,25 @@ def preprocess(image): ...@@ -64,29 +64,25 @@ def preprocess(image):
class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
r""" r"""
Pipeline for text-guided image to image generation using Stable Diffusion. Pipeline for text-guided depth-based image-to-image generation using Stable Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods: The pipeline also inherits the following loading methods:
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
- [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
as well as the following saving methods:
- *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`]
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
...@@ -521,68 +517,60 @@ class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoader ...@@ -521,68 +517,60 @@ class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoader
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, or tensor representing an image batch, that will be used as the starting point for the `Image` or tensor representing an image batch to be used as the starting point. Can accept image
process. Can accept image latents as `image` only if `depth_map` is not `None`. latents as `image` only if `depth_map` is not `None`.
depth_map (`torch.FloatTensor`, *optional*): depth_map (`torch.FloatTensor`, *optional*):
depth prediction that will be used as additional conditioning for the image generation process. If not Depth prediction to be used as additional conditioning for the image generation process. If not
defined, it will automatically predicts the depth via `self.depth_estimator`. defined, the depth is automatically predicted with `self.depth_estimator`.
strength (`float`, *optional*, defaults to 0.8): strength (`float`, *optional*, defaults to 0.8):
Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a
will be used as a starting point, adding more noise to it the larger the `strength`. The number of starting point and more noise is added the higher the `strength`. The number of denoising steps depends
denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
be maximum and the denoising process will run for the full number of iterations specified in process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
`num_inference_steps`. A value of 1, therefore, essentially ignores `image`. essentially ignores `image`.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. This parameter will be modulated by `strength`. expense of slower inference. This parameter is modulated by `strength`.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
...@@ -609,10 +597,8 @@ class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoader ...@@ -609,10 +597,8 @@ class StableDiffusionDepth2ImgPipeline(DiffusionPipeline, TextualInversionLoader
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images.
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
(nsfw) content, according to the `safety_checker`.
""" """
# 1. Check inputs # 1. Check inputs
self.check_inputs( self.check_inputs(
......
...@@ -239,10 +239,16 @@ def preprocess_mask(mask, batch_size: int = 1): ...@@ -239,10 +239,16 @@ def preprocess_mask(mask, batch_size: int = 1):
class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
r""" r"""
Pipeline for text-guided image inpainting using Stable Diffusion using DiffEdit. *This is an experimental feature*. <Tip warning={true}>
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This is an experimental feature!
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
</Tip>
Pipeline for text-guided image inpainting using Stable Diffusion and DiffEdit.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods: In addition the pipeline inherits the following loading methods:
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`]
...@@ -253,24 +259,23 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -253,24 +259,23 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
tokenizer (`CLIPTokenizer`): tokenizer (`CLIPTokenizer`):
Tokenizer of class A [`~transformers.CLIPTokenizer`] to tokenize text.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer). unet ([`UNet2DConditionModel`]):
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. A [`UNet2DConditionModel`] to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. A scheduler to be used in combination with `unet` to denoise the encoded image latents.
inverse_scheduler (`[DDIMInverseScheduler]`): inverse_scheduler (`[DDIMInverseScheduler]`):
A scheduler to be used in combination with `unet` to fill in the unmasked part of the input latents A scheduler to be used in combination with `unet` to fill in the unmasked part of the input latents.
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
about a model's potential harms.
feature_extractor ([`CLIPImageProcessor`]): feature_extractor ([`CLIPImageProcessor`]):
Model that extracts features from generated images to be used as inputs for the `safety_checker`. A [`CLIPImageProcessor`] to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor", "inverse_scheduler"] _optional_components = ["safety_checker", "feature_extractor", "inverse_scheduler"]
...@@ -370,17 +375,15 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -370,17 +375,15 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
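Since `enable_vae_slicing`/`disable_vae_slicing` are copied verbatim from `StableDiffusionPipeline`, a plain text-to-image pipeline is enough to sketch their effect; the checkpoint is the one referenced in the safety-checker notes, and the batch size is an arbitrary assumption.

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_slicing()   # decode the final latents slice by slice to lower peak memory
images = pipe(["a photo of an astronaut riding a horse"] * 8, num_inference_steps=25).images
pipe.disable_vae_slicing()  # restore single-pass decoding
```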
...@@ -388,17 +391,16 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -388,17 +391,16 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self): def enable_vae_tiling(self):
r""" r"""
Enable tiled VAE decoding. Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in processing larger images.
several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
""" """
self.vae.enable_tiling() self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self): def disable_vae_tiling(self):
r""" r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_tiling() self.vae.disable_tiling()
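Continuing the slicing sketch above (reusing the same `pipe`), tiling targets single large images rather than large batches; the resolution below is only an example and assumes the checkpoint handles it reasonably.

```py
pipe.enable_vae_tiling()    # decode/encode in overlapping tiles
panorama = pipe(
    "a detailed panorama of a mountain range at sunset",
    height=768,
    width=2048,
    num_inference_steps=30,
).images[0]
pipe.disable_vae_tiling()
```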
...@@ -406,10 +408,10 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -406,10 +408,10 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
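A minimal setup sketch for the DiffEdit pipeline with model offloading is shown below; the checkpoint is an assumption, and the DDIM/DDIM-inverse scheduler pairing follows the usual DiffEdit recipe rather than anything mandated by this diff. The later mask, inversion, and inpainting sketches reuse this `pipe`.

```py
import torch
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline

# Illustrative checkpoint.
pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
# DiffEdit needs a forward scheduler for denoising and an inverse scheduler for inversion.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

# Keep only the sub-model whose `forward` is running on the GPU; the rest wait on the CPU.
pipe.enable_model_cpu_offload()
```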
...@@ -826,6 +828,7 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -826,6 +828,7 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
) )
@torch.no_grad() @torch.no_grad()
@replace_example_docstring(EXAMPLE_DOC_STRING)
def generate_mask( def generate_mask(
self, self,
image: Union[torch.FloatTensor, PIL.Image.Image] = None, image: Union[torch.FloatTensor, PIL.Image.Image] = None,
...@@ -847,48 +850,42 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -847,48 +850,42 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function used to generate a latent mask given a mask prompt, a target prompt, and an image. Generate a latent mask given a mask prompt, a target prompt, and an image.
Args: Args:
image (`PIL.Image.Image`): image (`PIL.Image.Image`):
`Image`, or tensor representing an image batch which will be used for computing the mask. `Image` or tensor representing an image batch to be used for computing the mask.
target_prompt (`str` or `List[str]`, *optional*): target_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the semantic mask generation. If not defined, one has to pass The prompt or prompts to guide semantic mask generation. If not defined, you need to pass
`prompt_embeds`. instead. `target_prompt_embeds` instead.
target_negative_prompt (`str` or `List[str]`, *optional*): target_negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `target_negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
target_prompt_embeds (`torch.FloatTensor`, *optional*): target_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
target_negative_prompt_embeds (`torch.FloatTensor`, *optional*): target_negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
source_prompt (`str` or `List[str]`, *optional*): source_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the semantic mask generation using the method in [DiffEdit: The prompt or prompts to guide semantic mask generation using DiffEdit. If not defined, you need to
Diffusion-Based Semantic Image Editing with Mask Guidance](https://arxiv.org/pdf/2210.11427.pdf). If pass `source_prompt_embeds` or `source_image` instead.
not defined, one has to pass `source_prompt_embeds` or `source_image` instead.
source_negative_prompt (`str` or `List[str]`, *optional*): source_negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the semantic mask generation away from using the method in [DiffEdit: The prompt or prompts to guide semantic mask generation away from using DiffEdit. If not defined, you
Diffusion-Based Semantic Image Editing with Mask Guidance](https://arxiv.org/pdf/2210.11427.pdf). If need to pass `source_negative_prompt_embeds` or `source_image` instead.
not defined, one has to pass `source_negative_prompt_embeds` or `source_image` instead.
source_prompt_embeds (`torch.FloatTensor`, *optional*): source_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings to guide the semantic mask generation. Can be used to easily tweak text Pre-generated text embeddings to guide the semantic mask generation. Can be used to easily tweak text
inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from inputs (prompt weighting). If not provided, text embeddings are generated from the `source_prompt` input
`source_prompt` input argument. argument.
source_negative_prompt_embeds (`torch.FloatTensor`, *optional*): source_negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings to negatively guide the semantic mask generation. Can be used to easily Pre-generated text embeddings to negatively guide the semantic mask generation. Can be used to easily
tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the
`source_negative_prompt` input argument. `source_negative_prompt` input argument.
num_maps_per_mask (`int`, *optional*, defaults to 10): num_maps_per_mask (`int`, *optional*, defaults to 10):
The number of noise maps sampled to generate the semantic mask using the method in [DiffEdit: The number of noise maps sampled to generate the semantic mask using DiffEdit.
Diffusion-Based Semantic Image Editing with Mask Guidance](https://arxiv.org/pdf/2210.11427.pdf).
mask_encode_strength (`float`, *optional*, defaults to 0.5): mask_encode_strength (`float`, *optional*, defaults to 0.5):
Conceptually, the strength of the noise maps sampled to generate the semantic mask using the method in The strength of the noise maps sampled to generate the semantic mask using DiffEdit. Must be between 0
[DiffEdit: Diffusion-Based Semantic Image Editing with Mask Guidance]( and 1.
https://arxiv.org/pdf/2210.11427.pdf). Must be between 0 and 1.
mask_thresholding_ratio (`float`, *optional*, defaults to 3.0): mask_thresholding_ratio (`float`, *optional*, defaults to 3.0):
The maximum multiple of the mean absolute difference used to clamp the semantic guidance map before The maximum multiple of the mean absolute difference used to clamp the semantic guidance map before
mask binarization. mask binarization.
...@@ -896,30 +893,25 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -896,30 +893,25 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
Returns: Returns:
`List[PIL.Image.Image]` or `np.array`: `List[PIL.Image.Image]` if `output_type` is `"pil"`, otherwise a `List[PIL.Image.Image]` or `np.array`:
`np.array`. When returning a `List[PIL.Image.Image]`, the list will consist of a batch of single-channel When returning a `List[PIL.Image.Image]`, the list consists of a batch of single-channel binary images
binary image with dimensions `(height // self.vae_scale_factor, width // self.vae_scale_factor)`, otherwise with dimensions `(height // self.vae_scale_factor, width // self.vae_scale_factor)`. If it's
the `np.array` will have shape `(batch_size, height // self.vae_scale_factor, width // `np.array`, the shape is `(batch_size, height // self.vae_scale_factor, width //
self.vae_scale_factor)`. self.vae_scale_factor)`.
""" """
# 1. Check inputs (Provide dummy argument for callback_steps) # 1. Check inputs (Provide dummy argument for callback_steps)
...@@ -1072,78 +1064,72 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -1072,78 +1064,72 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
num_auto_corr_rolls: int = 5, num_auto_corr_rolls: int = 5,
): ):
r""" r"""
Function used to generate inverted latents given a prompt and image. Generate inverted latents given a prompt and image.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
image (`PIL.Image.Image`): image (`PIL.Image.Image`):
`Image`, or tensor representing an image batch to produce the inverted latents, guided by `prompt`. `Image` or tensor representing an image batch to produce the inverted latents guided by `prompt`.
inpaint_strength (`float`, *optional*, defaults to 0.8): inpaint_strength (`float`, *optional*, defaults to 0.8):
Conceptually, indicates how far into the noising process to run latent inversion. Must be between 0 and Indicates extent of the noising process to run latent inversion. Must be between 0 and 1. When
1. When `strength` is 1, the inversion process will be run for the full number of iterations specified `strength` is 1, the inversion process is run for the full number of iterations specified in
in `num_inference_steps`. `image` will be used as a reference for the inversion process, adding more `num_inference_steps`. `image` is used as a reference for the inversion process, adding more noise the
noise the larger the `strength`. If `strength` is 0, no inpainting will occur. larger the `strength`. If `strength` is 0, no inpainting occurs.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
decode_latents (`bool`, *optional*, defaults to `False`): decode_latents (`bool`, *optional*, defaults to `False`):
Whether or not to decode the inverted latents into a generated image. Setting this argument to `True` Whether or not to decode the inverted latents into a generated image. Setting this argument to `True`
will decode all inverted latents for each timestep into a list of generated images. decodes all inverted latents for each timestep into a list of generated images.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.DiffEditInversionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.DiffEditInversionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
lambda_auto_corr (`float`, *optional*, defaults to 20.0): lambda_auto_corr (`float`, *optional*, defaults to 20.0):
Lambda parameter to control auto correction Lambda parameter to control auto correction.
lambda_kl (`float`, *optional*, defaults to 20.0): lambda_kl (`float`, *optional*, defaults to 20.0):
Lambda parameter to control Kullback–Leibler divergence output Lambda parameter to control Kullback–Leibler divergence output.
num_reg_steps (`int`, *optional*, defaults to 0): num_reg_steps (`int`, *optional*, defaults to 0):
Number of regularization loss steps Number of regularization loss steps.
num_auto_corr_rolls (`int`, *optional*, defaults to 5): num_auto_corr_rolls (`int`, *optional*, defaults to 5):
Number of auto correction roll steps Number of auto correction roll steps.
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput`] or [`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput`] or
`tuple`: [`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput`] `tuple`:
if `return_dict` is `True`, otherwise a `tuple`. When returning a tuple, the first element is the inverted If `return_dict` is `True`,
latents tensors ordered by increasing noise, and then second is the corresponding decoded images if [`~pipelines.stable_diffusion.pipeline_stable_diffusion_diffedit.DiffEditInversionPipelineOutput`] is
`decode_latents` is `True`, otherwise `None`. returned, otherwise a `tuple` is returned where the first element is the inverted latents tensors
ordered by increasing noise, and the second is the corresponding decoded images if `decode_latents` is
`True`, otherwise `None`.
""" """
# 1. Check inputs # 1. Check inputs
...@@ -1309,81 +1295,73 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -1309,81 +1295,73 @@ class StableDiffusionDiffEditPipeline(DiffusionPipeline, TextualInversionLoaderM
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
mask_image (`PIL.Image.Image`): mask_image (`PIL.Image.Image`):
`Image`, or tensor representing an image batch, to mask the generated image. White pixels in the mask `Image` or tensor representing an image batch to mask the generated image. White pixels in the mask are
will be repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be repainted, while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a
converted to a single channel (luminance) before use. If it's a tensor, it should contain one color single channel (luminance) before use. If it's a tensor, it should contain one color channel (L)
channel (L) instead of 3, so the expected shape would be `(B, 1, H, W)`. instead of 3, so the expected shape would be `(B, 1, H, W)`.
image_latents (`PIL.Image.Image` or `torch.FloatTensor`): image_latents (`PIL.Image.Image` or `torch.FloatTensor`):
Partially noised image latents from the inversion process to be used as inputs for image generation. Partially noised image latents from the inversion process to be used as inputs for image generation.
inpaint_strength (`float`, *optional*, defaults to 0.8): inpaint_strength (`float`, *optional*, defaults to 0.8):
Conceptually, indicates how much to inpaint the masked area. Must be between 0 and 1. When `strength` Indicates extent to inpaint the masked area. Must be between 0 and 1. When `strength` is 1, the
is 1, the denoising process will be run on the masked area for the full number of iterations specified denoising process is run on the masked area for the full number of iterations specified in
in `num_inference_steps`. `image_latents` will be used as a reference for the masked area, adding more `num_inference_steps`. `image_latents` is used as a reference for the masked area, adding more noise to
noise to that region the larger the `strength`. If `strength` is 0, no inpainting will occur. that region the larger the `strength`. If `strength` is 0, no inpainting occurs.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 1. Check inputs # 1. Check inputs
......
...@@ -36,27 +36,31 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -36,27 +36,31 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
class StableDiffusionImageVariationPipeline(DiffusionPipeline): class StableDiffusionImageVariationPipeline(DiffusionPipeline):
r""" r"""
Pipeline to generate variations from an input image using Stable Diffusion. Pipeline to generate image variations from an input image using Stable Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
image_encoder ([`CLIPVisionModelWithProjection`]): image_encoder ([`~transformers.CLIPVisionModelWithProjection`]):
Frozen CLIP image-encoder. Stable Diffusion Image Variation uses the vision portion of Frozen CLIP image-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModelWithProjection), text_encoder ([`~transformers.CLIPTextModel`]):
specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents. tokenizer ([`~transformers.CLIPTokenizer`]):
A `CLIPTokenizer` to tokenize text.
unet ([`UNet2DConditionModel`]):
A `UNet2DConditionModel` to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
# TODO: feature_extractor is required to encode images (if they are in PIL format), # TODO: feature_extractor is required to encode images (if they are in PIL format),
# we should give a descriptive message if the pipeline doesn't have one. # we should give a descriptive message if the pipeline doesn't have one.
...@@ -253,58 +257,74 @@ class StableDiffusionImageVariationPipeline(DiffusionPipeline): ...@@ -253,58 +257,74 @@ class StableDiffusionImageVariationPipeline(DiffusionPipeline):
callback_steps: int = 1, callback_steps: int = 1,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`): image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`):
The image or images to guide the image generation. If you provide a tensor, it needs to comply with the Image or images to guide image generation. If you provide a tensor, it needs to be compatible with
configuration of [`CLIPImageProcessor`](https://huggingface.co/lambdalabs/sd-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json).
[this](https://huggingface.co/lambdalabs/sd-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json) height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
`CLIPImageProcessor`
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference. This parameter is modulated by `strength`.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
Examples:
```py
from diffusers import StableDiffusionImageVariationPipeline
from PIL import Image
from io import BytesIO
import requests
pipe = StableDiffusionImageVariationPipeline.from_pretrained(
"lambdalabs/sd-image-variations-diffusers", revision="v2.0"
)
pipe = pipe.to("cuda")
url = "https://lh3.googleusercontent.com/y-iFOHfLTwkuQSUegpwDdgKmOjRSTvPxat63dQLB25xkTs4lhIbRUFeNBWZzYf370g=s1200"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
out = pipe(image, num_images_per_prompt=3, guidance_scale=15)
out["images"][0].save("result.jpg")
```
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
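A minimal follow-up sketch to the example above, showing how the `generator`, `num_images_per_prompt`, and `output_type` arguments documented here combine for reproducible variations; the checkpoint and image URL are taken from the example above, while the seed and step count are illustrative.

```py
import torch
import requests
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionImageVariationPipeline

pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers", revision="v2.0"
).to("cuda")

url = "https://lh3.googleusercontent.com/y-iFOHfLTwkuQSUegpwDdgKmOjRSTvPxat63dQLB25xkTs4lhIbRUFeNBWZzYf370g=s1200"
image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")

# A fixed generator makes the sampled latents, and therefore the variations, reproducible.
generator = torch.Generator(device="cuda").manual_seed(0)
out = pipe(
    image,
    num_images_per_prompt=2,
    guidance_scale=7.5,
    num_inference_steps=50,
    generator=generator,
    output_type="pil",
)
out.images[0].save("variation_0.png")
```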
...@@ -102,38 +102,35 @@ class StableDiffusionImg2ImgPipeline( ...@@ -102,38 +102,35 @@ class StableDiffusionImg2ImgPipeline(
DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin
): ):
r""" r"""
Pipeline for text-guided image to image generation using Stable Diffusion. Pipeline for text-guided image-to-image generation using Stable Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods: The pipeline also inherits the following loading methods:
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
- *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
- [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files
as well as the following saving methods:
- *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`]
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
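A minimal loading sketch for this pipeline and the loader mixins listed above; `runwayml/stable-diffusion-v1-5` comes from the model card linked above, while the textual inversion repo, LoRA path, and `.ckpt` path are illustrative placeholders.

```py
import torch
from diffusers import StableDiffusionImg2ImgPipeline

# The components listed above (vae, text_encoder, tokenizer, unet, scheduler,
# safety_checker, feature_extractor) are assembled automatically from the Hub repo.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The loader mixins listed above are available directly on the pipeline instance.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")  # textual inversion embedding (illustrative repo)
# pipe.load_lora_weights("path/to/lora_weights")            # placeholder LoRA path
# pipe = StableDiffusionImg2ImgPipeline.from_single_file("path/to/model.ckpt")  # placeholder .ckpt path
```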
...@@ -230,10 +227,10 @@ class StableDiffusionImg2ImgPipeline( ...@@ -230,10 +227,10 @@ class StableDiffusionImg2ImgPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
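A short sketch of how `enable_model_cpu_offload` is typically used, assuming `accelerate>=0.17.0` is installed as the check above requires; the checkpoint name is illustrative.

```py
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Whole-model offloading: each sub-model is moved to the GPU only for its forward
# pass and stays there until the next sub-model runs. Note that `.to("cuda")` is
# not called on the pipeline itself.
pipe.enable_model_cpu_offload()

# Maximum memory savings at a larger speed cost:
# pipe.enable_sequential_cpu_offload()
```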
...@@ -593,74 +590,66 @@ class StableDiffusionImg2ImgPipeline( ...@@ -593,74 +590,66 @@ class StableDiffusionImg2ImgPipeline(
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, or tensor representing an image batch, that will be used as the starting point for the `Image` or tensor representing an image batch to be used as the starting point. Can also accept image
process. Can also accpet image latents as `image`, if passing latents directly, it will not be encoded latents as `image`, but if passing latents directly it is not encoded again.
again.
strength (`float`, *optional*, defaults to 0.8): strength (`float`, *optional*, defaults to 0.8):
Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image` Indicates the extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a
will be used as a starting point, adding more noise to it the larger the `strength`. The number of starting point and more noise is added the higher the `strength`. The number of denoising steps depends
denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
be maximum and the denoising process will run for the full number of iterations specified in process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
`num_inference_steps`. A value of 1, therefore, essentially ignores `image`. essentially ignores `image`.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. This parameter will be modulated by `strength`. expense of slower inference. This parameter is modulated by `strength`.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 1. Check inputs. Raise error if not correct # 1. Check inputs. Raise error if not correct
self.check_inputs(prompt, strength, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds) self.check_inputs(prompt, strength, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds)
......
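A minimal call sketch for this pipeline showing `strength` and `guidance_scale` as documented above; the checkpoint, prompt, and input-image URL are illustrative.

```py
import torch
import requests
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB").resize((768, 512))

# strength=0.75 keeps part of the original layout; with 50 scheduled steps the
# pipeline runs roughly 50 * 0.75 denoising steps, as described above.
result = pipe(
    prompt="A fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
    num_inference_steps=50,
)
result.images[0].save("fantasy_landscape.png")
```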
...@@ -158,45 +158,32 @@ class StableDiffusionInpaintPipeline( ...@@ -158,45 +158,32 @@ class StableDiffusionInpaintPipeline(
r""" r"""
Pipeline for text-guided image inpainting using Stable Diffusion. Pipeline for text-guided image inpainting using Stable Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods: The pipeline also inherits the following loading methods:
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
- [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
as well as the following saving methods:
- *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`]
<Tip>
It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such
as [runwayml/stable-diffusion-inpainting](https://huggingface.co/runwayml/stable-diffusion-inpainting). Default
text-to-image stable diffusion checkpoints, such as
[runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5) are also compatible with
this pipeline, but might be less performant.
</Tip>
Args: Args:
vae ([`AutoencoderKL`, `AsymmetricAutoencoderKL`]): vae ([`AutoencoderKL`, `AsymmetricAutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
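A minimal loading sketch, assuming the inpainting-specific checkpoint recommended in the tip above; `torch_dtype` is an illustrative choice.

```py
import torch
from diffusers import StableDiffusionInpaintPipeline

# Inpainting-specific checkpoint; regular text-to-image checkpoints also load
# but tend to be less performant for this task.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
```

The `safety_checker` and `feature_extractor` are optional components (see `_optional_components` above), so they can be replaced or disabled when loading.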
...@@ -298,10 +285,10 @@ class StableDiffusionInpaintPipeline( ...@@ -298,10 +285,10 @@ class StableDiffusionInpaintPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
...@@ -706,79 +693,71 @@ class StableDiffusionInpaintPipeline( ...@@ -706,79 +693,71 @@ class StableDiffusionInpaintPipeline(
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
image (`PIL.Image.Image`): image (`PIL.Image.Image`):
`Image`, or tensor representing an image batch which will be inpainted, *i.e.* parts of the image will `Image` or tensor representing an image batch to be inpainted (the parts of the image masked
be masked out with `mask_image` and repainted according to `prompt`. out with `mask_image` are repainted according to `prompt`).
mask_image (`PIL.Image.Image`): mask_image (`PIL.Image.Image`):
`Image`, or tensor representing an image batch, to mask `image`. White pixels in the mask will be `Image` or tensor representing an image batch to mask `image`. White pixels in the mask are repainted
repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be converted while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a single channel
to a single channel (luminance) before use. If it's a tensor, it should contain one color channel (L) (luminance) before use. If it's a tensor, it should contain one color channel (L) instead of 3, so the
instead of 3, so the expected shape would be `(B, H, W, 1)`. expected shape would be `(B, H, W, 1)`.
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
strength (`float`, *optional*, defaults to 1.): strength (`float`, *optional*, defaults to 1.0):
Conceptually, indicates how much to transform the masked portion of the reference `image`. Must be Indicates the extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a
between 0 and 1. `image` will be used as a starting point, adding more noise to it the larger the starting point and more noise is added the higher the `strength`. The number of denoising steps depends
`strength`. The number of denoising steps depends on the amount of noise initially added. When on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
`strength` is 1, added noise will be maximum and the denoising process will run for the full number of process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
iterations specified in `num_inference_steps`. A value of 1, therefore, essentially ignores the masked essentially ignores `image`.
portion of the reference `image`.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference. This parameter is modulated by `strength`.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
```py ```py
...@@ -812,10 +791,10 @@ class StableDiffusionInpaintPipeline( ...@@ -812,10 +791,10 @@ class StableDiffusionInpaintPipeline(
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
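A minimal call sketch showing `image`, `mask_image`, and `prompt` as documented above; the checkpoint, prompt, and example URLs are illustrative.

```py
import torch
import requests
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

image = Image.open(BytesIO(requests.get(img_url).content)).convert("RGB").resize((512, 512))
# White pixels in the mask are repainted, black pixels are preserved.
mask_image = Image.open(BytesIO(requests.get(mask_url).content)).resize((512, 512))

result = pipe(
    prompt="Face of a yellow cat, high resolution, sitting on a park bench",
    image=image,
    mask_image=mask_image,
    guidance_scale=7.5,
    num_inference_steps=50,
)
result.images[0].save("inpainted.png")
```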
...@@ -223,10 +223,10 @@ class StableDiffusionInpaintPipelineLegacy( ...@@ -223,10 +223,10 @@ class StableDiffusionInpaintPipelineLegacy(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
......
...@@ -70,37 +70,34 @@ def preprocess(image): ...@@ -70,37 +70,34 @@ def preprocess(image):
class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
r""" r"""
Pipeline for pixel-level image editing by following text instructions. Based on Stable Diffusion. Pipeline for pixel-level image editing by following text instructions (based on Stable Diffusion).
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods: The pipeline also inherits the following loading methods:
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
- [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
as well as the following saving methods:
- *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`]
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
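A minimal loading sketch for this pipeline; `timbrooks/instruct-pix2pix` is assumed here as a typical InstructPix2Pix checkpoint and is not taken from this file.

```py
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

# Assumed checkpoint with a UNet trained for instruction-based editing.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
```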
...@@ -174,64 +171,57 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion ...@@ -174,64 +171,57 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion
callback_steps: int = 1, callback_steps: int = 1,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
image (`torch.FloatTensor` `np.ndarray`, `PIL.Image.Image`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): image (`torch.FloatTensor` `np.ndarray`, `PIL.Image.Image`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, or tensor representing an image batch which will be repainted according to `prompt`. Can also `Image` or tensor representing an image batch to be repainted according to `prompt`. Can also accept
accpet image latents as `image`, if passing latents directly, it will not be encoded again. image latents as `image`, but if passing latents directly it is not encoded again.
num_inference_steps (`int`, *optional*, defaults to 100): num_inference_steps (`int`, *optional*, defaults to 100):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality. This pipeline requires a value of at least `1`.
image_guidance_scale (`float`, *optional*, defaults to 1.5): image_guidance_scale (`float`, *optional*, defaults to 1.5):
Image guidance scale is to push the generated image towards the initial image `image`. Image guidance Push the generated image towards the initial `image`. Image guidance scale is enabled by setting
scale is enabled by setting `image_guidance_scale > 1`. Higher image guidance scale encourages to `image_guidance_scale > 1`. Higher image guidance scale encourages generated images that are closely
generate images that are closely linked to the source image `image`, usually at the expense of lower linked to the source `image`, usually at the expense of lower image quality. This pipeline requires a
image quality. This pipeline requires a value of at least `1`. value of at least `1`.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
...@@ -264,10 +254,10 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion ...@@ -264,10 +254,10 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Check inputs # 0. Check inputs
self.check_inputs(prompt, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds) self.check_inputs(prompt, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds)
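A minimal call sketch showing how `guidance_scale` and `image_guidance_scale` interact as documented above; the checkpoint, edit instruction, and image URL are illustrative.

```py
import torch
import requests
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

url = "https://raw.githubusercontent.com/timothybrooks/instruct-pix2pix/main/imgs/example.jpg"
image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")

# Both scales must be at least 1 for this pipeline: guidance_scale pulls the result
# toward the edit instruction, image_guidance_scale pulls it toward the input image.
edited = pipe(
    prompt="turn the sky into a starry night",
    image=image,
    num_inference_steps=100,
    guidance_scale=7.5,
    image_guidance_scale=1.5,
)
edited.images[0].save("edited.png")
```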
...@@ -431,10 +421,10 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion ...@@ -431,10 +421,10 @@ class StableDiffusionInstructPix2PixPipeline(DiffusionPipeline, TextualInversion
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
......
...@@ -130,10 +130,10 @@ class StableDiffusionKDiffusionPipeline(DiffusionPipeline, TextualInversionLoade ...@@ -130,10 +130,10 @@ class StableDiffusionKDiffusionPipeline(DiffusionPipeline, TextualInversionLoade
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
......
...@@ -60,25 +60,22 @@ def preprocess(image): ...@@ -60,25 +60,22 @@ def preprocess(image):
class StableDiffusionLatentUpscalePipeline(DiffusionPipeline): class StableDiffusionLatentUpscalePipeline(DiffusionPipeline):
r""" r"""
Pipeline to upscale the resolution of Stable Diffusion output images by a factor of 2. Pipeline for upscaling Stable Diffusion output image resolution by a factor of 2.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of An [`EulerDiscreteScheduler`] to be used in combination with `unet` to denoise the encoded image latents.
[`EulerDiscreteScheduler`].
""" """
def __init__( def __init__(
...@@ -279,50 +276,46 @@ class StableDiffusionLatentUpscalePipeline(DiffusionPipeline): ...@@ -279,50 +276,46 @@ class StableDiffusionLatentUpscalePipeline(DiffusionPipeline):
callback_steps: int = 1, callback_steps: int = 1,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`): prompt (`str` or `List[str]`):
The prompt or prompts to guide the image upscaling. The prompt or prompts to guide image upscaling.
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, or tensor representing an image batch which will be upscaled. If it's a tensor, it can be `Image` or tensor representing an image batch to be upscaled. If it's a tensor, it can be either a
either a latent output from a stable diffusion model, or an image tensor in the range `[-1, 1]`. It latent output from a Stable Diffusion model or an image tensor in the range `[-1, 1]`. It is considered
will be considered a `latent` if `image.shape[1]` is `4`; otherwise, it will be considered to be an a `latent` if `image.shape[1]` is `4`; otherwise, it is considered to be an image representation and
image representation and encoded using this pipeline's `vae` encoder. encoded using this pipeline's `vae` encoder.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored The prompt or prompts to guide what to not include in image generation. If not defined, you need to
if `guidance_scale` is less than `1`). pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
Examples: Examples:
```py ```py
...@@ -363,10 +356,8 @@ class StableDiffusionLatentUpscalePipeline(DiffusionPipeline): ...@@ -363,10 +356,8 @@ class StableDiffusionLatentUpscalePipeline(DiffusionPipeline):
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images.
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
(nsfw) content, according to the `safety_checker`.
""" """
# 1. Check inputs # 1. Check inputs
......
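To make the `image` argument above concrete (a 4-channel latent versus a regular image), here is a hedged sketch of chaining a base Stable Diffusion pipeline with the latent upscaler. The checkpoint names (`CompVis/stable-diffusion-v1-4`, `stabilityai/sd-x2-latent-upscaler`) and the prompt are assumptions for illustration, not part of this diff.

```py
# Sketch: pass latents from a base pipeline (shape [batch, 4, h, w]) straight to the upscaler.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

base = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.manual_seed(33)

# `output_type="latent"` keeps the 4-channel latents instead of decoding them to an image
low_res_latents = base(prompt, generator=generator, output_type="latent").images
image = upscaler(
    prompt=prompt,
    image=low_res_latents,
    num_inference_steps=20,
    guidance_scale=0,
    generator=generator,
).images[0]
image.save("astronaut_2x.png")
```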
...@@ -51,6 +51,8 @@ EXAMPLE_DOC_STRING = """ ...@@ -51,6 +51,8 @@ EXAMPLE_DOC_STRING = """
>>> prompt = "a photo of an astronaut riding a horse on mars" >>> prompt = "a photo of an astronaut riding a horse on mars"
>>> output = pipe(prompt) >>> output = pipe(prompt)
>>> rgb_image, depth_image = output.rgb, output.depth >>> rgb_image, depth_image = output.rgb, output.depth
>>> rgb_image[0].save("astronaut_ldm3d_rgb.jpg")
>>> depth_image[0].save("astronaut_ldm3d_depth.png")
``` ```
""" """
...@@ -62,11 +64,11 @@ class LDM3DPipelineOutput(BaseOutput): ...@@ -62,11 +64,11 @@ class LDM3DPipelineOutput(BaseOutput):
Args: Args:
images (`List[PIL.Image.Image]` or `np.ndarray`) images (`List[PIL.Image.Image]` or `np.ndarray`)
List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width, List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width,
num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline. num_channels)`.
nsfw_content_detected (`List[bool]`) nsfw_content_detected (`List[bool]`)
List of flags denoting whether the corresponding generated image likely represents "not-safe-for-work" List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or
(nsfw) content, or `None` if safety checking could not be performed. `None` if safety checking could not be performed.
""" """
rgb: Union[List[PIL.Image.Image], np.ndarray] rgb: Union[List[PIL.Image.Image], np.ndarray]
...@@ -78,40 +80,35 @@ class StableDiffusionLDM3DPipeline( ...@@ -78,40 +80,35 @@ class StableDiffusionLDM3DPipeline(
DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin
): ):
r""" r"""
Pipeline for text-to-image and 3d generation using LDM3D. LDM3D: Latent Diffusion Model for 3D: Pipeline for text-to-image and 3D generation using LDM3D.
https://arxiv.org/abs/2305.10853
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods: The pipeline also inherits the following loading methods:
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`] - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
- *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`] - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
- [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files
as well as the following saving methods:
- *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`]
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode rgb and depth images to and from latent Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
representations. text_encoder ([`~transformers.CLIPTextModel`]):
text_encoder ([`CLIPTextModel`]): Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
Frozen text-encoder. Stable Diffusion uses the text portion of tokenizer ([`~transformers.CLIPTokenizer`]):
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically A `CLIPTokenizer` to tokenize text.
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. unet ([`UNet2DConditionModel`]):
tokenizer (`CLIPTokenizer`): A `UNet2DConditionModel` to denoise the encoded image latents.
Tokenizer of class
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded rgb and depth latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
...@@ -160,17 +157,15 @@ class StableDiffusionLDM3DPipeline( ...@@ -160,17 +157,15 @@ class StableDiffusionLDM3DPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -178,17 +173,16 @@ class StableDiffusionLDM3DPipeline( ...@@ -178,17 +173,16 @@ class StableDiffusionLDM3DPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self): def enable_vae_tiling(self):
r""" r"""
Enable tiled VAE decoding. Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and allowing
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in the processing of larger images.
several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
""" """
self.vae.enable_tiling() self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self): def disable_vae_tiling(self):
r""" r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_tiling() self.vae.disable_tiling()
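The two memory-saving toggles documented above can be flipped on an already-constructed pipeline. The sketch below is a hypothetical illustration (the `Intel/ldm3d` checkpoint name is a placeholder assumption), not part of the diff.

```py
# Sketch: trade a little speed for memory when decoding large batches or large images.
import torch
from diffusers import StableDiffusionLDM3DPipeline

pipe = StableDiffusionLDM3DPipeline.from_pretrained(
    "Intel/ldm3d", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_slicing()   # decode the batch one image at a time
pipe.enable_vae_tiling()    # decode each image tile by tile (helps with very large images)

# ... run the pipeline ...

pipe.disable_vae_slicing()  # back to single-step decoding
pipe.disable_vae_tiling()
```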
...@@ -196,10 +190,10 @@ class StableDiffusionLDM3DPipeline( ...@@ -196,10 +190,10 @@ class StableDiffusionLDM3DPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains on the GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
...@@ -498,73 +492,65 @@ class StableDiffusionLDM3DPipeline( ...@@ -498,73 +492,65 @@ class StableDiffusionLDM3DPipeline(
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 5.0): guidance_scale (`float`, *optional*, defaults to 5.0):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
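The `callback`/`callback_steps` contract described above expects a function with the signature `(step, timestep, latents)`. Below is a small hypothetical sketch of a progress-logging callback; `pipe` and `prompt` are assumed to already exist and are not defined in this diff.

```py
# Sketch: log denoising progress every 5 steps via the callback hook.
import torch

def log_progress(step: int, timestep: int, latents: torch.FloatTensor):
    # Called by the pipeline during the denoising loop.
    print(f"step {step:4d} | timestep {int(timestep):4d} | latents {tuple(latents.shape)}")

# output = pipe(prompt, callback=log_progress, callback_steps=5)
```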
...@@ -34,56 +34,36 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name ...@@ -34,56 +34,36 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
AUGS_CONST = ["A photo of ", "An image of ", "A picture of "] AUGS_CONST = ["A photo of ", "An image of ", "A picture of "]
EXAMPLE_DOC_STRING = """
Examples:
```py
>>> import torch
>>> from diffusers import StableDiffusionModelEditingPipeline
>>> model_ckpt = "CompVis/stable-diffusion-v1-4"
>>> pipe = StableDiffusionModelEditingPipeline.from_pretrained(model_ckpt)
>>> pipe = pipe.to("cuda")
>>> source_prompt = "A pack of roses"
>>> destination_prompt = "A pack of blue roses"
>>> pipe.edit_model(source_prompt, destination_prompt)
>>> prompt = "A field of roses"
>>> image = pipe(prompt).images[0]
```
"""
class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
r""" r"""
Pipeline for text-to-image model editing using "Editing Implicit Assumptions in Text-to-Image Diffusion Models". Pipeline for text-to-image model editing.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.). implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPFeatureExtractor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPFeatureExtractor`]):
A `CLIPFeatureExtractor` to extract features from generated images; used as inputs to the `safety_checker`.
with_to_k ([`bool`]): with_to_k ([`bool`]):
Whether to edit the key projection matrices along wiht the value projection matrices. Whether to edit the key projection matrices along with the value projection matrices.
with_augs ([`list`]): with_augs ([`list`]):
Textual augmentations to apply while editing the text-to-image model. Set to [] for no augmentations. Textual augmentations to apply while editing the text-to-image model. Set to `[]` for no augmentations.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
...@@ -167,17 +147,15 @@ class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoa ...@@ -167,17 +147,15 @@ class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoa
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -459,19 +437,19 @@ class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoa ...@@ -459,19 +437,19 @@ class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoa
restart_params: bool = True, restart_params: bool = True,
): ):
r""" r"""
Apply model editing via closed-form solution (see Eq. 5 in the TIME paper https://arxiv.org/abs/2303.08084) Apply model editing via closed-form solution (see Eq. 5 in the TIME [paper](https://arxiv.org/abs/2303.08084)).
Args: Args:
source_prompt (`str`): source_prompt (`str`):
The source prompt containing the concept to be edited. The source prompt containing the concept to be edited.
destination_prompt (`str`): destination_prompt (`str`):
The destination prompt. Must contain all words from source_prompt with additional ones to specify the The destination prompt. Must contain all words from `source_prompt` with additional ones to specify the
target edit. target edit.
lamb (`float`, *optional*, defaults to 0.1): lamb (`float`, *optional*, defaults to 0.1):
The lambda parameter specifying the regularization intensity. Smaller values increase the editing power. The lambda parameter specifying the regularization intensity. Smaller values increase the editing power.
restart_params (`bool`, *optional*, defaults to True): restart_params (`bool`, *optional*, defaults to True):
Restart the model parameters to their pre-trained version before editing. This is done to avoid edit Restart the model parameters to their pre-trained version before editing. This is done to avoid edit
compounding. When it is False, edits accumulate. compounding. When it is `False`, edits accumulate.
""" """
# restart LDM parameters # restart LDM parameters
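A brief sketch of the `edit_model` parameters discussed above, in particular `restart_params`, which controls whether successive edits accumulate. It assumes `pipe` is a loaded `StableDiffusionModelEditingPipeline`; the prompts are illustrative only.

```py
# Sketch: apply two TIME edits in sequence; the second call keeps the first edit
# because `restart_params=False` skips restoring the pre-trained projection matrices.
pipe.edit_model("A pack of roses", "A pack of blue roses", lamb=0.1)
pipe.edit_model("A photo of the sky", "A photo of a stormy sky", restart_params=False)
```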
...@@ -590,73 +568,82 @@ class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoa ...@@ -590,73 +568,82 @@ class StableDiffusionModelEditingPipeline(DiffusionPipeline, TextualInversionLoa
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
```py
>>> import torch
>>> from diffusers import StableDiffusionModelEditingPipeline
>>> model_ckpt = "CompVis/stable-diffusion-v1-4"
>>> pipe = StableDiffusionModelEditingPipeline.from_pretrained(model_ckpt)
>>> pipe = pipe.to("cuda")
>>> source_prompt = "A pack of roses"
>>> destination_prompt = "A pack of blue roses"
>>> pipe.edit_model(source_prompt, destination_prompt)
>>> prompt = "A field of roses"
>>> image = pipe(prompt).images[0]
```
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
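The `generator` argument documented above is how a caller makes generation reproducible. A small hedged sketch follows, assuming `pipe` is the edited pipeline from the docstring example; the seed and prompt are arbitrary.

```py
# Sketch: seed a torch.Generator so repeated calls produce the same image.
import torch

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe("A field of roses", generator=generator, num_inference_steps=50).images[0]
image.save("roses_seed0.png")
```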
...@@ -53,34 +53,29 @@ EXAMPLE_DOC_STRING = """ ...@@ -53,34 +53,29 @@ EXAMPLE_DOC_STRING = """
class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin): class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
r""" r"""
Pipeline for text-to-image generation using "MultiDiffusion: Fusing Diffusion Paths for Controlled Image Pipeline for text-to-image generation using MultiDiffusion.
Generation".
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.). implemented for all pipelines (downloading, saving, running on a particular device, etc.).
To generate panorama-like images, be sure to pass the `width` parameter accordingly when using the pipeline. Our
recommendation for the `width` value is 2048. This is the default value of the `width` parameter for this pipeline.
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. The original work A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
on Multi Diffsion used the [`DDIMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
...@@ -129,17 +124,15 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -129,17 +124,15 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderM
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -470,70 +463,63 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -470,70 +463,63 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderM
circular_padding: bool = False, circular_padding: bool = False,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. height (`int`, *optional*, defaults to 512):
height (`int`, *optional*, defaults to 512:
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to 2048): width (`int`, *optional*, defaults to 2048):
The width in pixels of the generated image. The width is kept to a high number because the The width in pixels of the generated image. The width is kept high because the pipeline is supposed to
pipeline is supposed to be used for generating panorama-like images. generate panorama-like images.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
view_batch_size (`int`, *optional*, defaults to 1): view_batch_size (`int`, *optional*, defaults to 1):
The batch size to denoise splited views. For some GPUs with high performance, higher view batch size The batch size to denoise split views. For some GPUs with high performance, higher view batch size can
can speedup the generation and increase the VRAM usage. speed up the generation and increase the VRAM usage.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in `self.processor` in
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py). [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
circular_padding (`bool`, *optional*, defaults to `False`): circular_padding (`bool`, *optional*, defaults to `False`):
If set to True, circular padding is applied to ensure there are no stitching artifacts. Circular If set to `True`, circular padding is applied to ensure there are no stitching artifacts. Circular
padding allows the model to seamlessly generate a transition from the rightmost part of the image to padding allows the model to seamlessly generate a transition from the rightmost part of the image to
the leftmost part, maintaining consistency in a 360-degree sense. the leftmost part, maintaining consistency in a 360-degree sense.
...@@ -541,10 +527,10 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderM ...@@ -541,10 +527,10 @@ class StableDiffusionPanoramaPipeline(DiffusionPipeline, TextualInversionLoaderM
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
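To ground the `circular_padding` and panorama-specific arguments documented above, here is a minimal, hedged sketch of calling `StableDiffusionPanoramaPipeline`; the checkpoint name, prompt, and width are illustrative assumptions and are not part of this diff.

```py
# Illustrative sketch (not from this diff): a wide panorama with circular padding.
# The checkpoint, prompt, and width are assumed for demonstration.
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_ckpt = "stabilityai/stable-diffusion-2-base"  # assumed checkpoint
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_ckpt, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# circular_padding=True avoids a visible seam where the 360-degree panorama wraps around.
image = pipe(
    "a photo of the dolomites",
    width=2048,
    circular_padding=True,
).images[0]
image.save("panorama.png")
```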
...@@ -63,41 +63,35 @@ class StableDiffusionParadigmsPipeline( ...@@ -63,41 +63,35 @@ class StableDiffusionParadigmsPipeline(
DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin
): ):
r""" r"""
Parallelized version of StableDiffusionPipeline, based on the paper https://arxiv.org/abs/2305.16317 This pipeline Pipeline for text-to-image generation using a parallelized version of Stable Diffusion.
parallelizes the denoising steps to generate a single image faster (more akin to model parallelism).
Pipeline for text-to-image generation using Stable Diffusion. This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the The pipeline also inherits the following loading methods:
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
- [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
In addition the pipeline inherits the following loading methods: - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
- *Textual-Inversion*: [`loaders.TextualInversionLoaderMixin.load_textual_inversion`] - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files
- *LoRA*: [`loaders.LoraLoaderMixin.load_lora_weights`]
- *Ckpt*: [`loaders.FromSingleFileMixin.from_single_file`]
as well as the following saving methods:
- *LoRA*: [`loaders.LoraLoaderMixin.save_lora_weights`]
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
...@@ -149,17 +143,15 @@ class StableDiffusionParadigmsPipeline( ...@@ -149,17 +143,15 @@ class StableDiffusionParadigmsPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -167,17 +159,16 @@ class StableDiffusionParadigmsPipeline( ...@@ -167,17 +159,16 @@ class StableDiffusionParadigmsPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
def enable_vae_tiling(self): def enable_vae_tiling(self):
r""" r"""
Enable tiled VAE decoding. Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in processing larger images.
several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
""" """
self.vae.enable_tiling() self.vae.enable_tiling()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
def disable_vae_tiling(self): def disable_vae_tiling(self):
r""" r"""
Disable tiled VAE decoding. If `enable_vae_tiling` was previously invoked, this method will go back to Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_tiling() self.vae.disable_tiling()
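As a hedged illustration of the VAE memory helpers above, the sketch below toggles sliced and tiled VAE decoding on a pipeline instance; the checkpoint and prompt are assumptions for demonstration only.

```py
# Illustrative sketch (not from this diff): reduce VAE decode memory for larger
# batches or larger images. The checkpoint and prompt are assumed examples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_slicing()  # decode the batch slice by slice to save memory
pipe.enable_vae_tiling()   # decode/encode in tiles for very large images

images = pipe("a photo of an astronaut riding a horse", num_images_per_prompt=4).images

# Switch back to single-step decoding when the memory savings are not needed.
pipe.disable_vae_slicing()
pipe.disable_vae_tiling()
```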
...@@ -185,10 +176,10 @@ class StableDiffusionParadigmsPipeline( ...@@ -185,10 +176,10 @@ class StableDiffusionParadigmsPipeline(
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_model_cpu_offload
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
...@@ -499,82 +490,74 @@ class StableDiffusionParadigmsPipeline( ...@@ -499,82 +490,74 @@ class StableDiffusionParadigmsPipeline(
debug: bool = False, debug: bool = False,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
parallel (`int`, *optional*, defaults to 10): parallel (`int`, *optional*, defaults to 10):
The batch size to use when doing parallel sampling. More parallelism may lead to faster inference but The batch size to use when doing parallel sampling. More parallelism may lead to faster inference but
requires higher memory usage and also can require more total FLOPs. requires higher memory usage and can also require more total FLOPs.
tolerance (`float`, *optional*, defaults to 0.1): tolerance (`float`, *optional*, defaults to 0.1):
The error tolerance for determining when to slide the batch window forward for parallel sampling. Lower The error tolerance for determining when to slide the batch window forward for parallel sampling. Lower
tolerance usually leads to less/no degradation. Higher tolerance is faster but can risk degradation of tolerance usually leads to less or no degradation. Higher tolerance is faster but can risk degradation
sample quality. The tolerance is specified as a ratio of the scheduler's noise magnitude. of sample quality. The tolerance is specified as a ratio of the scheduler's noise magnitude.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
debug (`bool`, *optional*, defaults to `False`): debug (`bool`, *optional*, defaults to `False`):
Whether or not to run in debug mode. In debug mode, torch.cumsum is evaluated using the CPU. Whether or not to run in debug mode. In debug mode, `torch.cumsum` is evaluated using the CPU.
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
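To make the `parallel` and `tolerance` arguments above concrete, here is a minimal sketch of calling `StableDiffusionParadigmsPipeline`; the checkpoint and the `DDPMParallelScheduler` pairing are assumptions for demonstration and are not taken from this diff.

```py
# Illustrative sketch (not from this diff): parallel sampling with the `parallel`
# and `tolerance` arguments documented above. The checkpoint and scheduler pairing
# are assumptions for demonstration.
import torch
from diffusers import StableDiffusionParadigmsPipeline, DDPMParallelScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
scheduler = DDPMParallelScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A larger `parallel` value denoises more steps per iteration (faster, more memory);
# `tolerance` controls when the sliding batch window advances.
image = pipe(
    "a photo of an astronaut riding a horse on mars",
    parallel=10,
    tolerance=0.1,
).images[0]
```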
...@@ -94,28 +94,27 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin) ...@@ -94,28 +94,27 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin)
r""" r"""
Pipeline for text-to-image generation using Stable Diffusion. Pipeline for text-to-image generation using Stable Diffusion.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
[`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`]. [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
safety_checker ([`StableDiffusionSafetyChecker`]): safety_checker ([`StableDiffusionSafetyChecker`]):
Classification module that estimates whether generated images could be considered offensive or harmful. Classification module that estimates whether generated images could be considered offensive or harmful.
Please, refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details. Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
feature_extractor ([`CLIPImageProcessor`]): about a model's potential harms.
Model that extracts features from generated images to be used as inputs for the `safety_checker`. feature_extractor ([`~transformers.CLIPImageProcessor`]):
A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
""" """
_optional_components = ["safety_checker", "feature_extractor"] _optional_components = ["safety_checker", "feature_extractor"]
...@@ -148,17 +147,15 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin) ...@@ -148,17 +147,15 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin)
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
def enable_vae_slicing(self): def enable_vae_slicing(self):
r""" r"""
Enable sliced VAE decoding. Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several
steps. This is useful to save some memory and allow larger batch sizes.
""" """
self.vae.enable_slicing() self.vae.enable_slicing()
# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
def disable_vae_slicing(self): def disable_vae_slicing(self):
r""" r"""
Disable sliced VAE decoding. If `enable_vae_slicing` was previously invoked, this method will go back to Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
computing decoding in one step. computing decoding in one step.
""" """
self.vae.disable_slicing() self.vae.disable_slicing()
...@@ -455,77 +452,67 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin) ...@@ -455,77 +452,67 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, TextualInversionLoaderMixin)
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead. height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
The height in pixels of the generated image. The height in pixels of the generated image.
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor): width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
The width in pixels of the generated image. The width in pixels of the generated image.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
sag_scale (`float`, *optional*, defaults to 0.75): sag_scale (`float`, *optional*, defaults to 0.75):
SAG scale as defined in [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance] Chosen between [0, 1.0] for better quality.
(https://arxiv.org/abs/2210.00939). `sag_scale` is defined as `s_s` of equation (24) of SAG paper:
https://arxiv.org/pdf/2210.00939.pdf. Typically chosen between [0, 1.0] for better quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 0. Default height and width to unet # 0. Default height and width to unet
height = height or self.unet.config.sample_size * self.vae_scale_factor height = height or self.unet.config.sample_size * self.vae_scale_factor
......
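To ground the `sag_scale` argument documented above, here is a minimal, hedged sketch of calling `StableDiffusionSAGPipeline`; the checkpoint and prompt are illustrative assumptions, not part of this diff.

```py
# Illustrative sketch (not from this diff): self-attention guidance via `sag_scale`.
# The checkpoint and prompt are assumed for demonstration.
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# sag_scale is chosen in [0, 1.0]; 0 disables self-attention guidance, while
# guidance_scale still controls classifier-free guidance.
image = pipe(
    "a photo of a corgi wearing a party hat",
    guidance_scale=7.5,
    sag_scale=0.75,
).images[0]
```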
...@@ -69,22 +69,20 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi ...@@ -69,22 +69,20 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi
r""" r"""
Pipeline for text-guided image super-resolution using Stable Diffusion 2. Pipeline for text-guided image super-resolution using Stable Diffusion 2.
This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) implemented for all pipelines (downloading, saving, running on a particular device, etc.).
Args: Args:
vae ([`AutoencoderKL`]): vae ([`AutoencoderKL`]):
Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder ([`CLIPTextModel`]): text_encoder ([`~transformers.CLIPTextModel`]):
Frozen text-encoder. Stable Diffusion uses the text portion of Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
[CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically tokenizer ([`~transformers.CLIPTokenizer`]):
the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. A `CLIPTokenizer` to tokenize text.
tokenizer (`CLIPTokenizer`): unet ([`UNet2DConditionModel`]):
Tokenizer of class A `UNet2DConditionModel` to denoise the encoded image latents.
[CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
low_res_scheduler ([`SchedulerMixin`]): low_res_scheduler ([`SchedulerMixin`]):
A scheduler used to add initial noise to the low res conditioning image. It must be an instance of A scheduler used to add initial noise to the low resolution conditioning image. It must be an instance of
[`DDPMScheduler`]. [`DDPMScheduler`].
scheduler ([`SchedulerMixin`]): scheduler ([`SchedulerMixin`]):
A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
...@@ -142,10 +140,10 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi ...@@ -142,10 +140,10 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi
def enable_model_cpu_offload(self, gpu_id=0): def enable_model_cpu_offload(self, gpu_id=0):
r""" r"""
Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared Offload all models to CPU to reduce memory usage with a low impact on performance. Moves one whole model at a
to `enable_sequential_cpu_offload`, this method moves one whole model at a time to the GPU when its `forward` time to the GPU when its `forward` method is called, and the model remains in GPU until the next model runs.
method is called, and the model remains in GPU until the next model runs. Memory savings are lower than with Memory savings are lower than using `enable_sequential_cpu_offload`, but performance is much better due to the
`enable_sequential_cpu_offload`, but performance is much better due to the iterative execution of the `unet`. iterative execution of the `unet`.
""" """
if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"): if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
from accelerate import cpu_offload_with_hook from accelerate import cpu_offload_with_hook
...@@ -513,62 +511,54 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi ...@@ -513,62 +511,54 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi
cross_attention_kwargs: Optional[Dict[str, Any]] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None,
): ):
r""" r"""
Function invoked when calling the pipeline for generation. The call function to the pipeline for generation.
Args: Args:
prompt (`str` or `List[str]`, *optional*): prompt (`str` or `List[str]`, *optional*):
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`. The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
instead.
image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`): image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
`Image`, or tensor representing an image batch which will be upscaled. `Image` or tensor representing an image batch to be upscaled.
num_inference_steps (`int`, *optional*, defaults to 50): num_inference_steps (`int`, *optional*, defaults to 50):
The number of denoising steps. More denoising steps usually lead to a higher quality image at the The number of denoising steps. More denoising steps usually lead to a higher quality image at the
expense of slower inference. expense of slower inference.
guidance_scale (`float`, *optional*, defaults to 7.5): guidance_scale (`float`, *optional*, defaults to 7.5):
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). A higher guidance scale value encourages the model to generate images closely linked to the text
`guidance_scale` is defined as `w` of equation 2. of [Imagen `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
usually at the expense of lower image quality.
negative_prompt (`str` or `List[str]`, *optional*): negative_prompt (`str` or `List[str]`, *optional*):
The prompt or prompts not to guide the image generation. If not defined, one has to pass The prompt or prompts to guide what to not include in image generation. If not defined, you need to
`negative_prompt_embeds`. instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
is less than `1`).
num_images_per_prompt (`int`, *optional*, defaults to 1): num_images_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt. The number of images to generate per prompt.
eta (`float`, *optional*, defaults to 0.0): eta (`float`, *optional*, defaults to 0.0):
Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
[`schedulers.DDIMScheduler`], will be ignored for others. to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
generator (`torch.Generator`, *optional*): generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
to make generation deterministic. generation deterministic.
latents (`torch.FloatTensor`, *optional*): latents (`torch.FloatTensor`, *optional*):
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
tensor will ge generated by sampling using the supplied random `generator`. tensor is generated by sampling using the supplied random `generator`.
prompt_embeds (`torch.FloatTensor`, *optional*): prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
provided, text embeddings will be generated from `prompt` input argument. provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (`torch.FloatTensor`, *optional*): negative_prompt_embeds (`torch.FloatTensor`, *optional*):
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
argument.
output_type (`str`, *optional*, defaults to `"pil"`): output_type (`str`, *optional*, defaults to `"pil"`):
The output format of the generate image. Choose between The output format of the generated image. Choose between `PIL.Image` or `np.array`.
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
return_dict (`bool`, *optional*, defaults to `True`): return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
plain tuple. plain tuple.
callback (`Callable`, *optional*): callback (`Callable`, *optional*):
A function that will be called every `callback_steps` steps during inference. The function will be A function that is called every `callback_steps` steps during inference. The function is called with the
called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
callback_steps (`int`, *optional*, defaults to 1): callback_steps (`int`, *optional*, defaults to 1):
The frequency at which the `callback` function will be called. If not specified, the callback will be The frequency at which the `callback` function is called. If not specified, the callback is called at
called at every step. every step.
cross_attention_kwargs (`dict`, *optional*): cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
`self.processor` in [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
[diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
Examples: Examples:
```py ```py
...@@ -598,10 +588,10 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi ...@@ -598,10 +588,10 @@ class StableDiffusionUpscalePipeline(DiffusionPipeline, TextualInversionLoaderMi
Returns: Returns:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
[`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple. If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned,
When returning a tuple, the first element is a list with the generated images, and the second element is a otherwise a `tuple` is returned where the first element is a list with the generated images and the
list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work" second element is a list of `bool`s indicating whether the corresponding generated image contains
(nsfw) content, according to the `safety_checker`. "not-safe-for-work" (nsfw) content.
""" """
# 1. Check inputs # 1. Check inputs
......
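The `Examples:` block in the upscale docstring is collapsed in this diff; as a hedged stand-in, the sketch below shows how the `image` and `prompt` arguments documented above fit together. The checkpoint, image URL, and prompt are assumptions for demonstration, not part of this diff.

```py
# Illustrative sketch (not from this diff): text-guided 4x super-resolution.
# The checkpoint, image URL, and prompt are assumed for demonstration.
import torch
import requests
from io import BytesIO
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipeline = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# Assumed example input: a small low-resolution conditioning image.
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
low_res_img = Image.open(BytesIO(requests.get(url).content)).convert("RGB").resize((128, 128))

# `image` is the low-resolution conditioning input; the prompt guides the upscale.
upscaled = pipeline(prompt="a white cat", image=low_res_img).images[0]
upscaled.save("upscaled.png")
```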