Unverified Commit 04cd6adf authored by Sayak Paul, committed by GitHub

[Feat] add I2VGenXL for image-to-video generation (#6665)





---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
parent 66722dbe
@@ -284,6 +284,8 @@
      title: DiffEdit
    - local: api/pipelines/dit
      title: DiT
+   - local: api/pipelines/i2vgenxl
+     title: I2VGen-XL
    - local: api/pipelines/pix2pix
      title: InstructPix2Pix
    - local: api/pipelines/kandinsky
...
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# I2VGen-XL
[I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://hf.co/papers/2311.04145.pdf) is by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou.
The abstract from the paper is:
*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).*
The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. To learn more about reducing this pipeline's memory usage, refer to the [Reduce memory usage](../../using-diffusers/svd#reduce-memory-usage) section of the Stable Video Diffusion guide; a short sketch follows this tip.
</Tip>
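
As a quick illustration of the memory pointers above, the checkpoint can be loaded in half precision and its sub-models offloaded to the CPU. This is a minimal sketch that mirrors what the slow test in this PR does, not an exhaustive list of memory-saving options.

```python
import torch
from diffusers import I2VGenXLPipeline

# Load the released checkpoint in fp16.
pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)

# Move each sub-model to the GPU only when it is needed,
# trading some speed for a much lower peak VRAM footprint.
pipeline.enable_model_cpu_offload()
```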
Sample output with I2VGenXL:
<table>
<tr>
<td><center>
masterpiece, bestquality, sunset.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"
alt="library"
style="width: 300px;" />
</center></td>
</tr>
</table>
## Notes
* I2VGenXL always uses a `clip_skip` value of 1, which means it uses the penultimate layer's representations from the CLIP text encoder.
* It can generate videos whose quality is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
* Unlike SVD, it additionally accepts text prompts as input.
* It can generate higher-resolution videos.
* When using the [`DDIMScheduler`] (the default for this pipeline), fewer than 50 inference steps tends to produce poor results, so the sketch below uses 50 steps.
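
Below is a minimal end-to-end sketch of image-to-video generation with this pipeline. The checkpoint name and conditioning image URL are taken from the slow test added in this PR; the prompt, negative prompt, seed, and output filename are illustrative.

```python
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

# fp16 weights plus model CPU offload keep memory usage manageable.
pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

# Conditioning image (the same one the slow test uses).
image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)

prompt = "a cat slowly turning its head"  # illustrative
negative_prompt = "blurry, low resolution, distorted"  # illustrative
generator = torch.Generator("cpu").manual_seed(0)

# 50 DDIM steps: fewer steps tend to degrade quality (see the note above).
frames = pipeline(
    prompt=prompt,
    image=image,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    generator=generator,
).frames[0]

export_to_gif(frames, "i2v.gif")
```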
## I2VGenXLPipeline
[[autodoc]] I2VGenXLPipeline
- all
- __call__
## I2VGenXLPipelineOutput
[[autodoc]] pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput
\ No newline at end of file
This diff is collapsed.
@@ -80,6 +80,7 @@ else:
        "AutoencoderTiny",
        "ConsistencyDecoderVAE",
        "ControlNetModel",
+       "I2VGenXLUNet",
        "Kandinsky3UNet",
        "ModelMixin",
        "MotionAdapter",
@@ -217,6 +218,7 @@ else:
        "BlipDiffusionPipeline",
        "CLIPImageProjection",
        "CycleDiffusionPipeline",
+       "I2VGenXLPipeline",
        "IFImg2ImgPipeline",
        "IFImg2ImgSuperResolutionPipeline",
        "IFInpaintingPipeline",
@@ -462,6 +464,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AutoencoderTiny,
        ConsistencyDecoderVAE,
        ControlNetModel,
+       I2VGenXLUNet,
        Kandinsky3UNet,
        ModelMixin,
        MotionAdapter,
@@ -578,6 +581,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AudioLDMPipeline,
        CLIPImageProjection,
        CycleDiffusionPipeline,
+       I2VGenXLPipeline,
        IFImg2ImgPipeline,
        IFImg2ImgSuperResolutionPipeline,
        IFInpaintingPipeline,
...
@@ -43,6 +43,7 @@ if is_torch_available():
    _import_structure["unets.unet_2d"] = ["UNet2DModel"]
    _import_structure["unets.unet_2d_condition"] = ["UNet2DConditionModel"]
    _import_structure["unets.unet_3d_condition"] = ["UNet3DConditionModel"]
+   _import_structure["unets.unet_i2vgen_xl"] = ["I2VGenXLUNet"]
    _import_structure["unets.unet_kandinsky3"] = ["Kandinsky3UNet"]
    _import_structure["unets.unet_motion_model"] = ["MotionAdapter", "UNetMotionModel"]
    _import_structure["unets.unet_spatio_temporal_condition"] = ["UNetSpatioTemporalConditionModel"]
@@ -76,6 +77,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        TransformerTemporalModel,
    )
    from .unets import (
+       I2VGenXLUNet,
        Kandinsky3UNet,
        MotionAdapter,
        UNet1DModel,
...
@@ -143,7 +143,7 @@ class BasicTransformerBlock(nn.Module):
        double_self_attention: bool = False,
        upcast_attention: bool = False,
        norm_elementwise_affine: bool = True,
-       norm_type: str = "layer_norm",  # 'layer_norm', 'ada_norm', 'ada_norm_zero', 'ada_norm_single'
+       norm_type: str = "layer_norm",  # 'layer_norm', 'ada_norm', 'ada_norm_zero', 'ada_norm_single', 'layer_norm_i2vgen'
        norm_eps: float = 1e-5,
        final_dropout: bool = False,
        attention_type: str = "default",
@@ -158,18 +158,15 @@
        super().__init__()
        self.only_cross_attention = only_cross_attention

-       self.use_ada_layer_norm_zero = (num_embeds_ada_norm is not None) and norm_type == "ada_norm_zero"
-       self.use_ada_layer_norm = (num_embeds_ada_norm is not None) and norm_type == "ada_norm"
-       self.use_ada_layer_norm_single = norm_type == "ada_norm_single"
-       self.use_layer_norm = norm_type == "layer_norm"
-       self.use_ada_layer_norm_continuous = norm_type == "ada_norm_continuous"
-
        if norm_type in ("ada_norm", "ada_norm_zero") and num_embeds_ada_norm is None:
            raise ValueError(
                f"`norm_type` is set to {norm_type}, but `num_embeds_ada_norm` is not defined. Please make sure to"
                f" define `num_embeds_ada_norm` if setting `norm_type` to {norm_type}."
            )

+       self.norm_type = norm_type
+       self.num_embeds_ada_norm = num_embeds_ada_norm
+
        if positional_embeddings and (num_positional_embeddings is None):
            raise ValueError(
                "If `positional_embedding` type is defined, `num_positition_embeddings` must also be defined."
@@ -182,11 +179,11 @@
        # Define 3 blocks. Each block has its own normalization layer.
        # 1. Self-Attn
-       if self.use_ada_layer_norm:
+       if norm_type == "ada_norm":
            self.norm1 = AdaLayerNorm(dim, num_embeds_ada_norm)
-       elif self.use_ada_layer_norm_zero:
+       elif norm_type == "ada_norm_zero":
            self.norm1 = AdaLayerNormZero(dim, num_embeds_ada_norm)
-       elif self.use_ada_layer_norm_continuous:
+       elif norm_type == "ada_norm_continuous":
            self.norm1 = AdaLayerNormContinuous(
                dim,
                ada_norm_continous_conditioning_embedding_dim,
@@ -214,9 +211,9 @@
            # We currently only use AdaLayerNormZero for self attention where there will only be one attention block.
            # I.e. the number of returned modulation chunks from AdaLayerZero would not make sense if returned during
            # the second cross attention block.
-           if self.use_ada_layer_norm:
+           if norm_type == "ada_norm":
                self.norm2 = AdaLayerNorm(dim, num_embeds_ada_norm)
-           elif self.use_ada_layer_norm_continuous:
+           elif norm_type == "ada_norm_continuous":
                self.norm2 = AdaLayerNormContinuous(
                    dim,
                    ada_norm_continous_conditioning_embedding_dim,
@@ -243,7 +240,7 @@
            self.attn2 = None

        # 3. Feed-forward
-       if self.use_ada_layer_norm_continuous:
+       if norm_type == "ada_norm_continuous":
            self.norm3 = AdaLayerNormContinuous(
                dim,
                ada_norm_continous_conditioning_embedding_dim,
@@ -252,8 +249,11 @@
                ada_norm_bias,
                "layer_norm",
            )
-       elif not self.use_ada_layer_norm_single:
+       elif norm_type in ["ada_norm_zero", "ada_norm", "layer_norm", "ada_norm_continuous"]:
            self.norm3 = nn.LayerNorm(dim, norm_eps, norm_elementwise_affine)
+       elif norm_type == "layer_norm_i2vgen":
+           self.norm3 = None

        self.ff = FeedForward(
            dim,
@@ -269,7 +269,7 @@
            self.fuser = GatedSelfAttentionDense(dim, cross_attention_dim, num_attention_heads, attention_head_dim)

        # 5. Scale-shift for PixArt-Alpha.
-       if self.use_ada_layer_norm_single:
+       if norm_type == "ada_norm_single":
            self.scale_shift_table = nn.Parameter(torch.randn(6, dim) / dim**0.5)

        # let chunk size default to None
@@ -296,17 +296,17 @@
        # 0. Self-Attention
        batch_size = hidden_states.shape[0]

-       if self.use_ada_layer_norm:
+       if self.norm_type == "ada_norm":
            norm_hidden_states = self.norm1(hidden_states, timestep)
-       elif self.use_ada_layer_norm_zero:
+       elif self.norm_type == "ada_norm_zero":
            norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
                hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype
            )
-       elif self.use_layer_norm:
+       elif self.norm_type in ["layer_norm", "layer_norm_i2vgen"]:
            norm_hidden_states = self.norm1(hidden_states)
-       elif self.use_ada_layer_norm_continuous:
+       elif self.norm_type == "ada_norm_continuous":
            norm_hidden_states = self.norm1(hidden_states, added_cond_kwargs["pooled_text_emb"])
-       elif self.use_ada_layer_norm_single:
+       elif self.norm_type == "ada_norm_single":
            shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
                self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1)
            ).chunk(6, dim=1)
@@ -332,9 +332,9 @@
            attention_mask=attention_mask,
            **cross_attention_kwargs,
        )
-       if self.use_ada_layer_norm_zero:
+       if self.norm_type == "ada_norm_zero":
            attn_output = gate_msa.unsqueeze(1) * attn_output
-       elif self.use_ada_layer_norm_single:
+       elif self.norm_type == "ada_norm_single":
            attn_output = gate_msa * attn_output

        hidden_states = attn_output + hidden_states
@@ -347,20 +347,20 @@
        # 3. Cross-Attention
        if self.attn2 is not None:
-           if self.use_ada_layer_norm:
+           if self.norm_type == "ada_norm":
                norm_hidden_states = self.norm2(hidden_states, timestep)
-           elif self.use_ada_layer_norm_zero or self.use_layer_norm:
+           elif self.norm_type in ["ada_norm_zero", "layer_norm", "layer_norm_i2vgen"]:
                norm_hidden_states = self.norm2(hidden_states)
-           elif self.use_ada_layer_norm_single:
+           elif self.norm_type == "ada_norm_single":
                # For PixArt norm2 isn't applied here:
                # https://github.com/PixArt-alpha/PixArt-alpha/blob/0f55e922376d8b797edd44d25d0e7464b260dcab/diffusion/model/nets/PixArtMS.py#L70C1-L76C103
                norm_hidden_states = hidden_states
-           elif self.use_ada_layer_norm_continuous:
+           elif self.norm_type == "ada_norm_continuous":
                norm_hidden_states = self.norm2(hidden_states, added_cond_kwargs["pooled_text_emb"])
            else:
                raise ValueError("Incorrect norm")

-           if self.pos_embed is not None and self.use_ada_layer_norm_single is False:
+           if self.pos_embed is not None and self.norm_type != "ada_norm_single":
                norm_hidden_states = self.pos_embed(norm_hidden_states)

            attn_output = self.attn2(
@@ -372,15 +372,16 @@
            hidden_states = attn_output + hidden_states

        # 4. Feed-forward
-       if self.use_ada_layer_norm_continuous:
+       # i2vgen doesn't have this norm 🤷‍♂️
+       if self.norm_type == "ada_norm_continuous":
            norm_hidden_states = self.norm3(hidden_states, added_cond_kwargs["pooled_text_emb"])
-       elif not self.use_ada_layer_norm_single:
+       elif not self.norm_type == "ada_norm_single":
            norm_hidden_states = self.norm3(hidden_states)

-       if self.use_ada_layer_norm_zero:
+       if self.norm_type == "ada_norm_zero":
            norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]

-       if self.use_ada_layer_norm_single:
+       if self.norm_type == "ada_norm_single":
            norm_hidden_states = self.norm2(hidden_states)
            norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp
@@ -392,9 +393,9 @@
        else:
            ff_output = self.ff(norm_hidden_states, scale=lora_scale)

-       if self.use_ada_layer_norm_zero:
+       if self.norm_type == "ada_norm_zero":
            ff_output = gate_mlp.unsqueeze(1) * ff_output
-       elif self.use_ada_layer_norm_single:
+       elif self.norm_type == "ada_norm_single":
            ff_output = gate_mlp * ff_output

        hidden_states = ff_output + hidden_states
...
@@ -6,6 +6,7 @@ if is_torch_available():
    from .unet_2d import UNet2DModel
    from .unet_2d_condition import UNet2DConditionModel
    from .unet_3d_condition import UNet3DConditionModel
+   from .unet_i2vgen_xl import I2VGenXLUNet
    from .unet_kandinsky3 import Kandinsky3UNet
    from .unet_motion_model import MotionAdapter, UNetMotionModel
    from .unet_spatio_temporal_condition import UNetSpatioTemporalConditionModel
...
This diff is collapsed.
@@ -220,6 +220,7 @@ else:
        "TextToVideoZeroSDXLPipeline",
        "VideoToVideoSDPipeline",
    ]
+   _import_structure["i2vgen_xl"] = ["I2VGenXLPipeline"]
    _import_structure["unclip"] = ["UnCLIPImageVariationPipeline", "UnCLIPPipeline"]
    _import_structure["unidiffuser"] = [
        "ImageTextPipelineOutput",
@@ -384,6 +385,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
            VersatileDiffusionTextToImagePipeline,
            VQDiffusionPipeline,
        )
+       from .i2vgen_xl import I2VGenXLPipeline
        from .kandinsky import (
            KandinskyCombinedPipeline,
            KandinskyImg2ImgCombinedPipeline,
...
from typing import TYPE_CHECKING
from ...utils import (
DIFFUSERS_SLOW_IMPORT,
OptionalDependencyNotAvailable,
_LazyModule,
get_objects_from_module,
is_torch_available,
is_transformers_available,
)
_dummy_objects = {}
_import_structure = {}
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils import dummy_torch_and_transformers_objects # noqa F403
_dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
_import_structure["pipeline_i2vgen_xl"] = ["I2VGenXLPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import * # noqa F403
else:
from .pipeline_i2vgen_xl import I2VGenXLPipeline
else:
import sys
sys.modules[__name__] = _LazyModule(
__name__,
globals()["__file__"],
_import_structure,
module_spec=__spec__,
)
for name, value in _dummy_objects.items():
setattr(sys.modules[__name__], name, value)
This diff is collapsed.
@@ -92,6 +92,21 @@ class ControlNetModel(metaclass=DummyObject):
        requires_backends(cls, ["torch"])


+class I2VGenXLUNet(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch"])
+
+
class Kandinsky3UNet(metaclass=DummyObject):
    _backends = ["torch"]
...
@@ -197,6 +197,21 @@ class CycleDiffusionPipeline(metaclass=DummyObject):
        requires_backends(cls, ["torch", "transformers"])


+class I2VGenXLPipeline(metaclass=DummyObject):
+    _backends = ["torch", "transformers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch", "transformers"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+
class IFImg2ImgPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]
...
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import random
import unittest
import numpy as np
import torch
from transformers import (
CLIPImageProcessor,
CLIPTextConfig,
CLIPTextModel,
CLIPTokenizer,
CLIPVisionConfig,
CLIPVisionModelWithProjection,
)
from diffusers import (
AutoencoderKL,
DDIMScheduler,
I2VGenXLPipeline,
)
from diffusers.models.unets import I2VGenXLUNet
from diffusers.utils import is_xformers_available, load_image
from diffusers.utils.testing_utils import (
enable_full_determinism,
floats_tensor,
numpy_cosine_similarity_distance,
print_tensor_test,
require_torch_gpu,
skip_mps,
slow,
torch_device,
)
from ..test_pipelines_common import PipelineTesterMixin
enable_full_determinism()
@skip_mps
class I2VGenXLPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = I2VGenXLPipeline
params = frozenset(["prompt", "negative_prompt", "image"])
batch_params = frozenset(["prompt", "negative_prompt", "image", "generator"])
# No `output_type`.
required_optional_params = frozenset(["num_inference_steps", "generator", "latents", "return_dict"])
def get_dummy_components(self):
torch.manual_seed(0)
scheduler = DDIMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
)
torch.manual_seed(0)
unet = I2VGenXLUNet(
block_out_channels=(4, 8),
layers_per_block=1,
sample_size=32,
in_channels=4,
out_channels=4,
down_block_types=("CrossAttnDownBlock3D", "DownBlock3D"),
up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
cross_attention_dim=4,
num_attention_heads=4,
norm_num_groups=2,
)
torch.manual_seed(0)
vae = AutoencoderKL(
block_out_channels=(8,),
in_channels=3,
out_channels=3,
down_block_types=["DownEncoderBlock2D"],
up_block_types=["UpDecoderBlock2D"],
latent_channels=4,
sample_size=32,
norm_num_groups=2,
)
torch.manual_seed(0)
text_encoder_config = CLIPTextConfig(
bos_token_id=0,
eos_token_id=2,
hidden_size=4,
intermediate_size=16,
layer_norm_eps=1e-05,
num_attention_heads=2,
num_hidden_layers=2,
pad_token_id=1,
vocab_size=1000,
hidden_act="gelu",
projection_dim=32,
)
text_encoder = CLIPTextModel(text_encoder_config)
tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
torch.manual_seed(0)
vision_encoder_config = CLIPVisionConfig(
hidden_size=4,
projection_dim=4,
num_hidden_layers=2,
num_attention_heads=2,
image_size=32,
intermediate_size=16,
patch_size=1,
)
image_encoder = CLIPVisionModelWithProjection(vision_encoder_config)
torch.manual_seed(0)
feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
components = {
"unet": unet,
"scheduler": scheduler,
"vae": vae,
"text_encoder": text_encoder,
"image_encoder": image_encoder,
"tokenizer": tokenizer,
"feature_extractor": feature_extractor,
}
return components
def get_dummy_inputs(self, device, seed=0):
if str(device).startswith("mps"):
generator = torch.manual_seed(seed)
else:
generator = torch.Generator(device=device).manual_seed(seed)
input_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
inputs = {
"prompt": "A painting of a squirrel eating a burger",
"image": input_image,
"generator": generator,
"num_inference_steps": 2,
"guidance_scale": 6.0,
"output_type": "pt",
"num_frames": 4,
"width": 32,
"height": 32,
}
return inputs
def test_text_to_video_default_case(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
inputs["output_type"] = "np"
frames = pipe(**inputs).frames
image_slice = frames[0][0][-3:, -3:, -1]
assert frames[0][0].shape == (32, 32, 3)
expected_slice = np.array([0.5146, 0.6525, 0.6032, 0.5204, 0.5675, 0.4125, 0.3016, 0.5172, 0.4095])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
def test_save_load_local(self):
super().test_save_load_local(expected_max_difference=0.006)
def test_sequential_cpu_offload_forward_pass(self):
super().test_sequential_cpu_offload_forward_pass(expected_max_diff=0.008)
def test_dict_tuple_outputs_equivalent(self):
super().test_dict_tuple_outputs_equivalent(expected_max_difference=0.008)
def test_save_load_optional_components(self):
super().test_save_load_optional_components(expected_max_difference=0.008)
@unittest.skip("Deprecated functionality")
def test_attention_slicing_forward_pass(self):
pass
@unittest.skipIf(
torch_device != "cuda" or not is_xformers_available(),
reason="XFormers attention is only available with CUDA and `xformers` installed",
)
def test_xformers_attention_forwardGenerator_pass(self):
self._test_xformers_attention_forwardGenerator_pass(test_mean_pixel_difference=False, expected_max_diff=1e-2)
def test_inference_batch_single_identical(self):
super().test_inference_batch_single_identical(batch_size=2, expected_max_diff=0.008)
def test_model_cpu_offload_forward_pass(self):
super().test_model_cpu_offload_forward_pass(expected_max_diff=0.008)
def test_num_videos_per_prompt(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
inputs["output_type"] = "np"
frames = pipe(**inputs, num_videos_per_prompt=2).frames
assert frames.shape == (2, 4, 32, 32, 3)
assert frames[0][0].shape == (32, 32, 3)
image_slice = frames[0][0][-3:, -3:, -1]
expected_slice = np.array([0.5146, 0.6525, 0.6032, 0.5204, 0.5675, 0.4125, 0.3016, 0.5172, 0.4095])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
@slow
@require_torch_gpu
class I2VGenXLPipelineSlowTests(unittest.TestCase):
def tearDown(self):
# clean up the VRAM after each test
super().tearDown()
gc.collect()
torch.cuda.empty_cache()
def test_i2vgen_xl(self):
pipe = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(torch_device)
pipe.enable_model_cpu_offload()
pipe.set_progress_bar_config(disable=None)
image = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
generator = torch.Generator("cpu").manual_seed(0)
num_frames = 3
output = pipe(
image=image,
prompt="my cat",
num_frames=num_frames,
generator=generator,
num_inference_steps=3,
output_type="np",
)
image = output.frames[0]
assert image.shape == (num_frames, 704, 1280, 3)
image_slice = image[0, -3:, -3:, -1]
print_tensor_test(image_slice.flatten())
expected_slice = np.array([0.5482, 0.6244, 0.6274, 0.4584, 0.5935, 0.5937, 0.4579, 0.5767, 0.5892])
assert numpy_cosine_similarity_distance(image_slice.flatten(), expected_slice.flatten()) < 1e-3