Unverified Commit 04cd6adf authored by Sayak Paul, committed by GitHub

[Feat] add I2VGenXL for image-to-video generation (#6665)





---------
Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
parent 66722dbe
@@ -284,6 +284,8 @@
      title: DiffEdit
    - local: api/pipelines/dit
      title: DiT
+   - local: api/pipelines/i2vgenxl
+     title: I2VGen-XL
    - local: api/pipelines/pix2pix
      title: InstructPix2Pix
    - local: api/pipelines/kandinsky
...
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# I2VGen-XL
[I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://hf.co/papers/2311.04145.pdf) is by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou.
The abstract from the paper is:
*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).*
The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. To learn more about reducing this pipeline's memory usage, refer to the [Reduce memory usage](../../using-diffusers/svd#reduce-memory-usage) section of the Stable Video Diffusion guide; a short sketch follows this tip.
</Tip>
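
As a quick illustration of the memory pointers above, the checkpoint can be loaded in half precision and its sub-models offloaded to the CPU. This is a minimal sketch that mirrors what the slow test in this PR does, not an exhaustive list of memory-saving options.

```python
import torch
from diffusers import I2VGenXLPipeline

# Load the released checkpoint in fp16.
pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)

# Move each sub-model to the GPU only when it is needed,
# trading some speed for a much lower peak VRAM footprint.
pipeline.enable_model_cpu_offload()
```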
Sample output with I2VGenXL:
<table>
<tr>
<td><center>
masterpiece, bestquality, sunset.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"
alt="library"
style="width: 300px;" />
</center></td>
</tr>
</table>
## Notes
* I2VGenXL always uses a `clip_skip` value of 1, which means it uses the penultimate layer's representations from the CLIP text encoder.
* It can generate videos whose quality is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
* Unlike SVD, it additionally accepts text prompts as input.
* It can generate higher-resolution videos.
* When using the [`DDIMScheduler`] (the default for this pipeline), fewer than 50 inference steps tends to produce poor results, so the sketch below uses 50 steps.
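
Below is a minimal end-to-end sketch of image-to-video generation with this pipeline. The checkpoint name and conditioning image URL are taken from the slow test added in this PR; the prompt, negative prompt, seed, and output filename are illustrative.

```python
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

# fp16 weights plus model CPU offload keep memory usage manageable.
pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

# Conditioning image (the same one the slow test uses).
image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)

prompt = "a cat slowly turning its head"  # illustrative
negative_prompt = "blurry, low resolution, distorted"  # illustrative
generator = torch.Generator("cpu").manual_seed(0)

# 50 DDIM steps: fewer steps tend to degrade quality (see the note above).
frames = pipeline(
    prompt=prompt,
    image=image,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    generator=generator,
).frames[0]

export_to_gif(frames, "i2v.gif")
```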
## I2VGenXLPipeline
[[autodoc]] I2VGenXLPipeline
- all
- __call__
## I2VGenXLPipelineOutput
[[autodoc]] pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput
\ No newline at end of file
This diff is collapsed.
@@ -80,6 +80,7 @@ else:
        "AutoencoderTiny",
        "ConsistencyDecoderVAE",
        "ControlNetModel",
+       "I2VGenXLUNet",
        "Kandinsky3UNet",
        "ModelMixin",
        "MotionAdapter",
@@ -217,6 +218,7 @@ else:
        "BlipDiffusionPipeline",
        "CLIPImageProjection",
        "CycleDiffusionPipeline",
+       "I2VGenXLPipeline",
        "IFImg2ImgPipeline",
        "IFImg2ImgSuperResolutionPipeline",
        "IFInpaintingPipeline",
@@ -462,6 +464,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AutoencoderTiny,
        ConsistencyDecoderVAE,
        ControlNetModel,
+       I2VGenXLUNet,
        Kandinsky3UNet,
        ModelMixin,
        MotionAdapter,
@@ -578,6 +581,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        AudioLDMPipeline,
        CLIPImageProjection,
        CycleDiffusionPipeline,
+       I2VGenXLPipeline,
        IFImg2ImgPipeline,
        IFImg2ImgSuperResolutionPipeline,
        IFInpaintingPipeline,
...
@@ -43,6 +43,7 @@ if is_torch_available():
    _import_structure["unets.unet_2d"] = ["UNet2DModel"]
    _import_structure["unets.unet_2d_condition"] = ["UNet2DConditionModel"]
    _import_structure["unets.unet_3d_condition"] = ["UNet3DConditionModel"]
+   _import_structure["unets.unet_i2vgen_xl"] = ["I2VGenXLUNet"]
    _import_structure["unets.unet_kandinsky3"] = ["Kandinsky3UNet"]
    _import_structure["unets.unet_motion_model"] = ["MotionAdapter", "UNetMotionModel"]
    _import_structure["unets.unet_spatio_temporal_condition"] = ["UNetSpatioTemporalConditionModel"]
@@ -76,6 +77,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        TransformerTemporalModel,
    )
    from .unets import (
+       I2VGenXLUNet,
        Kandinsky3UNet,
        MotionAdapter,
        UNet1DModel,
...
@@ -143,7 +143,7 @@ class BasicTransformerBlock(nn.Module):
        double_self_attention: bool = False,
        upcast_attention: bool = False,
        norm_elementwise_affine: bool = True,
-       norm_type: str = "layer_norm",  # 'layer_norm', 'ada_norm', 'ada_norm_zero', 'ada_norm_single'
+       norm_type: str = "layer_norm",  # 'layer_norm', 'ada_norm', 'ada_norm_zero', 'ada_norm_single', 'layer_norm_i2vgen'
        norm_eps: float = 1e-5,
        final_dropout: bool = False,
        attention_type: str = "default",
@@ -158,18 +158,15 @@
        super().__init__()
        self.only_cross_attention = only_cross_attention

-       self.use_ada_layer_norm_zero = (num_embeds_ada_norm is not None) and norm_type == "ada_norm_zero"
-       self.use_ada_layer_norm = (num_embeds_ada_norm is not None) and norm_type == "ada_norm"
-       self.use_ada_layer_norm_single = norm_type == "ada_norm_single"
-       self.use_layer_norm = norm_type == "layer_norm"
-       self.use_ada_layer_norm_continuous = norm_type == "ada_norm_continuous"
-
        if norm_type in ("ada_norm", "ada_norm_zero") and num_embeds_ada_norm is None:
            raise ValueError(
                f"`norm_type` is set to {norm_type}, but `num_embeds_ada_norm` is not defined. Please make sure to"
                f" define `num_embeds_ada_norm` if setting `norm_type` to {norm_type}."
            )

+       self.norm_type = norm_type
+       self.num_embeds_ada_norm = num_embeds_ada_norm
+
        if positional_embeddings and (num_positional_embeddings is None):
            raise ValueError(
                "If `positional_embedding` type is defined, `num_positition_embeddings` must also be defined."
@@ -182,11 +179,11 @@
        # Define 3 blocks. Each block has its own normalization layer.
        # 1. Self-Attn
-       if self.use_ada_layer_norm:
+       if norm_type == "ada_norm":
            self.norm1 = AdaLayerNorm(dim, num_embeds_ada_norm)
-       elif self.use_ada_layer_norm_zero:
+       elif norm_type == "ada_norm_zero":
            self.norm1 = AdaLayerNormZero(dim, num_embeds_ada_norm)
-       elif self.use_ada_layer_norm_continuous:
+       elif norm_type == "ada_norm_continuous":
            self.norm1 = AdaLayerNormContinuous(
                dim,
                ada_norm_continous_conditioning_embedding_dim,
@@ -214,9 +211,9 @@
            # We currently only use AdaLayerNormZero for self attention where there will only be one attention block.
            # I.e. the number of returned modulation chunks from AdaLayerZero would not make sense if returned during
            # the second cross attention block.
-           if self.use_ada_layer_norm:
+           if norm_type == "ada_norm":
                self.norm2 = AdaLayerNorm(dim, num_embeds_ada_norm)
-           elif self.use_ada_layer_norm_continuous:
+           elif norm_type == "ada_norm_continuous":
                self.norm2 = AdaLayerNormContinuous(
                    dim,
                    ada_norm_continous_conditioning_embedding_dim,
@@ -243,7 +240,7 @@
            self.attn2 = None

        # 3. Feed-forward
-       if self.use_ada_layer_norm_continuous:
+       if norm_type == "ada_norm_continuous":
            self.norm3 = AdaLayerNormContinuous(
                dim,
                ada_norm_continous_conditioning_embedding_dim,
@@ -252,8 +249,11 @@
                ada_norm_bias,
                "layer_norm",
            )
-       elif not self.use_ada_layer_norm_single:
+       elif norm_type in ["ada_norm_zero", "ada_norm", "layer_norm", "ada_norm_continuous"]:
            self.norm3 = nn.LayerNorm(dim, norm_eps, norm_elementwise_affine)
+       elif norm_type == "layer_norm_i2vgen":
+           self.norm3 = None

        self.ff = FeedForward(
            dim,
@@ -269,7 +269,7 @@
            self.fuser = GatedSelfAttentionDense(dim, cross_attention_dim, num_attention_heads, attention_head_dim)

        # 5. Scale-shift for PixArt-Alpha.
-       if self.use_ada_layer_norm_single:
+       if norm_type == "ada_norm_single":
            self.scale_shift_table = nn.Parameter(torch.randn(6, dim) / dim**0.5)

        # let chunk size default to None
@@ -296,17 +296,17 @@
        # 0. Self-Attention
        batch_size = hidden_states.shape[0]

-       if self.use_ada_layer_norm:
+       if self.norm_type == "ada_norm":
            norm_hidden_states = self.norm1(hidden_states, timestep)
-       elif self.use_ada_layer_norm_zero:
+       elif self.norm_type == "ada_norm_zero":
            norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
                hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype
            )
-       elif self.use_layer_norm:
+       elif self.norm_type in ["layer_norm", "layer_norm_i2vgen"]:
            norm_hidden_states = self.norm1(hidden_states)
-       elif self.use_ada_layer_norm_continuous:
+       elif self.norm_type == "ada_norm_continuous":
            norm_hidden_states = self.norm1(hidden_states, added_cond_kwargs["pooled_text_emb"])
-       elif self.use_ada_layer_norm_single:
+       elif self.norm_type == "ada_norm_single":
            shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
                self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1)
            ).chunk(6, dim=1)
@@ -332,9 +332,9 @@
            attention_mask=attention_mask,
            **cross_attention_kwargs,
        )
-       if self.use_ada_layer_norm_zero:
+       if self.norm_type == "ada_norm_zero":
            attn_output = gate_msa.unsqueeze(1) * attn_output
-       elif self.use_ada_layer_norm_single:
+       elif self.norm_type == "ada_norm_single":
            attn_output = gate_msa * attn_output

        hidden_states = attn_output + hidden_states
@@ -347,20 +347,20 @@
        # 3. Cross-Attention
        if self.attn2 is not None:
-           if self.use_ada_layer_norm:
+           if self.norm_type == "ada_norm":
                norm_hidden_states = self.norm2(hidden_states, timestep)
-           elif self.use_ada_layer_norm_zero or self.use_layer_norm:
+           elif self.norm_type in ["ada_norm_zero", "layer_norm", "layer_norm_i2vgen"]:
                norm_hidden_states = self.norm2(hidden_states)
-           elif self.use_ada_layer_norm_single:
+           elif self.norm_type == "ada_norm_single":
                # For PixArt norm2 isn't applied here:
                # https://github.com/PixArt-alpha/PixArt-alpha/blob/0f55e922376d8b797edd44d25d0e7464b260dcab/diffusion/model/nets/PixArtMS.py#L70C1-L76C103
                norm_hidden_states = hidden_states
-           elif self.use_ada_layer_norm_continuous:
+           elif self.norm_type == "ada_norm_continuous":
                norm_hidden_states = self.norm2(hidden_states, added_cond_kwargs["pooled_text_emb"])
            else:
                raise ValueError("Incorrect norm")

-           if self.pos_embed is not None and self.use_ada_layer_norm_single is False:
+           if self.pos_embed is not None and self.norm_type != "ada_norm_single":
                norm_hidden_states = self.pos_embed(norm_hidden_states)

            attn_output = self.attn2(
@@ -372,15 +372,16 @@
            hidden_states = attn_output + hidden_states

        # 4. Feed-forward
-       if self.use_ada_layer_norm_continuous:
+       # i2vgen doesn't have this norm 🤷‍♂️
+       if self.norm_type == "ada_norm_continuous":
            norm_hidden_states = self.norm3(hidden_states, added_cond_kwargs["pooled_text_emb"])
-       elif not self.use_ada_layer_norm_single:
+       elif not self.norm_type == "ada_norm_single":
            norm_hidden_states = self.norm3(hidden_states)

-       if self.use_ada_layer_norm_zero:
+       if self.norm_type == "ada_norm_zero":
            norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]

-       if self.use_ada_layer_norm_single:
+       if self.norm_type == "ada_norm_single":
            norm_hidden_states = self.norm2(hidden_states)
            norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp
@@ -392,9 +393,9 @@
        else:
            ff_output = self.ff(norm_hidden_states, scale=lora_scale)

-       if self.use_ada_layer_norm_zero:
+       if self.norm_type == "ada_norm_zero":
            ff_output = gate_mlp.unsqueeze(1) * ff_output
-       elif self.use_ada_layer_norm_single:
+       elif self.norm_type == "ada_norm_single":
            ff_output = gate_mlp * ff_output

        hidden_states = ff_output + hidden_states
...
@@ -6,6 +6,7 @@ if is_torch_available():
    from .unet_2d import UNet2DModel
    from .unet_2d_condition import UNet2DConditionModel
    from .unet_3d_condition import UNet3DConditionModel
+   from .unet_i2vgen_xl import I2VGenXLUNet
    from .unet_kandinsky3 import Kandinsky3UNet
    from .unet_motion_model import MotionAdapter, UNetMotionModel
    from .unet_spatio_temporal_condition import UNetSpatioTemporalConditionModel
...
This diff is collapsed.
@@ -220,6 +220,7 @@ else:
        "TextToVideoZeroSDXLPipeline",
        "VideoToVideoSDPipeline",
    ]
+   _import_structure["i2vgen_xl"] = ["I2VGenXLPipeline"]
    _import_structure["unclip"] = ["UnCLIPImageVariationPipeline", "UnCLIPPipeline"]
    _import_structure["unidiffuser"] = [
        "ImageTextPipelineOutput",
@@ -384,6 +385,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
            VersatileDiffusionTextToImagePipeline,
            VQDiffusionPipeline,
        )
+       from .i2vgen_xl import I2VGenXLPipeline
        from .kandinsky import (
            KandinskyCombinedPipeline,
            KandinskyImg2ImgCombinedPipeline,
...
from typing import TYPE_CHECKING
from ...utils import (
DIFFUSERS_SLOW_IMPORT,
OptionalDependencyNotAvailable,
_LazyModule,
get_objects_from_module,
is_torch_available,
is_transformers_available,
)
_dummy_objects = {}
_import_structure = {}
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils import dummy_torch_and_transformers_objects # noqa F403
_dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
_import_structure["pipeline_i2vgen_xl"] = ["I2VGenXLPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import * # noqa F403
else:
from .pipeline_i2vgen_xl import I2VGenXLPipeline
else:
import sys
sys.modules[__name__] = _LazyModule(
__name__,
globals()["__file__"],
_import_structure,
module_spec=__spec__,
)
for name, value in _dummy_objects.items():
setattr(sys.modules[__name__], name, value)
This diff is collapsed.
@@ -92,6 +92,21 @@ class ControlNetModel(metaclass=DummyObject):
        requires_backends(cls, ["torch"])


+class I2VGenXLUNet(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch"])
+
+
class Kandinsky3UNet(metaclass=DummyObject):
    _backends = ["torch"]
...
@@ -197,6 +197,21 @@ class CycleDiffusionPipeline(metaclass=DummyObject):
        requires_backends(cls, ["torch", "transformers"])


+class I2VGenXLPipeline(metaclass=DummyObject):
+    _backends = ["torch", "transformers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch", "transformers"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+
class IFImg2ImgPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]
...
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import gc
import random
import unittest
import numpy as np
import torch
from transformers import (
CLIPImageProcessor,
CLIPTextConfig,
CLIPTextModel,
CLIPTokenizer,
CLIPVisionConfig,
CLIPVisionModelWithProjection,
)
from diffusers import (
AutoencoderKL,
DDIMScheduler,
I2VGenXLPipeline,
)
from diffusers.models.unets import I2VGenXLUNet
from diffusers.utils import is_xformers_available, load_image
from diffusers.utils.testing_utils import (
enable_full_determinism,
floats_tensor,
numpy_cosine_similarity_distance,
print_tensor_test,
require_torch_gpu,
skip_mps,
slow,
torch_device,
)
from ..test_pipelines_common import PipelineTesterMixin
enable_full_determinism()
@skip_mps
class I2VGenXLPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
pipeline_class = I2VGenXLPipeline
params = frozenset(["prompt", "negative_prompt", "image"])
batch_params = frozenset(["prompt", "negative_prompt", "image", "generator"])
# No `output_type`.
required_optional_params = frozenset(["num_inference_steps", "generator", "latents", "return_dict"])
def get_dummy_components(self):
torch.manual_seed(0)
scheduler = DDIMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
)
torch.manual_seed(0)
unet = I2VGenXLUNet(
block_out_channels=(4, 8),
layers_per_block=1,
sample_size=32,
in_channels=4,
out_channels=4,
down_block_types=("CrossAttnDownBlock3D", "DownBlock3D"),
up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
cross_attention_dim=4,
num_attention_heads=4,
norm_num_groups=2,
)
torch.manual_seed(0)
vae = AutoencoderKL(
block_out_channels=(8,),
in_channels=3,
out_channels=3,
down_block_types=["DownEncoderBlock2D"],
up_block_types=["UpDecoderBlock2D"],
latent_channels=4,
sample_size=32,
norm_num_groups=2,
)
torch.manual_seed(0)
text_encoder_config = CLIPTextConfig(
bos_token_id=0,
eos_token_id=2,
hidden_size=4,
intermediate_size=16,
layer_norm_eps=1e-05,
num_attention_heads=2,
num_hidden_layers=2,
pad_token_id=1,
vocab_size=1000,
hidden_act="gelu",
projection_dim=32,
)
text_encoder = CLIPTextModel(text_encoder_config)
tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
torch.manual_seed(0)
vision_encoder_config = CLIPVisionConfig(
hidden_size=4,
projection_dim=4,
num_hidden_layers=2,
num_attention_heads=2,
image_size=32,
intermediate_size=16,
patch_size=1,
)
image_encoder = CLIPVisionModelWithProjection(vision_encoder_config)
torch.manual_seed(0)
feature_extractor = CLIPImageProcessor(crop_size=32, size=32)
components = {
"unet": unet,
"scheduler": scheduler,
"vae": vae,
"text_encoder": text_encoder,
"image_encoder": image_encoder,
"tokenizer": tokenizer,
"feature_extractor": feature_extractor,
}
return components
def get_dummy_inputs(self, device, seed=0):
if str(device).startswith("mps"):
generator = torch.manual_seed(seed)
else:
generator = torch.Generator(device=device).manual_seed(seed)
input_image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
inputs = {
"prompt": "A painting of a squirrel eating a burger",
"image": input_image,
"generator": generator,
"num_inference_steps": 2,
"guidance_scale": 6.0,
"output_type": "pt",
"num_frames": 4,
"width": 32,
"height": 32,
}
return inputs
def test_text_to_video_default_case(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
inputs["output_type"] = "np"
frames = pipe(**inputs).frames
image_slice = frames[0][0][-3:, -3:, -1]
assert frames[0][0].shape == (32, 32, 3)
expected_slice = np.array([0.5146, 0.6525, 0.6032, 0.5204, 0.5675, 0.4125, 0.3016, 0.5172, 0.4095])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
def test_save_load_local(self):
super().test_save_load_local(expected_max_difference=0.006)
def test_sequential_cpu_offload_forward_pass(self):
super().test_sequential_cpu_offload_forward_pass(expected_max_diff=0.008)
def test_dict_tuple_outputs_equivalent(self):
super().test_dict_tuple_outputs_equivalent(expected_max_difference=0.008)
def test_save_load_optional_components(self):
super().test_save_load_optional_components(expected_max_difference=0.008)
@unittest.skip("Deprecated functionality")
def test_attention_slicing_forward_pass(self):
pass
@unittest.skipIf(
torch_device != "cuda" or not is_xformers_available(),
reason="XFormers attention is only available with CUDA and `xformers` installed",
)
def test_xformers_attention_forwardGenerator_pass(self):
self._test_xformers_attention_forwardGenerator_pass(test_mean_pixel_difference=False, expected_max_diff=1e-2)
def test_inference_batch_single_identical(self):
super().test_inference_batch_single_identical(batch_size=2, expected_max_diff=0.008)
def test_model_cpu_offload_forward_pass(self):
super().test_model_cpu_offload_forward_pass(expected_max_diff=0.008)
def test_num_videos_per_prompt(self):
device = "cpu" # ensure determinism for the device-dependent torch.Generator
components = self.get_dummy_components()
pipe = self.pipeline_class(**components)
pipe = pipe.to(device)
pipe.set_progress_bar_config(disable=None)
inputs = self.get_dummy_inputs(device)
inputs["output_type"] = "np"
frames = pipe(**inputs, num_videos_per_prompt=2).frames
assert frames.shape == (2, 4, 32, 32, 3)
assert frames[0][0].shape == (32, 32, 3)
image_slice = frames[0][0][-3:, -3:, -1]
expected_slice = np.array([0.5146, 0.6525, 0.6032, 0.5204, 0.5675, 0.4125, 0.3016, 0.5172, 0.4095])
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
@slow
@require_torch_gpu
class I2VGenXLPipelineSlowTests(unittest.TestCase):
def tearDown(self):
# clean up the VRAM after each test
super().tearDown()
gc.collect()
torch.cuda.empty_cache()
def test_i2vgen_xl(self):
pipe = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(torch_device)
pipe.enable_model_cpu_offload()
pipe.set_progress_bar_config(disable=None)
image = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
generator = torch.Generator("cpu").manual_seed(0)
num_frames = 3
output = pipe(
image=image,
prompt="my cat",
num_frames=num_frames,
generator=generator,
num_inference_steps=3,
output_type="np",
)
image = output.frames[0]
assert image.shape == (num_frames, 704, 1280, 3)
image_slice = image[0, -3:, -3:, -1]
print_tensor_test(image_slice.flatten())
expected_slice = np.array([0.5482, 0.6244, 0.6274, 0.4584, 0.5935, 0.5937, 0.4579, 0.5767, 0.5892])
assert numpy_cosine_similarity_distance(image_slice.flatten(), expected_slice.flatten()) < 1e-3