Commit 6156cf8f authored by YiYi Xu, committed by GitHub (parent 152f7ca3)
@@ -359,6 +359,8 @@
  title: HunyuanDiT2DModel
- local: api/models/hunyuanimage_transformer_2d
  title: HunyuanImageTransformer2DModel
- local: api/models/hunyuan_video15_transformer_3d
  title: HunyuanVideo15Transformer3DModel
- local: api/models/hunyuan_video_transformer_3d
  title: HunyuanVideoTransformer3DModel
- local: api/models/latte_transformer3d
@@ -433,6 +435,8 @@
  title: AutoencoderKLHunyuanImageRefiner
- local: api/models/autoencoder_kl_hunyuan_video
  title: AutoencoderKLHunyuanVideo
- local: api/models/autoencoder_kl_hunyuan_video15
  title: AutoencoderKLHunyuanVideo15
- local: api/models/autoencoderkl_ltx_video
  title: AutoencoderKLLTXVideo
- local: api/models/autoencoderkl_magvit
@@ -652,6 +656,8 @@
  title: Framepack
- local: api/pipelines/hunyuan_video
  title: HunyuanVideo
- local: api/pipelines/hunyuan_video15
  title: HunyuanVideo1.5
- local: api/pipelines/i2vgenxl
  title: I2VGen-XL
- local: api/pipelines/kandinsky5_video
...
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# AutoencoderKLHunyuanVideo15
The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo1.5](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5) by Tencent.
The model can be loaded with the following code snippet.
```python
import torch

from diffusers import AutoencoderKLHunyuanVideo15

vae = AutoencoderKLHunyuanVideo15.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="vae", torch_dtype=torch.float32
)
# make sure to enable tiling to avoid OOM
vae.enable_tiling()
```
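Once loaded, the VAE can be used like the other video autoencoders in Diffusers. The snippet below is a minimal sketch, assuming the usual `encode`/`decode` API and a `(batch, channels, frames, height, width)` input layout; the tensor shape is illustrative only.
```python
import torch

# assumes `vae` was loaded as shown above
video = torch.randn(1, 3, 9, 256, 256, dtype=torch.float32)  # dummy (B, C, F, H, W) video

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # sample from the posterior latent distribution
    reconstruction = vae.decode(latents).sample       # decode back to pixel space
```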
## AutoencoderKLHunyuanVideo15
[[autodoc]] AutoencoderKLHunyuanVideo15
- decode
- encode
- all
## DecoderOutput
[[autodoc]] models.autoencoders.vae.DecoderOutput
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# HunyuanVideo15Transformer3DModel
A Diffusion Transformer model for 3D video-like data used in [HunyuanVideo1.5](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5).
The model can be loaded with the following code snippet.
```python
import torch

from diffusers import HunyuanVideo15Transformer3DModel

transformer = HunyuanVideo15Transformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="transformer", torch_dtype=torch.bfloat16
)
```
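If you want to load the transformer separately, for example to reuse it across pipelines, it can be passed directly to the pipeline. A minimal sketch, assuming the `hunyuanvideo-community` checkpoint layout from the snippet above:
```python
import torch

from diffusers import HunyuanVideo15Pipeline, HunyuanVideo15Transformer3DModel

transformer = HunyuanVideo15Transformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="transformer", torch_dtype=torch.bfloat16
)

# pass the preloaded transformer to the pipeline instead of letting it load its own copy
pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", transformer=transformer, torch_dtype=torch.bfloat16
)
```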
## HunyuanVideo15Transformer3DModel
[[autodoc]] HunyuanVideo15Transformer3DModel
## Transformer2DModelOutput
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->
# HunyuanVideo-1.5
HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models.
You can find all the original HunyuanVideo-1.5 checkpoints under the [Tencent](https://huggingface.co/tencent) organization.
> [!TIP]
> Click on the HunyuanVideo-1.5 models in the right sidebar for more examples of video generation tasks.
>
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.
The example below demonstrates how to generate a video optimized for memory or inference speed.
<hfoptions id="usage">
<hfoption id="memory">
Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",
    torch_dtype=torch.bfloat16,
)
# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
</hfoptions>
## Notes
- HunyuanVideo-1.5 uses attention masks with variable-length sequences. For best performance, we recommend an attention backend that handles padding efficiently:

  - **H100/H800:** `_flash_3_hub` or `_flash_varlen_3`
  - **A100/A800/RTX 4090:** `flash_hub` or `flash_varlen`
  - **Other GPUs:** `sage_hub`

  Refer to the [Attention backends](../../optimization/attention_backends) guide for more details about using a different backend.

  ```py
  pipe.transformer.set_attention_backend("flash_hub")  # or your preferred backend
  ```
- [`HunyuanVideo15Pipeline`] uses a guider and does not take a `guidance_scale` argument at runtime.

  You can inspect the default guider configuration with `pipe.guider`:

  ```py
  >>> pipe.guider
  ClassifierFreeGuidance {
    "_class_name": "ClassifierFreeGuidance",
    "_diffusers_version": "0.36.0.dev0",
    "enabled": true,
    "guidance_rescale": 0.0,
    "guidance_scale": 6.0,
    "start": 0.0,
    "stop": 1.0,
    "use_original_formulation": false
  }
  State:
    step: None
    num_inference_steps: None
    timestep: None
    count_prepared: 0
    enabled: True
    num_conditions: 2
  ```

  To update the guider configuration, run `pipe.guider = pipe.guider.new(...)`:

  ```py
  pipe.guider = pipe.guider.new(guidance_scale=5.0)
  ```

  Read more about guiders [here](../../modular_diffusers/guiders).
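- Image-to-video generation uses [`HunyuanVideo15ImageToVideoPipeline`]. The snippet below is a sketch rather than a verified example: the `i2v` repo id and the `image`/`prompt` call signature are assumptions based on the text-to-video example above and the existing HunyuanVideo image-to-video pipeline.

  ```py
  import torch
  from diffusers import HunyuanVideo15ImageToVideoPipeline
  from diffusers.utils import export_to_video, load_image

  # NOTE: hypothetical repo id; check the hunyuanvideo-community org for the actual i2v checkpoint
  pipeline = HunyuanVideo15ImageToVideoPipeline.from_pretrained(
      "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_i2v",
      torch_dtype=torch.bfloat16,
  )
  pipeline.enable_model_cpu_offload()
  pipeline.vae.enable_tiling()

  image = load_image("path/to/first_frame.png")  # conditioning image
  prompt = "The teddy bear slowly waves at the camera."
  video = pipeline(image=image, prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
  export_to_video(video, "output_i2v.mp4", fps=15)
  ```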
## HunyuanVideo15Pipeline
[[autodoc]] HunyuanVideo15Pipeline
- all
- __call__
## HunyuanVideo15ImageToVideoPipeline
[[autodoc]] HunyuanVideo15ImageToVideoPipeline
- all
- __call__
## HunyuanVideo15PipelineOutput
[[autodoc]] pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput
@@ -190,6 +190,7 @@ else:
"AutoencoderKLHunyuanImage",
"AutoencoderKLHunyuanImageRefiner",
"AutoencoderKLHunyuanVideo",
"AutoencoderKLHunyuanVideo15",
"AutoencoderKLLTXVideo",
"AutoencoderKLMagvit",
"AutoencoderKLMochi",
@@ -225,6 +226,7 @@ else:
"HunyuanDiT2DModel",
"HunyuanDiT2DMultiControlNetModel",
"HunyuanImageTransformer2DModel",
"HunyuanVideo15Transformer3DModel",
"HunyuanVideoFramepackTransformer3DModel",
"HunyuanVideoTransformer3DModel",
"I2VGenXLUNet",
@@ -481,6 +483,8 @@ else:
"HunyuanImagePipeline",
"HunyuanImageRefinerPipeline",
"HunyuanSkyreelsImageToVideoPipeline",
"HunyuanVideo15ImageToVideoPipeline",
"HunyuanVideo15Pipeline",
"HunyuanVideoFramepackPipeline",
"HunyuanVideoImageToVideoPipeline",
"HunyuanVideoPipeline",
@@ -909,6 +913,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AutoencoderKLHunyuanImage,
AutoencoderKLHunyuanImageRefiner,
AutoencoderKLHunyuanVideo,
AutoencoderKLHunyuanVideo15,
AutoencoderKLLTXVideo,
AutoencoderKLMagvit,
AutoencoderKLMochi,
@@ -944,6 +949,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
HunyuanDiT2DModel,
HunyuanDiT2DMultiControlNetModel,
HunyuanImageTransformer2DModel,
HunyuanVideo15Transformer3DModel,
HunyuanVideoFramepackTransformer3DModel,
HunyuanVideoTransformer3DModel,
I2VGenXLUNet,
@@ -1170,6 +1176,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
HunyuanImagePipeline,
HunyuanImageRefinerPipeline,
HunyuanSkyreelsImageToVideoPipeline,
HunyuanVideo15ImageToVideoPipeline,
HunyuanVideo15Pipeline,
HunyuanVideoFramepackPipeline,
HunyuanVideoImageToVideoPipeline,
HunyuanVideoPipeline,
...
@@ -39,6 +39,7 @@ if is_torch_available():
_import_structure["autoencoders.autoencoder_kl_hunyuan_video"] = ["AutoencoderKLHunyuanVideo"]
_import_structure["autoencoders.autoencoder_kl_hunyuanimage"] = ["AutoencoderKLHunyuanImage"]
_import_structure["autoencoders.autoencoder_kl_hunyuanimage_refiner"] = ["AutoencoderKLHunyuanImageRefiner"]
_import_structure["autoencoders.autoencoder_kl_hunyuanvideo15"] = ["AutoencoderKLHunyuanVideo15"]
_import_structure["autoencoders.autoencoder_kl_ltx"] = ["AutoencoderKLLTXVideo"]
_import_structure["autoencoders.autoencoder_kl_magvit"] = ["AutoencoderKLMagvit"]
_import_structure["autoencoders.autoencoder_kl_mochi"] = ["AutoencoderKLMochi"]
@@ -96,6 +97,7 @@ if is_torch_available():
_import_structure["transformers.transformer_flux2"] = ["Flux2Transformer2DModel"]
_import_structure["transformers.transformer_hidream_image"] = ["HiDreamImageTransformer2DModel"]
_import_structure["transformers.transformer_hunyuan_video"] = ["HunyuanVideoTransformer3DModel"]
_import_structure["transformers.transformer_hunyuan_video15"] = ["HunyuanVideo15Transformer3DModel"]
_import_structure["transformers.transformer_hunyuan_video_framepack"] = ["HunyuanVideoFramepackTransformer3DModel"]
_import_structure["transformers.transformer_hunyuanimage"] = ["HunyuanImageTransformer2DModel"]
_import_structure["transformers.transformer_kandinsky"] = ["Kandinsky5Transformer3DModel"]
@@ -147,6 +149,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AutoencoderKLHunyuanImage,
AutoencoderKLHunyuanImageRefiner,
AutoencoderKLHunyuanVideo,
AutoencoderKLHunyuanVideo15,
AutoencoderKLLTXVideo,
AutoencoderKLMagvit,
AutoencoderKLMochi,
@@ -199,6 +202,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
HiDreamImageTransformer2DModel,
HunyuanDiT2DModel,
HunyuanImageTransformer2DModel,
HunyuanVideo15Transformer3DModel,
HunyuanVideoFramepackTransformer3DModel,
HunyuanVideoTransformer3DModel,
Kandinsky5Transformer3DModel,
...
@@ -282,6 +282,7 @@ def attention_backend(backend: Union[str, AttentionBackendName] = AttentionBacke
backend = AttentionBackendName(backend)
_check_attention_backend_requirements(backend)
_maybe_download_kernel_for_backend(backend)

old_backend = _AttentionBackendRegistry._active_backend
_AttentionBackendRegistry._active_backend = backend
...
@@ -8,6 +8,7 @@ from .autoencoder_kl_flux2 import AutoencoderKLFlux2
from .autoencoder_kl_hunyuan_video import AutoencoderKLHunyuanVideo
from .autoencoder_kl_hunyuanimage import AutoencoderKLHunyuanImage
from .autoencoder_kl_hunyuanimage_refiner import AutoencoderKLHunyuanImageRefiner
from .autoencoder_kl_hunyuanvideo15 import AutoencoderKLHunyuanVideo15
from .autoencoder_kl_ltx import AutoencoderKLLTXVideo
from .autoencoder_kl_magvit import AutoencoderKLMagvit
from .autoencoder_kl_mochi import AutoencoderKLMochi
...
@@ -29,6 +29,7 @@ if is_torch_available():
from .transformer_flux2 import Flux2Transformer2DModel
from .transformer_hidream_image import HiDreamImageTransformer2DModel
from .transformer_hunyuan_video import HunyuanVideoTransformer3DModel
from .transformer_hunyuan_video15 import HunyuanVideo15Transformer3DModel
from .transformer_hunyuan_video_framepack import HunyuanVideoFramepackTransformer3DModel
from .transformer_hunyuanimage import HunyuanImageTransformer2DModel
from .transformer_kandinsky import Kandinsky5Transformer3DModel
...
@@ -243,6 +243,7 @@ else:
"HunyuanVideoImageToVideoPipeline",
"HunyuanVideoFramepackPipeline",
]
_import_structure["hunyuan_video1_5"] = ["HunyuanVideo15Pipeline", "HunyuanVideo15ImageToVideoPipeline"]
_import_structure["hunyuan_image"] = ["HunyuanImagePipeline", "HunyuanImageRefinerPipeline"]
_import_structure["kandinsky"] = [
"KandinskyCombinedPipeline",
@@ -665,6 +666,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
HunyuanVideoImageToVideoPipeline,
HunyuanVideoPipeline,
)
from .hunyuan_video1_5 import HunyuanVideo15ImageToVideoPipeline, HunyuanVideo15Pipeline
from .hunyuandit import HunyuanDiTPipeline
from .i2vgen_xl import I2VGenXLPipeline
from .kandinsky import (
...
from typing import TYPE_CHECKING

from ...utils import (
    DIFFUSERS_SLOW_IMPORT,
    OptionalDependencyNotAvailable,
    _LazyModule,
    get_objects_from_module,
    is_torch_available,
    is_transformers_available,
)


_dummy_objects = {}
_import_structure = {}

try:
    if not (is_transformers_available() and is_torch_available()):
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from ...utils import dummy_torch_and_transformers_objects  # noqa F403

    _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
    _import_structure["pipeline_hunyuan_video1_5"] = ["HunyuanVideo15Pipeline"]
    _import_structure["pipeline_hunyuan_video1_5_image2video"] = ["HunyuanVideo15ImageToVideoPipeline"]

if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
    try:
        if not (is_transformers_available() and is_torch_available()):
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        from ...utils.dummy_torch_and_transformers_objects import *
    else:
        from .pipeline_hunyuan_video1_5 import HunyuanVideo15Pipeline
        from .pipeline_hunyuan_video1_5_image2video import HunyuanVideo15ImageToVideoPipeline

else:
    import sys

    sys.modules[__name__] = _LazyModule(
        __name__,
        globals()["__file__"],
        _import_structure,
        module_spec=__spec__,
    )

    for name, value in _dummy_objects.items():
        setattr(sys.modules[__name__], name, value)
# Copyright 2025 The HunyuanVideo Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np

from ...configuration_utils import register_to_config
from ...video_processor import VideoProcessor


# copied from https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5/blob/main/hyvideo/utils/data_utils.py#L20
def generate_crop_size_list(base_size=256, patch_size=16, max_ratio=4.0):
    num_patches = round((base_size / patch_size) ** 2)
    assert max_ratio >= 1.0
    crop_size_list = []
    # walk over patch-grid shapes (wp, hp) and keep the crop sizes whose aspect ratio stays within `max_ratio`
    wp, hp = num_patches, 1
    while wp > 0:
        if max(wp, hp) / min(wp, hp) <= max_ratio:
            crop_size_list.append((wp * patch_size, hp * patch_size))
        if (hp + 1) * wp <= num_patches:
            hp += 1
        else:
            wp -= 1
    return crop_size_list
# copied from https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5/blob/main/hyvideo/utils/data_utils.py#L38
def get_closest_ratio(height: float, width: float, ratios: list, buckets: list):
    """
    Get the closest ratio in the buckets.

    Args:
        height (float): video height
        width (float): video width
        ratios (list): video aspect ratios
        buckets (list): buckets generated by `generate_crop_size_list`

    Returns:
        the closest size in the buckets and the corresponding ratio
    """
    aspect_ratio = float(height) / float(width)
    diff_ratios = ratios - aspect_ratio

    if aspect_ratio >= 1:
        indices = [(index, x) for index, x in enumerate(diff_ratios) if x <= 0]
    else:
        indices = [(index, x) for index, x in enumerate(diff_ratios) if x >= 0]

    closest_ratio_id = min(indices, key=lambda pair: abs(pair[1]))[0]
    closest_size = buckets[closest_ratio_id]
    closest_ratio = ratios[closest_ratio_id]
    return closest_size, closest_ratio
class HunyuanVideo15ImageProcessor(VideoProcessor):
    r"""
    Image/video processor to preprocess/postprocess the reference image and generated video for the HunyuanVideo1.5
    model.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to downscale the image's (height, width) dimensions to multiples of `vae_scale_factor`. Can accept
            `height` and `width` arguments from [`image_processor.VaeImageProcessor.preprocess`] method.
        vae_scale_factor (`int`, *optional*, defaults to `16`):
            VAE (spatial) scale factor. If `do_resize` is `True`, the image is automatically resized to multiples of
            this factor.
        vae_latent_channels (`int`, *optional*, defaults to `32`):
            VAE latent channels.
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB.
    """

    @register_to_config
    def __init__(
        self,
        do_resize: bool = True,
        vae_scale_factor: int = 16,
        vae_latent_channels: int = 32,
        do_convert_rgb: bool = True,
    ):
        super().__init__(
            do_resize=do_resize,
            vae_scale_factor=vae_scale_factor,
            vae_latent_channels=vae_latent_channels,
            do_convert_rgb=do_convert_rgb,
        )

    def calculate_default_height_width(self, height: int, width: int, target_size: int):
        crop_size_list = generate_crop_size_list(base_size=target_size, patch_size=self.config.vae_scale_factor)
        aspect_ratios = np.array([round(float(h) / float(w), 5) for h, w in crop_size_list])
        height, width = get_closest_ratio(height, width, aspect_ratios, crop_size_list)[0]
        return height, width
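
# Usage sketch (illustrative comment, not part of the original module): snap a 720x1280 input to
# the closest aspect-ratio bucket for a given target size. The concrete numbers are assumptions.
#
#   processor = HunyuanVideo15ImageProcessor(vae_scale_factor=16)
#   height, width = processor.calculate_default_height_width(720, 1280, target_size=480)
#   # `height` and `width` are multiples of 16 drawn from the buckets of `generate_crop_size_list(base_size=480)`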
from dataclasses import dataclass

import torch

from diffusers.utils import BaseOutput


@dataclass
class HunyuanVideo15PipelineOutput(BaseOutput):
    r"""
    Output class for HunyuanVideo1.5 pipelines.

    Args:
        frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
            List of video outputs - It can be a nested list of length `batch_size`, with each sub-list containing
            denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape
            `(batch_size, num_frames, channels, height, width)`.
    """

    frames: torch.Tensor
@@ -468,6 +468,21 @@ class AutoencoderKLHunyuanVideo(metaclass=DummyObject):
        requires_backends(cls, ["torch"])


class AutoencoderKLHunyuanVideo15(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch"])


class AutoencoderKLLTXVideo(metaclass=DummyObject):
    _backends = ["torch"]

@@ -993,6 +1008,21 @@ class HunyuanImageTransformer2DModel(metaclass=DummyObject):
        requires_backends(cls, ["torch"])


class HunyuanVideo15Transformer3DModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch"])


class HunyuanVideoFramepackTransformer3DModel(metaclass=DummyObject):
    _backends = ["torch"]

...