Unverified Commit 4f136f84 authored by Guo-Hua Wang, committed by GitHub

Add support for Ovis-Image (#12740)



* add ovis_image

* fix code quality

* optimize pipeline_ovis_image.py according to the feedbacks

* optimize imports

* add docs

* make style

* make style

* add ovis to toctree

* oops

---------
Co-authored-by: YiYi Xu <yixu310@gmail.com>
parent edf36f51
@@ -375,6 +375,8 @@
title: MochiTransformer3DModel
- local: api/models/omnigen_transformer
title: OmniGenTransformer2DModel
- local: api/models/ovisimage_transformer2d
title: OvisImageTransformer2DModel
- local: api/models/pixart_transformer2d
title: PixArtTransformer2DModel
- local: api/models/prior_transformer
@@ -567,6 +569,8 @@
title: MultiDiffusion
- local: api/pipelines/omnigen
title: OmniGen
- local: api/pipelines/ovis_image
title: Ovis-Image
- local: api/pipelines/pag
title: PAG
- local: api/pipelines/paint_by_example
...
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# OvisImageTransformer2DModel
The model can be loaded with the following code snippet.
```python
import torch

from diffusers import OvisImageTransformer2DModel

transformer = OvisImageTransformer2DModel.from_pretrained(
    "AIDC-AI/Ovis-Image-7B", subfolder="transformer", torch_dtype=torch.bfloat16
)
```
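A separately loaded transformer can be handed to the full pipeline as a component override. This is a minimal sketch following the usual diffusers `from_pretrained` pattern; it assumes `OvisImagePipeline` accepts a `transformer` component like other diffusers pipelines:

```python
import torch

from diffusers import OvisImagePipeline, OvisImageTransformer2DModel

# Load the transformer on its own, then pass it to the pipeline so the
# remaining components still come from the same checkpoint.
transformer = OvisImageTransformer2DModel.from_pretrained(
    "AIDC-AI/Ovis-Image-7B", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = OvisImagePipeline.from_pretrained(
    "AIDC-AI/Ovis-Image-7B", transformer=transformer, torch_dtype=torch.bfloat16
)
```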
## OvisImageTransformer2DModel
[[autodoc]] OvisImageTransformer2DModel
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Ovis-Image
![concepts](https://github.com/AIDC-AI/Ovis-Image/blob/main/docs/imgs/ovis_image_case.png)
Ovis-Image is a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints.
[Ovis-Image Technical Report](https://arxiv.org/abs/2511.22982) from Alibaba Group, by Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen.
The abstract from the paper is:
*We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.*
**Highlights**:
* **Strong text rendering at a compact 7B scale**: Ovis-Image is a 7B text-to-image model that delivers text rendering quality comparable to much larger 20B-class systems such as Qwen-Image and competitive with leading closed-source models like GPT4o in text-centric scenarios, while remaining small enough to run on widely accessible hardware.
* **High fidelity on text-heavy, layout-sensitive prompts**: The model excels on prompts that demand tight alignment between linguistic content and rendered typography (e.g., posters, banners, logos, UI mockups, infographics), producing legible, correctly spelled, and semantically consistent text across diverse fonts, sizes, and aspect ratios without compromising overall visual quality.
* **Efficiency and deployability**: With its 7B parameter budget and streamlined architecture, Ovis-Image fits on a single high-end GPU with moderate memory, supports low-latency interactive use, and scales to batch production serving, bringing near–frontier text rendering to applications where tens-of-billions–parameter models are impractical.
This pipeline was contributed by the Ovis-Image Team. The original codebase can be found [here](https://github.com/AIDC-AI/Ovis-Image).
Available models:
| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`AIDC-AI/Ovis-Image-7B`](https://huggingface.co/AIDC-AI/Ovis-Image-7B) | `torch.bfloat16` |
Refer to [this](https://huggingface.co/collections/AIDC-AI/ovis-image) collection for more information.
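The snippet below shows a basic text-to-image call. It is a minimal sketch assuming the standard diffusers text-to-image call signature; the `num_inference_steps` and `guidance_scale` values are illustrative assumptions, not tuned defaults:

```python
import torch

from diffusers import OvisImagePipeline

pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Ovis-Image is optimized for text rendering, so prompts with explicit text are a good fit.
prompt = 'A poster with the bold headline "OVIS-IMAGE" in clean sans-serif type on a deep blue background'
image = pipe(prompt, num_inference_steps=50, guidance_scale=5.0).images[0]
image.save("ovis_image.png")
```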
## OvisImagePipeline
[[autodoc]] OvisImagePipeline
- all
- __call__
## OvisImagePipelineOutput
[[autodoc]] pipelines.ovis_image.pipeline_output.OvisImagePipelineOutput
import argparse
from contextlib import nullcontext
import safetensors.torch
import torch
from accelerate import init_empty_weights
from huggingface_hub import hf_hub_download
from diffusers import OvisImageTransformer2DModel
from diffusers.utils.import_utils import is_accelerate_available
"""
# Transformer
python scripts/convert_ovis_image_to_diffusers.py \
    --original_state_dict_repo_id "AIDC-AI/Ovis-Image-7B" \
    --filename "ovis_image.safetensors" \
    --output_path "ovis-image" \
    --transformer
"""
CTX = init_empty_weights if is_accelerate_available() else nullcontext
parser = argparse.ArgumentParser()
parser.add_argument("--original_state_dict_repo_id", default=None, type=str)
parser.add_argument("--filename", default="ovis_image.safetensors", type=str)
parser.add_argument("--checkpoint_path", default=None, type=str)
parser.add_argument("--in_channels", type=int, default=64)
parser.add_argument("--out_channels", type=int, default=None)
parser.add_argument("--transformer", action="store_true")
parser.add_argument("--output_path", type=str)
parser.add_argument("--dtype", type=str, default="bf16")
args = parser.parse_args()
dtype = torch.bfloat16 if args.dtype == "bf16" else torch.float32
def load_original_checkpoint(args):
if args.original_state_dict_repo_id is not None:
ckpt_path = hf_hub_download(repo_id=args.original_state_dict_repo_id, filename=args.filename)
elif args.checkpoint_path is not None:
ckpt_path = args.checkpoint_path
else:
        raise ValueError("Please provide either `original_state_dict_repo_id` or a local `checkpoint_path`.")
original_state_dict = safetensors.torch.load_file(ckpt_path)
return original_state_dict
# In the original SD3-style implementation of AdaLayerNormContinuous, the linear projection output is split into
# (shift, scale), while diffusers splits it into (scale, shift). Swap the halves of the projection weights so the
# diffusers implementation can be used.
def swap_scale_shift(weight):
shift, scale = weight.chunk(2, dim=0)
new_weight = torch.cat([scale, shift], dim=0)
return new_weight
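# Sketch: if weight = torch.cat([shift, scale]) with shift = torch.zeros(2) and scale = torch.ones(2),
# swap_scale_shift(weight) returns tensor([1., 1., 0., 0.]), i.e. the (scale, shift) ordering diffusers expects.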
def convert_ovis_image_transformer_checkpoint_to_diffusers(
original_state_dict, num_layers, num_single_layers, inner_dim, mlp_ratio=4.0
):
converted_state_dict = {}
## time_text_embed.timestep_embedder <- time_in
converted_state_dict["timestep_embedder.linear_1.weight"] = original_state_dict.pop("time_in.in_layer.weight")
converted_state_dict["timestep_embedder.linear_1.bias"] = original_state_dict.pop("time_in.in_layer.bias")
converted_state_dict["timestep_embedder.linear_2.weight"] = original_state_dict.pop("time_in.out_layer.weight")
converted_state_dict["timestep_embedder.linear_2.bias"] = original_state_dict.pop("time_in.out_layer.bias")
# context_embedder
converted_state_dict["context_embedder_norm.weight"] = original_state_dict.pop("semantic_txt_norm.weight")
converted_state_dict["context_embedder.weight"] = original_state_dict.pop("semantic_txt_in.weight")
converted_state_dict["context_embedder.bias"] = original_state_dict.pop("semantic_txt_in.bias")
# x_embedder
converted_state_dict["x_embedder.weight"] = original_state_dict.pop("img_in.weight")
converted_state_dict["x_embedder.bias"] = original_state_dict.pop("img_in.bias")
# double transformer blocks
for i in range(num_layers):
block_prefix = f"transformer_blocks.{i}."
# norms.
## norm1
converted_state_dict[f"{block_prefix}norm1.linear.weight"] = original_state_dict.pop(
f"double_blocks.{i}.img_mod.lin.weight"
)
converted_state_dict[f"{block_prefix}norm1.linear.bias"] = original_state_dict.pop(
f"double_blocks.{i}.img_mod.lin.bias"
)
## norm1_context
converted_state_dict[f"{block_prefix}norm1_context.linear.weight"] = original_state_dict.pop(
f"double_blocks.{i}.txt_mod.lin.weight"
)
converted_state_dict[f"{block_prefix}norm1_context.linear.bias"] = original_state_dict.pop(
f"double_blocks.{i}.txt_mod.lin.bias"
)
# Q, K, V
sample_q, sample_k, sample_v = torch.chunk(
original_state_dict.pop(f"double_blocks.{i}.img_attn.qkv.weight"), 3, dim=0
)
context_q, context_k, context_v = torch.chunk(
original_state_dict.pop(f"double_blocks.{i}.txt_attn.qkv.weight"), 3, dim=0
)
sample_q_bias, sample_k_bias, sample_v_bias = torch.chunk(
original_state_dict.pop(f"double_blocks.{i}.img_attn.qkv.bias"), 3, dim=0
)
context_q_bias, context_k_bias, context_v_bias = torch.chunk(
original_state_dict.pop(f"double_blocks.{i}.txt_attn.qkv.bias"), 3, dim=0
)
converted_state_dict[f"{block_prefix}attn.to_q.weight"] = torch.cat([sample_q])
converted_state_dict[f"{block_prefix}attn.to_q.bias"] = torch.cat([sample_q_bias])
converted_state_dict[f"{block_prefix}attn.to_k.weight"] = torch.cat([sample_k])
converted_state_dict[f"{block_prefix}attn.to_k.bias"] = torch.cat([sample_k_bias])
converted_state_dict[f"{block_prefix}attn.to_v.weight"] = torch.cat([sample_v])
converted_state_dict[f"{block_prefix}attn.to_v.bias"] = torch.cat([sample_v_bias])
converted_state_dict[f"{block_prefix}attn.add_q_proj.weight"] = torch.cat([context_q])
converted_state_dict[f"{block_prefix}attn.add_q_proj.bias"] = torch.cat([context_q_bias])
converted_state_dict[f"{block_prefix}attn.add_k_proj.weight"] = torch.cat([context_k])
converted_state_dict[f"{block_prefix}attn.add_k_proj.bias"] = torch.cat([context_k_bias])
converted_state_dict[f"{block_prefix}attn.add_v_proj.weight"] = torch.cat([context_v])
converted_state_dict[f"{block_prefix}attn.add_v_proj.bias"] = torch.cat([context_v_bias])
# qk_norm
converted_state_dict[f"{block_prefix}attn.norm_q.weight"] = original_state_dict.pop(
f"double_blocks.{i}.img_attn.norm.query_norm.weight"
)
converted_state_dict[f"{block_prefix}attn.norm_k.weight"] = original_state_dict.pop(
f"double_blocks.{i}.img_attn.norm.key_norm.weight"
)
converted_state_dict[f"{block_prefix}attn.norm_added_q.weight"] = original_state_dict.pop(
f"double_blocks.{i}.txt_attn.norm.query_norm.weight"
)
converted_state_dict[f"{block_prefix}attn.norm_added_k.weight"] = original_state_dict.pop(
f"double_blocks.{i}.txt_attn.norm.key_norm.weight"
)
# ff img_mlp
converted_state_dict[f"{block_prefix}ff.net.0.proj.weight"] = torch.cat(
[
original_state_dict.pop(f"double_blocks.{i}.img_mlp.up_proj.weight"),
original_state_dict.pop(f"double_blocks.{i}.img_mlp.gate_proj.weight"),
],
dim=0,
)
converted_state_dict[f"{block_prefix}ff.net.0.proj.bias"] = torch.cat(
[
original_state_dict.pop(f"double_blocks.{i}.img_mlp.up_proj.bias"),
original_state_dict.pop(f"double_blocks.{i}.img_mlp.gate_proj.bias"),
],
dim=0,
)
converted_state_dict[f"{block_prefix}ff.net.2.weight"] = original_state_dict.pop(
f"double_blocks.{i}.img_mlp.down_proj.weight"
)
converted_state_dict[f"{block_prefix}ff.net.2.bias"] = original_state_dict.pop(
f"double_blocks.{i}.img_mlp.down_proj.bias"
)
converted_state_dict[f"{block_prefix}ff_context.net.0.proj.weight"] = torch.cat(
[
original_state_dict.pop(f"double_blocks.{i}.txt_mlp.up_proj.weight"),
original_state_dict.pop(f"double_blocks.{i}.txt_mlp.gate_proj.weight"),
],
dim=0,
)
converted_state_dict[f"{block_prefix}ff_context.net.0.proj.bias"] = torch.cat(
[
original_state_dict.pop(f"double_blocks.{i}.txt_mlp.up_proj.bias"),
original_state_dict.pop(f"double_blocks.{i}.txt_mlp.gate_proj.bias"),
],
dim=0,
)
converted_state_dict[f"{block_prefix}ff_context.net.2.weight"] = original_state_dict.pop(
f"double_blocks.{i}.txt_mlp.down_proj.weight"
)
converted_state_dict[f"{block_prefix}ff_context.net.2.bias"] = original_state_dict.pop(
f"double_blocks.{i}.txt_mlp.down_proj.bias"
)
# output projections.
converted_state_dict[f"{block_prefix}attn.to_out.0.weight"] = original_state_dict.pop(
f"double_blocks.{i}.img_attn.proj.weight"
)
converted_state_dict[f"{block_prefix}attn.to_out.0.bias"] = original_state_dict.pop(
f"double_blocks.{i}.img_attn.proj.bias"
)
converted_state_dict[f"{block_prefix}attn.to_add_out.weight"] = original_state_dict.pop(
f"double_blocks.{i}.txt_attn.proj.weight"
)
converted_state_dict[f"{block_prefix}attn.to_add_out.bias"] = original_state_dict.pop(
f"double_blocks.{i}.txt_attn.proj.bias"
)
# single transformer blocks
for i in range(num_single_layers):
block_prefix = f"single_transformer_blocks.{i}."
# norm.linear <- single_blocks.0.modulation.lin
converted_state_dict[f"{block_prefix}norm.linear.weight"] = original_state_dict.pop(
f"single_blocks.{i}.modulation.lin.weight"
)
converted_state_dict[f"{block_prefix}norm.linear.bias"] = original_state_dict.pop(
f"single_blocks.{i}.modulation.lin.bias"
)
# Q, K, V, mlp
mlp_hidden_dim = int(inner_dim * mlp_ratio)
split_size = (inner_dim, inner_dim, inner_dim, mlp_hidden_dim * 2)
q, k, v, mlp = torch.split(original_state_dict.pop(f"single_blocks.{i}.linear1.weight"), split_size, dim=0)
q_bias, k_bias, v_bias, mlp_bias = torch.split(
original_state_dict.pop(f"single_blocks.{i}.linear1.bias"), split_size, dim=0
)
converted_state_dict[f"{block_prefix}attn.to_q.weight"] = torch.cat([q])
converted_state_dict[f"{block_prefix}attn.to_q.bias"] = torch.cat([q_bias])
converted_state_dict[f"{block_prefix}attn.to_k.weight"] = torch.cat([k])
converted_state_dict[f"{block_prefix}attn.to_k.bias"] = torch.cat([k_bias])
converted_state_dict[f"{block_prefix}attn.to_v.weight"] = torch.cat([v])
converted_state_dict[f"{block_prefix}attn.to_v.bias"] = torch.cat([v_bias])
converted_state_dict[f"{block_prefix}proj_mlp.weight"] = torch.cat([mlp])
converted_state_dict[f"{block_prefix}proj_mlp.bias"] = torch.cat([mlp_bias])
# qk norm
converted_state_dict[f"{block_prefix}attn.norm_q.weight"] = original_state_dict.pop(
f"single_blocks.{i}.norm.query_norm.weight"
)
converted_state_dict[f"{block_prefix}attn.norm_k.weight"] = original_state_dict.pop(
f"single_blocks.{i}.norm.key_norm.weight"
)
# output projections.
converted_state_dict[f"{block_prefix}proj_out.weight"] = original_state_dict.pop(
f"single_blocks.{i}.linear2.weight"
)
converted_state_dict[f"{block_prefix}proj_out.bias"] = original_state_dict.pop(
f"single_blocks.{i}.linear2.bias"
)
converted_state_dict["proj_out.weight"] = original_state_dict.pop("final_layer.linear.weight")
converted_state_dict["proj_out.bias"] = original_state_dict.pop("final_layer.linear.bias")
converted_state_dict["norm_out.linear.weight"] = swap_scale_shift(
original_state_dict.pop("final_layer.adaLN_modulation.1.weight")
)
converted_state_dict["norm_out.linear.bias"] = swap_scale_shift(
original_state_dict.pop("final_layer.adaLN_modulation.1.bias")
)
return converted_state_dict
def main(args):
original_ckpt = load_original_checkpoint(args)
if args.transformer:
num_layers = 6
num_single_layers = 27
inner_dim = 3072
mlp_ratio = 4.0
converted_transformer_state_dict = convert_ovis_image_transformer_checkpoint_to_diffusers(
original_ckpt, num_layers, num_single_layers, inner_dim, mlp_ratio=mlp_ratio
)
transformer = OvisImageTransformer2DModel(in_channels=args.in_channels, out_channels=args.out_channels)
transformer.load_state_dict(converted_transformer_state_dict, strict=True)
print("Saving Ovis-Image Transformer in Diffusers format.")
transformer.to(dtype).save_pretrained(f"{args.output_path}/transformer")
if __name__ == "__main__":
main(args)
@@ -242,6 +242,7 @@ else:
"MultiAdapter",
"MultiControlNetModel",
"OmniGenTransformer2DModel",
"OvisImageTransformer2DModel",
"ParallelConfig", "ParallelConfig",
"PixArtTransformer2DModel", "PixArtTransformer2DModel",
"PriorTransformer", "PriorTransformer",
@@ -537,6 +538,7 @@ else:
"MochiPipeline",
"MusicLDMPipeline",
"OmniGenPipeline",
"OvisImagePipeline",
"PaintByExamplePipeline", "PaintByExamplePipeline",
"PIAPipeline", "PIAPipeline",
"PixArtAlphaPipeline", "PixArtAlphaPipeline",
@@ -965,6 +967,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
MultiAdapter,
MultiControlNetModel,
OmniGenTransformer2DModel,
OvisImageTransformer2DModel,
ParallelConfig,
PixArtTransformer2DModel,
PriorTransformer,
@@ -1230,6 +1233,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
MochiPipeline,
MusicLDMPipeline,
OmniGenPipeline,
OvisImagePipeline,
PaintByExamplePipeline,
PIAPipeline,
PixArtAlphaPipeline,
...
@@ -105,6 +105,7 @@ if is_torch_available():
_import_structure["transformers.transformer_lumina2"] = ["Lumina2Transformer2DModel"]
_import_structure["transformers.transformer_mochi"] = ["MochiTransformer3DModel"]
_import_structure["transformers.transformer_omnigen"] = ["OmniGenTransformer2DModel"]
_import_structure["transformers.transformer_ovis_image"] = ["OvisImageTransformer2DModel"]
_import_structure["transformers.transformer_prx"] = ["PRXTransformer2DModel"] _import_structure["transformers.transformer_prx"] = ["PRXTransformer2DModel"]
_import_structure["transformers.transformer_qwenimage"] = ["QwenImageTransformer2DModel"] _import_structure["transformers.transformer_qwenimage"] = ["QwenImageTransformer2DModel"]
_import_structure["transformers.transformer_sana_video"] = ["SanaVideoTransformer3DModel"] _import_structure["transformers.transformer_sana_video"] = ["SanaVideoTransformer3DModel"]
@@ -212,6 +213,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
LuminaNextDiT2DModel,
MochiTransformer3DModel,
OmniGenTransformer2DModel,
OvisImageTransformer2DModel,
PixArtTransformer2DModel,
PriorTransformer,
PRXTransformer2DModel,
...
@@ -37,6 +37,7 @@ if is_torch_available():
from .transformer_lumina2 import Lumina2Transformer2DModel
from .transformer_mochi import MochiTransformer3DModel
from .transformer_omnigen import OmniGenTransformer2DModel
from .transformer_ovis_image import OvisImageTransformer2DModel
from .transformer_prx import PRXTransformer2DModel
from .transformer_qwenimage import QwenImageTransformer2DModel
from .transformer_sana_video import SanaVideoTransformer3DModel
...
This diff is collapsed.
@@ -301,6 +301,7 @@ else:
_import_structure["mochi"] = ["MochiPipeline"]
_import_structure["musicldm"] = ["MusicLDMPipeline"]
_import_structure["omnigen"] = ["OmniGenPipeline"]
_import_structure["ovis_image"] = ["OvisImagePipeline"]
_import_structure["visualcloze"] = ["VisualClozePipeline", "VisualClozeGenerationPipeline"] _import_structure["visualcloze"] = ["VisualClozePipeline", "VisualClozeGenerationPipeline"]
_import_structure["paint_by_example"] = ["PaintByExamplePipeline"] _import_structure["paint_by_example"] = ["PaintByExamplePipeline"]
_import_structure["pia"] = ["PIAPipeline"] _import_structure["pia"] = ["PIAPipeline"]
@@ -719,6 +720,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .mochi import MochiPipeline
from .musicldm import MusicLDMPipeline
from .omnigen import OmniGenPipeline
from .ovis_image import OvisImagePipeline
from .pag import (
AnimateDiffPAGPipeline,
HunyuanDiTPAGPipeline,
...
from typing import TYPE_CHECKING
from ...utils import (
DIFFUSERS_SLOW_IMPORT,
OptionalDependencyNotAvailable,
_LazyModule,
get_objects_from_module,
is_torch_available,
is_transformers_available,
)
_dummy_objects = {}
_import_structure = {}
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils import dummy_torch_and_transformers_objects # noqa: F403
_dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
_import_structure["pipeline_output"] = ["OvisImagePipelineOutput"]
_import_structure["pipeline_ovis_image"] = ["OvisImagePipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import *
else:
from .pipeline_output import OvisImagePipelineOutput
from .pipeline_ovis_image import OvisImagePipeline
else:
import sys
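    # Swap this module for a _LazyModule so that submodules (and their heavy
    # torch/transformers imports) are only loaded on first attribute access.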
sys.modules[__name__] = _LazyModule(
__name__,
globals()["__file__"],
_import_structure,
module_spec=__spec__,
)
for name, value in _dummy_objects.items():
setattr(sys.modules[__name__], name, value)
# Copyright 2025 Alibaba Ovis-Image Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import List, Union
import numpy as np
import PIL.Image
from diffusers.utils import BaseOutput
@dataclass
class OvisImagePipelineOutput(BaseOutput):
"""
Output class for Ovis-Image pipelines.
Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`):
            List of denoised PIL images of length `batch_size`, or a NumPy array of shape `(batch_size, height,
            width, num_channels)`, representing the denoised images produced by the diffusion pipeline.
"""
images: Union[List[PIL.Image.Image], np.ndarray]
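# Usage sketch: following the usual diffusers convention, `OvisImagePipeline.__call__` returns this
# dataclass when `return_dict=True` (the default), so `output.images[0]` is the first generated
# PIL image for the default `output_type="pil"`.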
This diff is collapsed.
@@ -1248,6 +1248,21 @@ class OmniGenTransformer2DModel(metaclass=DummyObject):
requires_backends(cls, ["torch"])
class OvisImageTransformer2DModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class ParallelConfig(metaclass=DummyObject):
_backends = ["torch"]
...
@@ -1952,6 +1952,21 @@ class OmniGenPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class OvisImagePipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class PaintByExamplePipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
...