Unverified Commit 23b467c7 authored by Shenghai Yuan, committed by GitHub

[core] ConsisID (#10140)



* Update __init__.py

* add consisid

* update consisid

* update consisid

* make style

* make_style

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* add doc

* make style

* Rename consisid .md to consisid.md

* Update geodiff_molecule_conformation.ipynb

* Update geodiff_molecule_conformation.ipynb

* Update geodiff_molecule_conformation.ipynb

* Update demo.ipynb

* Update pipeline_consisid.py

* make fix-copies

* Update docs/source/en/using-diffusers/consisid.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/consisid.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/using-diffusers/consisid.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update doc & pipeline code

* fix typo

* make style

* update example

* Update docs/source/en/using-diffusers/consisid.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* update example

* update example

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* Update src/diffusers/pipelines/consisid/pipeline_consisid.py
Co-authored-by: hlky <hlky@hlky.ac>

* update

* add test and update

* remove some changes from docs

* refactor

* fix

* undo changes to examples

* remove save/load and fuse methods

* update

* link hf-doc-img & make test extremely small

* update

* add lora

* fix test

* update

* update

* change expected_diff_max to 0.4

* fix typo

* fix link

* fix typo

* update docs

* update

* remove consisid lora tests

---------
Co-authored-by: hlky <hlky@hlky.ac>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Aryan <aryan@huggingface.co>
parent aeac0a00
...@@ -79,6 +79,8 @@
- sections:
  - local: using-diffusers/cogvideox
    title: CogVideoX
  - local: using-diffusers/consisid
    title: ConsisID
  - local: using-diffusers/sdxl
    title: Stable Diffusion XL
  - local: using-diffusers/sdxl_turbo
...@@ -270,6 +272,8 @@
    title: AuraFlowTransformer2DModel
  - local: api/models/cogvideox_transformer3d
    title: CogVideoXTransformer3DModel
  - local: api/models/consisid_transformer3d
    title: ConsisIDTransformer3DModel
  - local: api/models/cogview3plus_transformer2d
    title: CogView3PlusTransformer2DModel
  - local: api/models/dit_transformer2d
...@@ -372,6 +376,8 @@
    title: CogVideoX
  - local: api/pipelines/cogview3
    title: CogView3
  - local: api/pipelines/consisid
    title: ConsisID
  - local: api/pipelines/consistency_models
    title: Consistency Models
  - local: api/pipelines/controlnet
...
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->
# ConsisIDTransformer3DModel
A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) was introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by Peking University and the University of Rochester.
The model can be loaded with the following code snippet.
```python
import torch
from diffusers import ConsisIDTransformer3DModel
transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```
## ConsisIDTransformer3DModel
[[autodoc]] ConsisIDTransformer3DModel
## Transformer2DModelOutput
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->
# ConsisID
[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) from Peking University and the University of Rochester, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan.
The abstract from the paper is:
*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.*
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh).
There are two official ConsisID checkpoints for identity-preserving text-to-video.
| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
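Either checkpoint can be loaded in its recommended dtype with the usual [`~DiffusionPipeline.from_pretrained`] call. The snippet below is a minimal sketch, with the repository id and dtype taken from the table above.
```python
import torch
from diffusers import ConsisIDPipeline
# Load the preview checkpoint in the recommended bfloat16 precision
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```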
### Memory optimization
ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab without optimizations. The following memory optimizations can be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.
| Feature (cumulative, each row builds on the previous) | Max Memory Allocated | Max Memory Reserved |
| :----------------------------- | :------------------- | :------------------ |
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |
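These optimizations correspond to the standard `DiffusionPipeline` and VAE helper methods. The snippet below is a rough sketch of how they can be stacked; note that `enable_model_cpu_offload` and `enable_sequential_cpu_offload` are alternatives, so only one of them should be enabled.
```python
import torch
from diffusers import ConsisIDPipeline
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
# Offload submodules to the CPU and move them to the GPU only when they are needed
pipe.enable_sequential_cpu_offload()
# Alternatively, offload whole models instead: pipe.enable_model_cpu_offload()
# Decode the latents slice by slice and tile by tile to reduce the VAE's peak memory
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```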
## ConsisIDPipeline
[[autodoc]] ConsisIDPipeline
- all
- __call__
## ConsisIDPipelineOutput
[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID
[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video by frequency decomposition. The main features of ConsisID are:
- Frequency decomposition: The DiT architecture is analyzed from a frequency-domain perspective, and a control-information injection scheme is designed based on these characteristics.
- Consistency training strategy: A coarse-to-fine training strategy, dynamic masking loss, and dynamic cross-face loss further enhance the model's generalization ability and identity-preservation performance.
- Inference without finetuning: Previous methods required case-by-case finetuning of the input identity before inference, leading to significant time and computational costs. In contrast, ConsisID is tuning-free.
This guide will walk you through using ConsisID for identity-preserving text-to-video generation.
## Load Model Checkpoints
Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method.
```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download
# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
## Identity-Preserving Text-to-Video
For identity-preserving text-to-video, pass a text prompt and an image containing a clear face (preferably a half-body or full-body shot). By default, ConsisID generates a 720x480 video for the best results.
```python
from diffusers.utils import export_to_video
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"
id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```
<table>
<tr>
<th style="text-align: center;">Face Image</th>
<th style="text-align: center;">Video</th>
<th style="text-align: center;">Description</th>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_0.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_0.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress, which moves gracefully with the rhythm of the horse, the fabric flowing fluidly in the animated motion. The dappled sunlight filters through the trees, casting soft, painterly patterns on the forest floor. Her posture is poised, showing both control and elegance as she guides the horse with ease. The animation's gentle, fluid style adds a dreamlike quality to the scene, with the woman’s calm demeanor and the peaceful surroundings evoking a sense of freedom and harmony.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_1.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_1.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air, which is rendered with soft, ethereal strokes. A faint smile plays on her lips as she summons a wisp of ice magic, watching with focus as the surrounding trees and ground begin to shimmer and freeze, covered in delicate ice crystals. The animation’s fluid motion brings the magic to life, with the frost spreading outward in intricate, sparkling patterns. The environment is painted with soft, watercolor-like hues, enhancing the magical, dreamlike atmosphere. The overall mood is serene yet powerful, with the quiet winter air amplifying the delicate beauty of the frozen scene.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_2.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_2.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes squinting slightly against the sun, while a few wisps of hair flutter in the wind. He is dressed in a light, pastel-colored shirt, and the balloons around him sway with the wind, adding a sense of playfulness to the scene. The background blurs softly, with hints of a vibrant market or park, enhancing the light-hearted, yet tender mood of the moment.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_3.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_3.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_4.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_4.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge. The setting appears playful, with colorful toys scattered around and a soft rug underfoot, while sunlight streams through a nearby window, highlighting the fluttering cape and adding to the impression of heroism. The overall atmosphere is lighthearted and fun, with the baby's expressions capturing a mix of innocence and an adorable attempt at bravery, as if truly ready to save the day.</td>
</tr>
</table>
## Resources
Learn more about ConsisID with the following resources.
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440), for more details.
...@@ -5,6 +5,8 @@
    title: Quicktour
  - local: stable_diffusion
    title: Effective and efficient diffusion
  - local: consisid
    title: Identity-preserving text-to-video generation
  - local: installation
    title: Installation
  title: Get started
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID
[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video through frequency decomposition. Its main features are:
- Frequency decomposition: Identity features are decoupled into high-frequency and low-frequency components, the DiT architecture is analyzed from a frequency-domain perspective, and a control-information injection scheme is designed based on these characteristics.
- Consistency training strategy: A coarse-to-fine training strategy, dynamic masking loss, and dynamic cross-face loss further improve the model's generalization ability and identity preservation.
- Inference without finetuning: Previous methods required case-by-case finetuning of the input identity before inference, which is costly in time and compute. In contrast, ConsisID is tuning-free.
This guide will walk you through using ConsisID to generate identity-preserving videos.
## Load Model Checkpoints
Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the [`~DiffusionPipeline.from_pretrained`] method.
```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download
# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
## Identity-Preserving Text-to-Video
For identity-preserving text-to-video, pass a text prompt and an image containing a clear face (preferably a half-body or full-body shot). By default, ConsisID generates a 720x480 video for the best results.
```python
from diffusers.utils import export_to_video
prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"
id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```
<table>
<tr>
<th style="text-align: center;">Face Image</th>
<th style="text-align: center;">Video</th>
<th style="text-align: center;">Description</th>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_0.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_0.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a beautifully crafted animated style, features a confident woman riding a horse through a lush forest clearing. Her expression is focused yet serene as she adjusts her wide-brimmed hat with a practiced hand. She wears a flowy bohemian dress, which moves gracefully with the rhythm of the horse, the fabric flowing fluidly in the animated motion. The dappled sunlight filters through the trees, casting soft, painterly patterns on the forest floor. Her posture is poised, showing both control and elegance as she guides the horse with ease. The animation's gentle, fluid style adds a dreamlike quality to the scene, with the woman’s calm demeanor and the peaceful surroundings evoking a sense of freedom and harmony.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_1.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_1.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video, in a captivating animated style, shows a woman standing in the center of a snowy forest, her eyes narrowed in concentration as she extends her hand forward. She is dressed in a deep blue cloak, her breath visible in the cold air, which is rendered with soft, ethereal strokes. A faint smile plays on her lips as she summons a wisp of ice magic, watching with focus as the surrounding trees and ground begin to shimmer and freeze, covered in delicate ice crystals. The animation’s fluid motion brings the magic to life, with the frost spreading outward in intricate, sparkling patterns. The environment is painted with soft, watercolor-like hues, enhancing the magical, dreamlike atmosphere. The overall mood is serene yet powerful, with the quiet winter air amplifying the delicate beauty of the frozen scene.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_2.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_2.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The animation features a whimsical portrait of a balloon seller standing in a gentle breeze, captured with soft, hazy brushstrokes that evoke the feel of a serene spring day. His face is framed by a gentle smile, his eyes squinting slightly against the sun, while a few wisps of hair flutter in the wind. He is dressed in a light, pastel-colored shirt, and the balloons around him sway with the wind, adding a sense of playfulness to the scene. The background blurs softly, with hints of a vibrant market or park, enhancing the light-hearted, yet tender mood of the moment.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_3.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_3.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel.</td>
</tr>
<tr>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_image_4.png?download=true" style="height: auto; width: 600px;"></td>
<td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_output_4.gif?download=true" style="height: auto; width: 2000px;"></td>
<td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge. The setting appears playful, with colorful toys scattered around and a soft rug underfoot, while sunlight streams through a nearby window, highlighting the fluttering cape and adding to the impression of heroism. The overall atmosphere is lighthearted and fun, with the baby's expressions capturing a mix of innocence and an adorable attempt at bravery, as if truly ready to save the day.</td>
</tr>
</table>
## Resources
Learn more about ConsisID with the following resources:
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440), for more details.
...@@ -92,6 +92,7 @@ else:
"AutoencoderTiny",
"CogVideoXTransformer3DModel",
"CogView3PlusTransformer2DModel",
"ConsisIDTransformer3DModel",
"ConsistencyDecoderVAE",
"ControlNetModel",
"ControlNetUnionModel",
...@@ -275,6 +276,7 @@ else:
"CogVideoXPipeline",
"CogVideoXVideoToVideoPipeline",
"CogView3PlusPipeline",
"ConsisIDPipeline",
"CycleDiffusionPipeline",
"FluxControlImg2ImgPipeline",
"FluxControlInpaintPipeline",
...@@ -602,6 +604,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AutoencoderTiny,
CogVideoXTransformer3DModel,
CogView3PlusTransformer2DModel,
ConsisIDTransformer3DModel,
ConsistencyDecoderVAE,
ControlNetModel,
ControlNetUnionModel,
...@@ -764,6 +767,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
CogVideoXPipeline,
CogVideoXVideoToVideoPipeline,
CogView3PlusPipeline,
ConsisIDPipeline,
CycleDiffusionPipeline,
FluxControlImg2ImgPipeline,
FluxControlInpaintPipeline,
...
...@@ -47,6 +47,7 @@ _SET_ADAPTER_SCALE_FN_MAPPING = {
"SD3Transformer2DModel": lambda model_cls, weights: weights,
"FluxTransformer2DModel": lambda model_cls, weights: weights,
"CogVideoXTransformer3DModel": lambda model_cls, weights: weights,
"ConsisIDTransformer3DModel": lambda model_cls, weights: weights,
"MochiTransformer3DModel": lambda model_cls, weights: weights,
"HunyuanVideoTransformer3DModel": lambda model_cls, weights: weights,
"LTXVideoTransformer3DModel": lambda model_cls, weights: weights,
...
...@@ -54,6 +54,7 @@ if is_torch_available():
_import_structure["modeling_utils"] = ["ModelMixin"]
_import_structure["transformers.auraflow_transformer_2d"] = ["AuraFlowTransformer2DModel"]
_import_structure["transformers.cogvideox_transformer_3d"] = ["CogVideoXTransformer3DModel"]
_import_structure["transformers.consisid_transformer_3d"] = ["ConsisIDTransformer3DModel"]
_import_structure["transformers.dit_transformer_2d"] = ["DiTTransformer2DModel"]
_import_structure["transformers.dual_transformer_2d"] = ["DualTransformer2DModel"]
_import_structure["transformers.hunyuan_transformer_2d"] = ["HunyuanDiT2DModel"]
...@@ -129,6 +130,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AuraFlowTransformer2DModel,
CogVideoXTransformer3DModel,
CogView3PlusTransformer2DModel,
ConsisIDTransformer3DModel,
DiTTransformer2DModel,
DualTransformer2DModel,
FluxTransformer2DModel,
...
...@@ -4,6 +4,7 @@ from ...utils import is_torch_available
if is_torch_available():
from .auraflow_transformer_2d import AuraFlowTransformer2DModel
from .cogvideox_transformer_3d import CogVideoXTransformer3DModel
from .consisid_transformer_3d import ConsisIDTransformer3DModel
from .dit_transformer_2d import DiTTransformer2DModel
from .dual_transformer_2d import DualTransformer2DModel
from .hunyuan_transformer_2d import HunyuanDiT2DModel
...
...@@ -154,6 +154,7 @@ else:
"CogVideoXFunControlPipeline",
]
_import_structure["cogview3"] = ["CogView3PlusPipeline"]
_import_structure["consisid"] = ["ConsisIDPipeline"]
_import_structure["controlnet"].extend(
[
"BlipDiffusionControlNetPipeline",
...@@ -496,6 +497,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
CogVideoXVideoToVideoPipeline,
)
from .cogview3 import CogView3PlusPipeline
from .consisid import ConsisIDPipeline
from .controlnet import (
BlipDiffusionControlNetPipeline,
StableDiffusionControlNetImg2ImgPipeline,
...
from typing import TYPE_CHECKING
from ...utils import (
DIFFUSERS_SLOW_IMPORT,
OptionalDependencyNotAvailable,
_LazyModule,
get_objects_from_module,
is_torch_available,
is_transformers_available,
)
_dummy_objects = {}
_import_structure = {}
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils import dummy_torch_and_transformers_objects # noqa F403
_dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
else:
_import_structure["pipeline_consisid"] = ["ConsisIDPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
try:
if not (is_transformers_available() and is_torch_available()):
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from ...utils.dummy_torch_and_transformers_objects import *
else:
from .pipeline_consisid import ConsisIDPipeline
else:
import sys
sys.modules[__name__] = _LazyModule(
__name__,
globals()["__file__"],
_import_structure,
module_spec=__spec__,
)
for name, value in _dummy_objects.items():
setattr(sys.modules[__name__], name, value)
import importlib.util
import os
import cv2
import numpy as np
import torch
from PIL import Image, ImageOps
from torchvision.transforms import InterpolationMode
from torchvision.transforms.functional import normalize, resize
from ...utils import load_image
_insightface_available = importlib.util.find_spec("insightface") is not None
_consisid_eva_clip_available = importlib.util.find_spec("consisid_eva_clip") is not None
_facexlib_available = importlib.util.find_spec("facexlib") is not None
if _insightface_available:
import insightface
from insightface.app import FaceAnalysis
else:
raise ImportError("insightface is not available. Please install it using 'pip install insightface'.")
if _consisid_eva_clip_available:
from consisid_eva_clip import create_model_and_transforms
from consisid_eva_clip.constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD
else:
raise ImportError("consisid_eva_clip is not available. Please install it using 'pip install consisid_eva_clip'.")
if _facexlib_available:
from facexlib.parsing import init_parsing_model
from facexlib.utils.face_restoration_helper import FaceRestoreHelper
else:
raise ImportError("facexlib is not available. Please install it using 'pip install facexlib'.")
def resize_numpy_image_long(image, resize_long_edge=768):
"""
Resize the input image to a specified long edge while maintaining aspect ratio.
Args:
image (numpy.ndarray): Input image (H x W x C or H x W).
resize_long_edge (int): The target size for the long edge of the image. Default is 768.
Returns:
numpy.ndarray: Resized image with the long edge matching `resize_long_edge`, while maintaining the aspect
ratio.
"""
h, w = image.shape[:2]
if max(h, w) <= resize_long_edge:
return image
k = resize_long_edge / max(h, w)
h = int(h * k)
w = int(w * k)
image = cv2.resize(image, (w, h), interpolation=cv2.INTER_LANCZOS4)
return image
def img2tensor(imgs, bgr2rgb=True, float32=True):
"""Numpy array to tensor.
Args:
imgs (list[ndarray] | ndarray): Input images.
bgr2rgb (bool): Whether to change bgr to rgb.
float32 (bool): Whether to change to float32.
Returns:
list[tensor] | tensor: Tensor images. If returned results only have
one element, just return tensor.
"""
def _totensor(img, bgr2rgb, float32):
if img.shape[2] == 3 and bgr2rgb:
if img.dtype == "float64":
img = img.astype("float32")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = torch.from_numpy(img.transpose(2, 0, 1))
if float32:
img = img.float()
return img
if isinstance(imgs, list):
return [_totensor(img, bgr2rgb, float32) for img in imgs]
return _totensor(imgs, bgr2rgb, float32)
def to_gray(img):
"""
Converts an RGB image to grayscale by applying the standard luminosity formula.
Args:
img (torch.Tensor): The input image tensor with shape (batch_size, channels, height, width).
The image is expected to be in RGB format (3 channels).
Returns:
torch.Tensor: The grayscale image tensor with shape (batch_size, 3, height, width).
The grayscale values are replicated across all three channels.
"""
x = 0.299 * img[:, 0:1] + 0.587 * img[:, 1:2] + 0.114 * img[:, 2:3]
x = x.repeat(1, 3, 1, 1)
return x
def process_face_embeddings(
face_helper_1,
clip_vision_model,
face_helper_2,
eva_transform_mean,
eva_transform_std,
app,
device,
weight_dtype,
image,
original_id_image=None,
is_align_face=True,
):
"""
Process face embeddings from an image, extracting relevant features such as face embeddings, landmarks, and parsed
face features using a series of face detection and alignment tools.
Args:
face_helper_1: Face helper object (first helper) for alignment and landmark detection.
clip_vision_model: Pre-trained CLIP vision model used for feature extraction.
face_helper_2: Face helper object (second helper) for embedding extraction.
eva_transform_mean: Mean values for image normalization before passing to EVA model.
eva_transform_std: Standard deviation values for image normalization before passing to EVA model.
app: Application instance used for face detection.
device: Device (CPU or GPU) where the computations will be performed.
weight_dtype: Data type of the weights for precision (e.g., `torch.float32`).
image: Input image in RGB format with pixel values in the range [0, 255].
original_id_image: (Optional) Original image for feature extraction if `is_align_face` is False.
is_align_face: Boolean flag indicating whether face alignment should be performed.
Returns:
Tuple:
- id_cond: Concatenated tensor of Ante face embedding and CLIP vision embedding
- id_vit_hidden: Hidden state of the CLIP vision model, a list of tensors.
- return_face_features_image_2: Processed face features image after normalization and parsing.
- face_kps: Keypoints of the face detected in the image.
"""
face_helper_1.clean_all()
image_bgr = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
# get antelopev2 embedding
face_info = app.get(image_bgr)
if len(face_info) > 0:
face_info = sorted(face_info, key=lambda x: (x["bbox"][2] - x["bbox"][0]) * (x["bbox"][3] - x["bbox"][1]))[
-1
] # only use the maximum face
id_ante_embedding = face_info["embedding"] # (512,)
face_kps = face_info["kps"]
else:
id_ante_embedding = None
face_kps = None
# using facexlib to detect and align face
face_helper_1.read_image(image_bgr)
face_helper_1.get_face_landmarks_5(only_center_face=True)
if face_kps is None:
face_kps = face_helper_1.all_landmarks_5[0]
face_helper_1.align_warp_face()
if len(face_helper_1.cropped_faces) == 0:
raise RuntimeError("facexlib align face fail")
align_face = face_helper_1.cropped_faces[0] # (512, 512, 3) # RGB
# in case insightface didn't detect a face
if id_ante_embedding is None:
print("failed to detect face using insightface, extracting embedding from the aligned face")
id_ante_embedding = face_helper_2.get_feat(align_face)
id_ante_embedding = torch.from_numpy(id_ante_embedding).to(device, weight_dtype) # torch.Size([512])
if id_ante_embedding.ndim == 1:
id_ante_embedding = id_ante_embedding.unsqueeze(0) # torch.Size([1, 512])
# parsing
if is_align_face:
input = img2tensor(align_face, bgr2rgb=True).unsqueeze(0) / 255.0 # torch.Size([1, 3, 512, 512])
input = input.to(device)
parsing_out = face_helper_1.face_parse(normalize(input, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]))[0]
parsing_out = parsing_out.argmax(dim=1, keepdim=True) # torch.Size([1, 1, 512, 512])
bg_label = [0, 16, 18, 7, 8, 9, 14, 15]
bg = sum(parsing_out == i for i in bg_label).bool()
white_image = torch.ones_like(input) # torch.Size([1, 3, 512, 512])
# only keep the face features
return_face_features_image = torch.where(bg, white_image, to_gray(input)) # torch.Size([1, 3, 512, 512])
return_face_features_image_2 = torch.where(bg, white_image, input) # torch.Size([1, 3, 512, 512])
else:
original_image_bgr = cv2.cvtColor(original_id_image, cv2.COLOR_RGB2BGR)
input = img2tensor(original_image_bgr, bgr2rgb=True).unsqueeze(0) / 255.0 # torch.Size([1, 3, 512, 512])
input = input.to(device)
return_face_features_image = return_face_features_image_2 = input
# transform img before sending to eva-clip-vit
face_features_image = resize(
return_face_features_image, clip_vision_model.image_size, InterpolationMode.BICUBIC
) # torch.Size([1, 3, 336, 336])
face_features_image = normalize(face_features_image, eva_transform_mean, eva_transform_std)
id_cond_vit, id_vit_hidden = clip_vision_model(
face_features_image.to(weight_dtype), return_all_features=False, return_hidden=True, shuffle=False
) # torch.Size([1, 768]), list(torch.Size([1, 577, 1024]))
id_cond_vit_norm = torch.norm(id_cond_vit, 2, 1, True)
id_cond_vit = torch.div(id_cond_vit, id_cond_vit_norm)
id_cond = torch.cat(
[id_ante_embedding, id_cond_vit], dim=-1
) # torch.Size([1, 512]), torch.Size([1, 768]) -> torch.Size([1, 1280])
return (
id_cond,
id_vit_hidden,
return_face_features_image_2,
face_kps,
) # torch.Size([1, 1280]), list(torch.Size([1, 577, 1024]))
def process_face_embeddings_infer(
face_helper_1,
clip_vision_model,
face_helper_2,
eva_transform_mean,
eva_transform_std,
app,
device,
weight_dtype,
img_file_path,
is_align_face=True,
):
"""
Process face embeddings from an input image for inference, including alignment, feature extraction, and embedding
concatenation.
Args:
face_helper_1: Face helper object (first helper) for alignment and landmark detection.
clip_vision_model: Pre-trained CLIP vision model used for feature extraction.
face_helper_2: Face helper object (second helper) for embedding extraction.
eva_transform_mean: Mean values for image normalization before passing to EVA model.
eva_transform_std: Standard deviation values for image normalization before passing to EVA model.
app: Application instance used for face detection.
device: Device (CPU or GPU) where the computations will be performed.
weight_dtype: Data type of the weights for precision (e.g., `torch.float32`).
img_file_path: Path to the input image file (string) or a numpy array representing an image.
is_align_face: Boolean flag indicating whether face alignment should be performed (default: True).
Returns:
Tuple:
- id_cond: Concatenated tensor of Ante face embedding and CLIP vision embedding.
- id_vit_hidden: Hidden state of the CLIP vision model, a list of tensors.
- image: Processed face image after feature extraction and alignment.
- face_kps: Keypoints of the face detected in the image.
"""
# Load and preprocess the input image
if isinstance(img_file_path, str):
image = np.array(load_image(image=img_file_path).convert("RGB"))
else:
image = np.array(ImageOps.exif_transpose(Image.fromarray(img_file_path)).convert("RGB"))
# Resize image to ensure the longer side is 1024 pixels
image = resize_numpy_image_long(image, 1024)
original_id_image = image
# Process the image to extract face embeddings and related features
id_cond, id_vit_hidden, align_crop_face_image, face_kps = process_face_embeddings(
face_helper_1,
clip_vision_model,
face_helper_2,
eva_transform_mean,
eva_transform_std,
app,
device,
weight_dtype,
image,
original_id_image,
is_align_face,
)
# Convert the aligned cropped face image (torch tensor) to a numpy array
tensor = align_crop_face_image.cpu().detach()
tensor = tensor.squeeze()
tensor = tensor.permute(1, 2, 0)
tensor = tensor.numpy() * 255
tensor = tensor.astype(np.uint8)
image = ImageOps.exif_transpose(Image.fromarray(tensor))
return id_cond, id_vit_hidden, image, face_kps
def prepare_face_models(model_path, device, dtype):
"""
Prepare all face models for the facial recognition task.
Parameters:
- model_path: Path to the directory containing model files.
- device: The device (e.g., 'cuda', 'cpu') where models will be loaded.
- dtype: Data type (e.g., torch.float32) for model inference.
Returns:
- face_helper_1: First face restoration helper.
- face_helper_2: Second face restoration helper.
- face_clip_model: CLIP model for face feature extraction.
- face_main_model: Main face analysis model.
- eva_transform_mean: Mean values for image normalization.
- eva_transform_std: Standard deviation values for image normalization.
"""
# get helper model
face_helper_1 = FaceRestoreHelper(
upscale_factor=1,
face_size=512,
crop_ratio=(1, 1),
det_model="retinaface_resnet50",
save_ext="png",
device=device,
model_rootpath=os.path.join(model_path, "face_encoder"),
)
face_helper_1.face_parse = None
face_helper_1.face_parse = init_parsing_model(
model_name="bisenet", device=device, model_rootpath=os.path.join(model_path, "face_encoder")
)
face_helper_2 = insightface.model_zoo.get_model(
f"{model_path}/face_encoder/models/antelopev2/glintr100.onnx", providers=["CUDAExecutionProvider"]
)
face_helper_2.prepare(ctx_id=0)
# get local facial extractor part 1
model, _, _ = create_model_and_transforms(
"EVA02-CLIP-L-14-336",
os.path.join(model_path, "face_encoder", "EVA02_CLIP_L_336_psz14_s6B.pt"),
force_custom_clip=True,
)
face_clip_model = model.visual
eva_transform_mean = getattr(face_clip_model, "image_mean", OPENAI_DATASET_MEAN)
eva_transform_std = getattr(face_clip_model, "image_std", OPENAI_DATASET_STD)
if not isinstance(eva_transform_mean, (list, tuple)):
eva_transform_mean = (eva_transform_mean,) * 3
if not isinstance(eva_transform_std, (list, tuple)):
eva_transform_std = (eva_transform_std,) * 3
# get local facial extractor part 2
face_main_model = FaceAnalysis(
name="antelopev2", root=os.path.join(model_path, "face_encoder"), providers=["CUDAExecutionProvider"]
)
face_main_model.prepare(ctx_id=0, det_size=(640, 640))
# move face models to device
face_helper_1.face_det.eval()
face_helper_1.face_parse.eval()
face_clip_model.eval()
face_helper_1.face_det.to(device)
face_helper_1.face_parse.to(device)
face_clip_model.to(device, dtype=dtype)
return face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std
from dataclasses import dataclass
import torch
from diffusers.utils import BaseOutput
@dataclass
class ConsisIDPipelineOutput(BaseOutput):
r"""
Output class for ConsisID pipelines.
Args:
frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
List of video outputs - It can be a nested list of length `batch_size`, with each sub-list containing
denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape
`(batch_size, num_frames, channels, height, width)`.
"""
frames: torch.Tensor
...@@ -227,6 +227,21 @@ class CogView3PlusTransformer2DModel(metaclass=DummyObject):
requires_backends(cls, ["torch"])
class ConsisIDTransformer3DModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class ConsistencyDecoderVAE(metaclass=DummyObject):
_backends = ["torch"]
...
...@@ -362,6 +362,21 @@ class CogView3PlusPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class ConsisIDPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class CycleDiffusionPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
...
# coding=utf-8
# Copyright 2024 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import torch
from diffusers import ConsisIDTransformer3DModel
from diffusers.utils.testing_utils import (
enable_full_determinism,
torch_device,
)
from ..test_modeling_common import ModelTesterMixin
enable_full_determinism()
class ConsisIDTransformerTests(ModelTesterMixin, unittest.TestCase):
model_class = ConsisIDTransformer3DModel
main_input_name = "hidden_states"
uses_custom_attn_processor = True
@property
def dummy_input(self):
batch_size = 2
num_channels = 4
num_frames = 1
height = 8
width = 8
embedding_dim = 8
sequence_length = 8
hidden_states = torch.randn((batch_size, num_frames, num_channels, height, width)).to(torch_device)
encoder_hidden_states = torch.randn((batch_size, sequence_length, embedding_dim)).to(torch_device)
timestep = torch.randint(0, 1000, size=(batch_size,)).to(torch_device)
id_vit_hidden = [torch.ones([batch_size, 2, 2]).to(torch_device)] * 1
id_cond = torch.ones(batch_size, 2).to(torch_device)
return {
"hidden_states": hidden_states,
"encoder_hidden_states": encoder_hidden_states,
"timestep": timestep,
"id_vit_hidden": id_vit_hidden,
"id_cond": id_cond,
}
@property
def input_shape(self):
return (1, 4, 8, 8)
@property
def output_shape(self):
return (1, 4, 8, 8)
def prepare_init_args_and_inputs_for_common(self):
init_dict = {
"num_attention_heads": 2,
"attention_head_dim": 8,
"in_channels": 4,
"out_channels": 4,
"time_embed_dim": 2,
"text_embed_dim": 8,
"num_layers": 1,
"sample_width": 8,
"sample_height": 8,
"sample_frames": 8,
"patch_size": 2,
"temporal_compression_ratio": 4,
"max_text_seq_length": 8,
"cross_attn_interval": 1,
"is_kps": False,
"is_train_face": True,
"cross_attn_dim_head": 1,
"cross_attn_num_heads": 1,
"LFE_id_dim": 2,
"LFE_vit_dim": 2,
"LFE_depth": 5,
"LFE_dim_head": 8,
"LFE_num_heads": 2,
"LFE_num_id_token": 1,
"LFE_num_querie": 1,
"LFE_output_dim": 10,
"LFE_ff_mult": 1,
"LFE_num_scale": 1,
}
inputs_dict = self.dummy_input
return init_dict, inputs_dict
def test_gradient_checkpointing_is_applied(self):
expected_set = {"ConsisIDTransformer3DModel"}
super().test_gradient_checkpointing_is_applied(expected_set=expected_set)