# Hunyuan DiT for ComfyUI
Hunyuan DiT is a diffusion model that understands both English and Chinese. You can use it in [ComfyUI](https://github.com/comfyanonymous/ComfyUI), the most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
## Hunyuan DiT 1.2
### text2image
1. Download [hunyuan_dit_1.2.safetensors](https://huggingface.co/comfyanonymous/hunyuan_dit_comfyui/blob/main/hunyuan_dit_1.2.safetensors) and place it in your `ComfyUI/models/checkpoints` directory (a command-line download sketch follows the model list below).
2. Load the following image in ComfyUI to get the workflow:
Tip: the workflow JSON is embedded in the image file, so loading the image restores the full workflow.
![Example](hunyuan_dit_1.2_example.png)
Some base models:
- Trained by @tencent
  - Any style: [Link](https://huggingface.co/comfyanonymous/hunyuan_dit_comfyui/blob/main/hunyuan_dit_1.2.safetensors)
- Trained by @LAX
  - Anime style: [Link](https://huggingface.co/comfyanonymous/Freeway_Animation_Hunyuan_Demo_ComfyUI_Converted/blob/main/freeway_animation_demo_hunyuan_dit_ema.safetensors)
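If you prefer the command line, the checkpoint from step 1 can be fetched with `huggingface-cli` (a minimal sketch; it assumes the tool is installed and that the command is run from the directory containing your ComfyUI install):
```bash
# Download the base Hunyuan DiT 1.2 checkpoint into the ComfyUI checkpoints folder.
huggingface-cli download comfyanonymous/hunyuan_dit_comfyui hunyuan_dit_1.2.safetensors \
    --local-dir ./ComfyUI/models/checkpoints
```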
### ControlNet
1. Download the [ControlNet model weight files here](https://huggingface.co/Tencent-Hunyuan/HYDiT-ControlNet-v1.2/tree/main) and place them in your `ComfyUI/models/controlnet/hunyuandit` directory (a download sketch follows the model list below).
2. Load the following image in ComfyUI to get a `canny` ControlNet workflow:
![Example](hunyuan_dit_1.2_controlnet_canny_example.png)
Workflow demo:
![Demo](hunyuan_dit_1.2_controlnet_canny_demo.png)
Some ControlNet models:
- Trained by @tencent
  - Pose: [Link](https://huggingface.co/Tencent-Hunyuan/HYDiT-ControlNet-v1.2/blob/main/pytorch_model_pose_distill.pt)
  - Depth: [Link](https://huggingface.co/Tencent-Hunyuan/HYDiT-ControlNet-v1.2/blob/main/pytorch_model_depth_distill.pt)
  - Canny: [Link](https://huggingface.co/Tencent-Hunyuan/HYDiT-ControlNet-v1.2/blob/main/pytorch_model_canny_distill.pt)
- Trained by @TTPlanetPig
  - Inpaint ControlNet: [Link](https://huggingface.co/TTPlanet/HunyuanDiT_Controlnet_inpainting)
  - Tile ControlNet: [Link](https://huggingface.co/TTPlanet/HunyuanDiT_Controlnet_tile)
  - Lineart ControlNet: [Link](https://huggingface.co/TTPlanet/HunyuanDiT_Controlnet_lineart)
  - HunyuanDIT_v1.2 ComfyUI nodes
    - Comfyui_TTP_CN_Preprocessor: [Link](https://github.com/TTPlanetPig/Comfyui_TTP_CN_Preprocessor)
    - Comfyui_TTP_Toolset: [Link](https://github.com/TTPlanetPig/Comfyui_TTP_Toolset)
  - Download from ModelScope: [Link](https://modelscope.cn/models/baicai003/hunyuandit_v1.2_controlnets)
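The Tencent ControlNet weights above can be fetched the same way (a sketch; the target directory matches step 1 of this section):
```bash
huggingface-cli download Tencent-Hunyuan/HYDiT-ControlNet-v1.2 \
    pytorch_model_canny_distill.pt pytorch_model_depth_distill.pt pytorch_model_pose_distill.pt \
    --local-dir ./ComfyUI/models/controlnet/hunyuandit
```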
## Using HunyuanDiT ControlNet
### Instructions
The dependencies and installation are basically the same as the [**base model**](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2).
We provide three types of ControlNet weights for you to test: canny, depth and pose ControlNet.
Download the model using the following commands:
```bash
cd HunyuanDiT
# Use the huggingface-cli tool to download the model.
# We recommend using distilled weights as the base model for ControlNet inference, as our provided pretrained weights are trained on them.
huggingface-cli download Tencent-Hunyuan/HYDiT-ControlNet-v1.2 --local-dir ./ckpts/t2i/controlnet
huggingface-cli download Tencent-Hunyuan/Distillation-v1.2 ./pytorch_model_distill.pt --local-dir ./ckpts/t2i/model
# Quick start
python sample_controlnet.py --infer-mode fa --no-enhance --load-key distill --infer-steps 50 --control-type canny --prompt "在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围" --condition-image-path controlnet/asset/input/canny.jpg --control-weight 1.0
```
Examples of condition input and ControlNet results are as follows:
<table>
<tr>
<td colspan="3" align="center">Condition Input</td>
</tr>
<tr>
<td align="center">Canny ControlNet </td>
<td align="center">Depth ControlNet </td>
<td align="center">Pose ControlNet </td>
</tr>
<tr>
<td align="center">在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围<br>(At night, an ancient Chinese-style lion statue stands in front of the hotel, its eyes gleaming as if guarding the building. The background is the hotel entrance at night, with a close-up, eye-level, and centered composition. This photo presents a realistic photographic style, embodies Chinese sculpture culture, and reveals a mysterious atmosphere.) </td>
<td align="center">在茂密的森林中,一只黑白相间的熊猫静静地坐在绿树红花中,周围是山川和海洋。背景是白天的森林,光线充足。照片采用特写、平视和居中构图的方式,呈现出写实的效果<br>(In the dense forest, a black and white panda sits quietly among the green trees and red flowers, surrounded by mountains and oceans. The background is a daytime forest with ample light. The photo uses a close-up, eye-level, and centered composition to create a realistic effect.) </td>
<td align="center">在白天的森林中,一位穿着绿色上衣的亚洲女性站在大象旁边。照片采用了中景、平视和居中构图的方式,呈现出写实的效果。这张照片蕴含了人物摄影文化,并展现了宁静的氛围<br>(In the daytime forest, an Asian woman wearing a green shirt stands beside an elephant. The photo uses a medium shot, eye-level, and centered composition to create a realistic effect. This picture embodies the character photography culture and conveys a serene atmosphere.) </td>
</tr>
<tr>
<td align="center"><img src="asset/input/canny.jpg" alt="Image 0" width="200"/></td>
<td align="center"><img src="asset/input/depth.jpg" alt="Image 1" width="200"/></td>
<td align="center"><img src="asset/input/pose.jpg" alt="Image 2" width="200"/></td>
</tr>
<tr>
<td colspan="3" align="center">ControlNet Output</td>
</tr>
<tr>
<td align="center"><img src="asset/output/canny.jpg" alt="Image 3" width="200"/></td>
<td align="center"><img src="asset/output/depth.jpg" alt="Image 4" width="200"/></td>
<td align="center"><img src="asset/output/pose.jpg" alt="Image 5" width="200"/></td>
</tr>
</table>
### Training
We utilize [**DWPose**](https://github.com/IDEA-Research/DWPose) for pose extraction. Please follow their guidelines to download the checkpoints and save them in the `hydit/annotator/ckpts` directory. We provide several commands for a quick installation:
```bash
mkdir ./hydit/annotator/ckpts
wget -O ./hydit/annotator/ckpts/dwpose.zip https://dit.hunyuan.tencent.com/download/HunyuanDiT/dwpose.zip
unzip ./hydit/annotator/ckpts/dwpose.zip -d ./hydit/annotator/ckpts/
```
Additionally, ensure that you install the related dependencies.
```bash
pip install matplotlib==3.7.5
pip install onnxruntime_gpu==1.16.3
pip install opencv-python==4.8.1.78
```
We provide three types of weights for ControlNet training, `ema`, `module` and `distill`; you can choose according to the results you observe. By default, we use the `distill` weights.
Here is an example where we load the `distill` weights into the main model and conduct ControlNet training.
If you apply multi-resolution training, you also need to add the `--multireso` and `--reso-step 64` parameters.
```bash
task_flag="canny_controlnet" # the task flag is used to identify folders.
control_type=canny
resume_module_root=./ckpts/t2i/model/pytorch_model_distill.pt # checkpoint root for resume
index_file=/path/to/your/indexfile # index file for dataloader
results_dir=./log_EXP # save root for results
batch_size=1 # training batch size
image_size=1024 # training image resolution
grad_accu_steps=2 # gradient accumulation
warmup_num_steps=0 # warm-up steps
lr=0.0001 # learning rate
ckpt_every=10000 # save a checkpoint every few steps.
ckpt_latest_every=5000 # save a checkpoint named `latest.pt` every few steps.
epochs=100 # total training epochs
sh $(dirname "$0")/run_g_controlnet.sh \
--task-flag ${task_flag} \
--control-type ${control_type} \
--noise-schedule scaled_linear --beta-start 0.00085 --beta-end 0.018 \
--predict-type v_prediction \
--uncond-p 0.44 \
--uncond-p-t5 0.44 \
--index-file ${index_file} \
--random-flip \
--lr ${lr} \
--batch-size ${batch_size} \
--image-size ${image_size} \
--global-seed 999 \
--grad-accu-steps ${grad_accu_steps} \
--warmup-num-steps ${warmup_num_steps} \
--use-flash-attn \
--use-fp16 \
--results-dir ${results_dir} \
--resume \
--resume-module-root ${resume_module_root} \
--epochs ${epochs} \
--ckpt-every ${ckpt_every} \
--ckpt-latest-every ${ckpt_latest_every} \
--log-every 10 \
--deepspeed \
--deepspeed-optimizer \
--use-zero-stage 2 \
--gradient-checkpointing \
"$@"
```
Recommended parameter settings
| Parameter | Description | Recommended Parameter Value | Note|
|:---------------:|:---------:|:---------------------------------------------------:|:--:|
| `--batch-size` | Training batch size | 1 | Depends on GPU memory|
| `--grad-accu-steps` | Size of gradient accumulation | 2 | - |
| `--lr` | Learning rate | 0.0001 | - |
| `--control-type` | ControlNet condition type; three types are currently supported (canny, depth, pose) | / | - |
### Inference
You can use the following command lines for inference.
a. You can use a float to apply the same weight to all layers, **or a list to specify a separate weight for each layer**, for example `'[1.0 * (0.825 ** float(19 - i)) for i in range(19)]'`.
```bash
python sample_controlnet.py --infer-mode fa --control-weight "[1.0 * (0.825 ** float(19 - i)) for i in range(19)]" --no-enhance --load-key distill --infer-steps 50 --control-type canny --prompt "在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围" --condition-image-path controlnet/asset/input/canny.jpg
```
b. Using canny ControlNet during inference
```bash
python sample_controlnet.py --infer-mode fa --control-weight 1.0 --no-enhance --load-key distill --infer-steps 50 --control-type canny --prompt "在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围" --condition-image-path controlnet/asset/input/canny.jpg
```
c. Using depth ControlNet during inference
```bash
python sample_controlnet.py --infer-mode fa --control-weight 1.0 --no-enhance --load-key distill --infer-steps 50 --control-type depth --prompt "在茂密的森林中,一只黑白相间的熊猫静静地坐在绿树红花中,周围是山川和海洋。背景是白天的森林,光线充足。照片采用特写、平视和居中构图的方式,呈现出写实的效果" --condition-image-path controlnet/asset/input/depth.jpg
```
d. Using pose ControlNet during inference
```bash
python3 sample_controlnet.py --infer-mode fa --control-weight 1.0 --no-enhance --load-key distill --infer-steps 50 --control-type pose --prompt "在白天的森林中,一位穿着绿色上衣的亚洲女性站在大象旁边。照片采用了中景、平视和居中构图的方式,呈现出写实的效果。这张照片蕴含了人物摄影文化,并展现了宁静的氛围" --condition-image-path controlnet/asset/input/pose.jpg
```
## HunyuanDiT ControlNet v1.1
### Instructions
Download the v1.1 base model and ControlNet using the following commands:
```bash
cd HunyuanDiT
# Use the huggingface-cli tool to download the model.
# We recommend using distilled weights as the base model for ControlNet inference, as our provided pretrained weights are trained on them.
huggingface-cli download Tencent-Hunyuan/HYDiT-ControlNet-v1.1 --local-dir ./HunyuanDiT-v1.1/t2i/controlnet
huggingface-cli download Tencent-Hunyuan/Distillation-v1.1 ./pytorch_model_distill.pt --local-dir ./HunyuanDiT-v1.1/t2i/model
```
### Training
```bash
task_flag="canny_controlnet" # the task flag is used to identify folders.
control_type=canny
resume_module_root=./ckpts/t2i/model/pytorch_model_distill.pt # checkpoint root for resume
index_file=/path/to/your/indexfile # index file for dataloader
results_dir=./log_EXP # save root for results
batch_size=1 # training batch size
image_size=1024 # training image resolution
grad_accu_steps=2 # gradient accumulation
warmup_num_steps=0 # warm-up steps
lr=0.0001 # learning rate
ckpt_every=10000 # save a checkpoint every few steps.
ckpt_latest_every=5000 # save a checkpoint named `latest.pt` every few steps.
epochs=100 # total training epochs
sh $(dirname "$0")/run_g_controlnet.sh \
--task-flag ${task_flag} \
--control-type ${control_type} \
--noise-schedule scaled_linear --beta-start 0.00085 --beta-end 0.03 \
--predict-type v_prediction \
--multireso \
--reso-step 64 \
--uncond-p 0.44 \
--uncond-p-t5 0.44 \
--index-file ${index_file} \
--random-flip \
--lr ${lr} \
--batch-size ${batch_size} \
--image-size ${image_size} \
--global-seed 999 \
--grad-accu-steps ${grad_accu_steps} \
--warmup-num-steps ${warmup_num_steps} \
--use-flash-attn \
--use-fp16 \
--results-dir ${results_dir} \
--resume \
--resume-module-root ${resume_module_root} \
--epochs ${epochs} \
--ckpt-every ${ckpt_every} \
--ckpt-latest-every ${ckpt_latest_every} \
--log-every 10 \
--deepspeed \
--deepspeed-optimizer \
--use-zero-stage 2 \
--use-style-cond \
--size-cond 1024 1024 \
"$@"
```
### Inference
You can use the following command line for inference.
a. Using canny ControlNet during inference
```bash
python3 sample_controlnet.py --no-enhance --load-key distill --infer-steps 50 --control-type canny --prompt "在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围" --condition-image-path controlnet/asset/input/canny.jpg --control-weight 1.0 --use-style-cond --size-cond 1024 1024 --beta-end 0.03
```
b. Using depth ControlNet during inference
```bash
python3 sample_controlnet.py --no-enhance --load-key distill --infer-steps 50 --control-type depth --prompt "在茂密的森林中,一只黑白相间的熊猫静静地坐在绿树红花中,周围是山川和海洋。背景是白天的森林,光线充足" --condition-image-path controlnet/asset/input/depth.jpg --control-weight 1.0 --use-style-cond --size-cond 1024 1024 --beta-end 0.03
```
c. Using pose ControlNet during inference
```bash
python3 sample_controlnet.py --no-enhance --load-key distill --infer-steps 50 --control-type pose --prompt "一位亚洲女性,身穿绿色上衣,戴着紫色头巾和紫色围巾,站在黑板前。背景是黑板。照片采用近景、平视和居中构图的方式呈现真实摄影风格" --condition-image-path controlnet/asset/input/pose.jpg --control-weight 1.0 --use-style-cond --size-cond 1024 1024 --beta-end 0.03
```
task_flag="canny_controlnet" # the task flag is used to identify folders.
control_type=canny
resume_module_root=./ckpts/t2i/model/pytorch_model_distill.pt # checkpoint root for resume
index_file=dataset/porcelain/jsons/porcelain.json # index file for dataloader
results_dir=./log_EXP # save root for results
batch_size=1 # training batch size
image_size=1024 # training image resolution
grad_accu_steps=2 # gradient accumulation
warmup_num_steps=0 # warm-up steps
lr=0.0001 # learning rate
ckpt_every=1000 # create a ckpt every a few steps.
ckpt_latest_every=200 # create a ckpt named `latest.pt` every a few steps.
epochs=100 # total training epochs
PYTHONPATH=. \
sh ./hydit/run_g_controlnet.sh \
--task-flag ${task_flag} \
--control-type ${control_type} \
--noise-schedule scaled_linear --beta-start 0.00085 --beta-end 0.018 \
--predict-type v_prediction \
--uncond-p 0.44 \
--uncond-p-t5 0.44 \
--index-file ${index_file} \
--random-flip \
--lr ${lr} \
--batch-size ${batch_size} \
--image-size ${image_size} \
--global-seed 999 \
--grad-accu-steps ${grad_accu_steps} \
--warmup-num-steps ${warmup_num_steps} \
--use-flash-attn \
--use-fp16 \
--results-dir ${results_dir} \
--resume \
--resume-module-root ${resume_module_root} \
--epochs ${epochs} \
--ckpt-every ${ckpt_every} \
--ckpt-latest-every ${ckpt_latest_every} \
--log-every 10 \
--deepspeed \
--deepspeed-optimizer \
--use-zero-stage 2 \
--gradient-checkpointing \
"$@"
Dataset filter configuration for the porcelain example:
```yaml
source:
  - ./dataset/porcelain/arrows/00000.arrow
filter:
  column:
    - name: height
      type: int
      action: ge
      target: 1024
      default: 1024
    - name: width
      type: int
      action: ge
      target: 1024
      default: 1024
```
Multi-resolution (`multireso`) index configuration:
```yaml
src:
  - ./dataset/porcelain/jsons/porcelain.json
base_size: 1024
reso_step: 64
min_size: 1024
```
# Hunyuan-DiT + 🤗 Diffusers
You can use Hunyuan-DiT with the 🤗 Diffusers library. Before using the pipelines, please install the latest version of 🤗 Diffusers with:
```bash
pip install git+https://github.com/huggingface/diffusers.git
```
## Inference with the Base Model
You can generate images with both Chinese and English prompts using the following Python script:
```py
import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16)
pipe.to("cuda")
# You may also use English prompt as HunyuanDiT supports both English and Chinese
# prompt = "An astronaut riding a horse"
prompt = "一个宇航员在骑马"
image = pipe(prompt).images[0]
```
You can use our distilled model to generate images even faster:
```py
import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers-Distilled", torch_dtype=torch.float16)
pipe.to("cuda")
# You may also use English prompt as HunyuanDiT supports both English and Chinese
# prompt = "An astronaut riding a horse"
prompt = "一个宇航员在骑马"
image = pipe(prompt, num_inference_steps=25).images[0]
```
More details can be found in [HunyuanDiT-v1.2-Diffusers-Distilled](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers-Distilled).
## LoRA
LoRA can be integrated with Hunyuan-DiT inside the 🤗 Diffusers framework.
The following example loads and uses the pre-trained LoRA. To try it, please start by downloading our pre-trained LoRA checkpoints:
```bash
huggingface-cli download Tencent-Hunyuan/HYDiT-LoRA --local-dir ./ckpts/t2i/lora
```
Then run the following code snippet to use the jade LoRA:
```python
import torch
from diffusers import HunyuanDiTPipeline

### convert checkpoint to diffusers format
num_layers = 40
def load_hunyuan_dit_lora(transformer_state_dict, lora_state_dict, lora_scale):
    for i in range(num_layers):
        Wqkv = torch.matmul(lora_state_dict[f"blocks.{i}.attn1.Wqkv.lora_B.weight"], lora_state_dict[f"blocks.{i}.attn1.Wqkv.lora_A.weight"])
        q, k, v = torch.chunk(Wqkv, 3, dim=0)
        transformer_state_dict[f"blocks.{i}.attn1.to_q.weight"] += lora_scale * q
        transformer_state_dict[f"blocks.{i}.attn1.to_k.weight"] += lora_scale * k
        transformer_state_dict[f"blocks.{i}.attn1.to_v.weight"] += lora_scale * v

        out_proj = torch.matmul(lora_state_dict[f"blocks.{i}.attn1.out_proj.lora_B.weight"], lora_state_dict[f"blocks.{i}.attn1.out_proj.lora_A.weight"])
        transformer_state_dict[f"blocks.{i}.attn1.to_out.0.weight"] += lora_scale * out_proj

        q_proj = torch.matmul(lora_state_dict[f"blocks.{i}.attn2.q_proj.lora_B.weight"], lora_state_dict[f"blocks.{i}.attn2.q_proj.lora_A.weight"])
        transformer_state_dict[f"blocks.{i}.attn2.to_q.weight"] += lora_scale * q_proj

        kv_proj = torch.matmul(lora_state_dict[f"blocks.{i}.attn2.kv_proj.lora_B.weight"], lora_state_dict[f"blocks.{i}.attn2.kv_proj.lora_A.weight"])
        k, v = torch.chunk(kv_proj, 2, dim=0)
        transformer_state_dict[f"blocks.{i}.attn2.to_k.weight"] += lora_scale * k
        transformer_state_dict[f"blocks.{i}.attn2.to_v.weight"] += lora_scale * v

        out_proj = torch.matmul(lora_state_dict[f"blocks.{i}.attn2.out_proj.lora_B.weight"], lora_state_dict[f"blocks.{i}.attn2.out_proj.lora_A.weight"])
        transformer_state_dict[f"blocks.{i}.attn2.to_out.0.weight"] += lora_scale * out_proj

    q_proj = torch.matmul(lora_state_dict["pooler.q_proj.lora_B.weight"], lora_state_dict["pooler.q_proj.lora_A.weight"])
    transformer_state_dict["time_extra_emb.pooler.q_proj.weight"] += lora_scale * q_proj

    return transformer_state_dict

### use the diffusers pipeline with lora
pipe = HunyuanDiTPipeline.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16)
pipe.to("cuda")

from safetensors import safe_open

lora_state_dict = {}
with safe_open("./ckpts/t2i/lora/jade/adapter_model.safetensors", framework="pt", device=0) as f:
    for k in f.keys():
        lora_state_dict[k[17:]] = f.get_tensor(k)  # remove 'basemodel.model'

transformer_state_dict = pipe.transformer.state_dict()
transformer_state_dict = load_hunyuan_dit_lora(transformer_state_dict, lora_state_dict, lora_scale=1.0)
pipe.transformer.load_state_dict(transformer_state_dict)

prompt = "玉石绘画风格,一只猫在追蝴蝶"
image = pipe(
    prompt,
    num_inference_steps=100,
    guidance_scale=6.0,
).images[0]
image.save('img.png')
```
You can control the strength of LoRA by changing the `lora_scale` parameter.
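For example, since `load_hunyuan_dit_lora` adds the LoRA deltas into the transformer weights in place, trying a different strength means reloading the base pipeline first and merging again with the new scale (a minimal sketch reusing the names from the snippet above):
```python
# Reload the base pipeline so the previous merge is discarded, then merge at half strength.
pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16
).to("cuda")
transformer_state_dict = pipe.transformer.state_dict()
transformer_state_dict = load_hunyuan_dit_lora(transformer_state_dict, lora_state_dict, lora_scale=0.5)
pipe.transformer.load_state_dict(transformer_state_dict)
image = pipe(prompt, num_inference_steps=100, guidance_scale=6.0).images[0]
```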
## ControlNet
Hunyuan-DiT + ControlNet is supported in 🤗 Diffusers. The following example shows how to use Hunyuan-DiT + Canny ControlNet.
```py
from diffusers import HunyuanDiT2DControlNetModel, HunyuanDiTControlNetPipeline
import torch
controlnet = HunyuanDiT2DControlNetModel.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.2-ControlNet-Diffusers-Canny", torch_dtype=torch.float16)
pipe = HunyuanDiTControlNetPipeline.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", controlnet=controlnet, torch_dtype=torch.float16)
pipe.to("cuda")
from diffusers.utils import load_image
cond_image = load_image('https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2-ControlNet-Diffusers-Canny/resolve/main/canny.jpg?download=true')
## You may also use English prompt as HunyuanDiT supports both English and Chinese
prompt="在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围"
#prompt="At night, an ancient Chinese-style lion statue stands in front of the hotel, its eyes gleaming as if guarding the building. The background is the hotel entrance at night, with a close-up, eye-level, and centered composition. This photo presents a realistic photographic style, embodies Chinese sculpture culture, and reveals a mysterious atmosphere."
image = pipe(
prompt,
height=1024,
width=1024,
control_image=cond_image,
num_inference_steps=50,
).images[0]
```
There are other pre-trained ControlNets available; please have a look at [the official Hugging Face page of the Tencent Hunyuan team](https://huggingface.co/Tencent-Hunyuan).
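Other condition types can be used by swapping the `controlnet` checkpoint. The repository id below follows the Canny naming pattern and is an assumption; verify the exact id on the Tencent-Hunyuan page:
```py
# Assumed repo id (mirrors the Canny naming above); check the hub for the exact name.
controlnet = HunyuanDiT2DControlNetModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-ControlNet-Diffusers-Depth", torch_dtype=torch.float16
)
pipe = HunyuanDiTControlNetPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
```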
Conda environment specification:
```yaml
name: HunyuanDiT
channels:
  - pytorch
  - nvidia
dependencies:
  - python=3.8.12
  - pytorch=1.13.1
  - pip
```
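A typical way to create this environment, assuming the block above is saved as `environment.yml`:
```bash
conda env create -f environment.yml
conda activate HunyuanDiT
```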
一只聪明的狐狸走在阔叶树林里, 旁边是一条小溪, 细节真实, 摄影
湖水清澈,天空湛蓝,阳光灿烂。一只优雅的白天鹅在湖边游泳。它周围有几只小鸭子,看起来非常可爱,整个画面给人一种宁静祥和的感觉。
太阳微微升起,花园里的玫瑰花瓣上露珠晶莹剔透,一只瓢虫正在爬向露珠,背景是清晨的花园,微距镜头
一位女明星,中国人,头发是黑色,衣服是纯白色短袖,人物风格清新,城市背景
后印象主义风格,一条古老的石板路上面散落着金黄色的树叶。路旁的风车在静谧地转动,后面竖着两个风车。背景是一片向日葵田,蓝天上飘着几朵白云
一幅细致的油画描绘了一只年轻獾轻轻嗅着一朵明亮的黄色玫瑰时错综复杂的皮毛。背景是一棵大树干的粗糙纹理,獾的爪子轻轻地挖进树皮。在柔和的背景中,一个宁静的瀑布倾泻而下,它的水在绿色植物中闪烁着蓝色。
渔舟唱晚
请将杞人忧天的样子画出来
一只长靴猫手持亮银色的宝剑,身着铠甲,眼神坚毅,站在一堆金币上,背景是暗色调的洞穴,图像上有金币的光影点缀。
插画风格,一只狐狸和一只刺猬坐在水边的石头上,刺猬手里拿着一杯茶,狐狸旁边放着一个玻璃杯。周围是茂密的绿色植物和树木,阳光透过树叶洒在水面上,画面宁静温馨。
泥塑风格,一座五彩斑斓的花园在画面中展现,各种各样的花朵,绿色的叶子和一只正在嬉戏的小猫形成了一幅生动的图像,背景是蓝天和白云
枯藤老树昏鸦,小桥流水人家
一张细致的照片捕捉到了一尊雕像的形象,这尊雕像酷似一位古代法老,头上出人意料地戴着一副青铜蒸汽朋克护目镜。这座雕像穿着复古时髦,一件清爽的白色T恤和一件合身的黑色皮夹克,与传统的头饰形成鲜明对比。背景是简单的纯色,突出了雕像的非传统服装和蒸汽朋克眼镜的复杂细节。
一朵鲜艳的红色玫瑰花,花瓣撒有一些水珠,晶莹剔透,特写镜头,
一只可爱的猫, 细节真实, 摄影
飞流直下三千尺,疑是银河落九天
成语“鲤鱼跃龙门”
一颗新鲜的草莓特写,红色的外表,表面布满许多种子,背景是淡绿色的叶子
九寨沟
摄影风格,在画面中心是一盘热气腾腾的麻婆豆腐,豆腐呈白色,上面撒着一层红色的辣酱,有些许绿色的葱花点缀,背景是深色木质餐桌,桌子上放有辣椒和葱花作为点缀。
一位年轻女子站在春季的火车站月台上。她身着蓝灰色长风衣,白色衬衫。她的深棕色头发扎成低马尾,几缕碎发随风飘扬。她的眼神充满期待,阳光洒在她温暖的脸庞上。
一只优雅的白鹤在湖边静静地站立,它的身体纯白色,翅膀轻轻展开,背景是湖面和远处的山脉
国画风格,苏州园林中的小桥流水,周围是郁郁葱葱的树,池塘里有几朵绽放的荷花,背景是宁静的江南水乡
现实主义风格,画面主要描述一个巴洛克风格的花瓶,带有金色的装饰边框,花瓶上盛开着各种色彩鲜艳的花,白色背景
醉后不知天在水,满船清梦压星河
长城
一个亚洲中年男士在夕阳下的公园长椅上静坐。他穿着一件深蓝色的针织毛衣和灰色裤子。他的头发略显花白,手中拿着一本敞开的书。面带微笑,眼神温和,周围是落日余晖和四周的绿树。
风格是写实,画面主要描述一个亚洲戏曲艺术家正在表演,她穿着华丽的戏服,脸上戴着精致的面具,身姿优雅,背景是古色古香的舞台,镜头是近景
The following DWPose detector wrapper performs the pose extraction described in the Training section above:
```python
# Openpose
# Original from CMU https://github.com/CMU-Perceptual-Computing-Lab/openpose
# 2nd Edited by https://github.com/Hzzone/pytorch-openpose
# 3rd Edited by ControlNet
# 4th Edited by ControlNet (added face and correct hands)
import os
import random

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import torch
import numpy as np

from . import util
from .wholebody import Wholebody


def draw_pose(pose, H, W, draw_body=True):
    bodies = pose["bodies"]
    faces = pose["faces"]
    hands = pose["hands"]
    candidate = bodies["candidate"]
    subset = bodies["subset"]
    canvas = np.zeros(shape=(H, W, 3), dtype=np.uint8)
    if draw_body:
        canvas = util.draw_bodypose(canvas, candidate, subset)
    canvas = util.draw_handpose(canvas, hands)
    canvas = util.draw_facepose(canvas, faces)
    return canvas


def keypoint2bbox(keypoints):
    valid_keypoints = keypoints[
        keypoints[:, 0] >= 0
    ]  # Ignore keypoints with confidence 0
    if len(valid_keypoints) == 0:
        return np.zeros(4)
    x_min, y_min = np.min(valid_keypoints, axis=0)
    x_max, y_max = np.max(valid_keypoints, axis=0)
    return np.array([x_min, y_min, x_max, y_max])

def expand_bboxes(bboxes, expansion_rate=0.5, image_shape=(0, 0)):
    expanded_bboxes = []
    for bbox in bboxes:
        x_min, y_min, x_max, y_max = map(int, bbox)
        width = x_max - x_min
        height = y_max - y_min

        # Expand the width and height
        new_width = width * (1 + expansion_rate)
        new_height = height * (1 + expansion_rate)

        # Compute the new bounding-box coordinates
        x_min_new = max(0, x_min - (new_width - width) / 2)
        x_max_new = min(image_shape[1], x_max + (new_width - width) / 2)
        y_min_new = max(0, y_min - (new_height - height) / 2)
        y_max_new = min(image_shape[0], y_max + (new_height - height) / 2)

        expanded_bboxes.append([x_min_new, y_min_new, x_max_new, y_max_new])
    return expanded_bboxes


def create_mask(image_width, image_height, bboxs):
    mask = np.zeros((image_height, image_width), dtype=np.float32)
    for bbox in bboxs:
        x1, y1, x2, y2 = map(int, bbox)
        mask[y1 : y2 + 1, x1 : x2 + 1] = 1.0
    return mask

threshold = 0.4


class DWposeDetector:
    def __init__(self):
        self.pose_estimation = Wholebody()

    def __call__(
        self, oriImg, return_index=False, return_yolo=False, return_mask=False
    ):
        oriImg = oriImg.copy()
        H, W, C = oriImg.shape
        with torch.no_grad():
            candidate, subset = self.pose_estimation(oriImg)
            candidate = (
                np.zeros((1, 134, 2), dtype=np.float32)
                if candidate is None
                else candidate
            )
            subset = np.zeros((1, 134), dtype=np.float32) if subset is None else subset
            nums, keys, locs = candidate.shape
            candidate[..., 0] /= float(W)
            candidate[..., 1] /= float(H)

            if return_yolo:
                candidate[subset < threshold] = -0.1
                subset = np.expand_dims(subset >= threshold, axis=-1)
                keypoint = np.concatenate([candidate, subset], axis=-1)
                # return pose + hand
                return np.concatenate([keypoint[:, :18], keypoint[:, 92:]], axis=1)

            body = candidate[:, :18].copy()
            body = body.reshape(nums * 18, locs)
            score = subset[:, :18]
            for i in range(len(score)):
                for j in range(len(score[i])):
                    if score[i][j] > threshold:
                        score[i][j] = int(18 * i + j)
                    else:
                        score[i][j] = -1
            un_visible = subset < threshold
            candidate[un_visible] = -1

            foot = candidate[:, 18:24]
            faces = candidate[:, 24:92]
            hands1 = candidate[:, 92:113]
            hands2 = candidate[:, 113:]
            hands = np.vstack([hands1, hands2])

            hands_ = hands[hands.max(axis=(1, 2)) > 0]
            if len(hands_) == 0:
                bbox = np.array([0, 0, 0, 0]).astype(int)
            else:
                hand_random = random.choice(hands_)
                bbox = (keypoint2bbox(hand_random) * H).astype(int)  # [0, 1] -> [h, w]

            bodies = dict(candidate=body, subset=score)
            pose = dict(bodies=bodies, hands=hands, faces=faces)

            if return_mask:
                bbox = [(keypoint2bbox(hand) * H).astype(int) for hand in hands_]
                # bbox = expand_bboxes(bbox, expansion_rate=0.5, image_shape=(H, W))
                mask = create_mask(W, H, bbox)
                return draw_pose(pose, H, W), mask

            if return_index:
                return pose
            else:
                return draw_pose(pose, H, W), bbox
```
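A hedged usage sketch for the detector above (the image path is illustrative, and the DWPose checkpoints are expected under `hydit/annotator/ckpts` as described in the Training section):
```python
import cv2

# Illustrative input image; any HxWx3 uint8 array works.
img = cv2.imread("person.jpg")

detector = DWposeDetector()
pose_map, hand_bbox = detector(img)  # default call: drawn pose canvas plus one hand bbox
cv2.imwrite("pose.jpg", pose_map)
```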