"vscode:/vscode.git/clone" did not exist on "0b3ddec6540d7fc7fb59c1b6184a5e6c9e1d32e0"
Commit 2b3ebe0c authored by luopl's avatar luopl
Browse files

Initial commit

parents
Pipeline #2705 canceled with stages
tests
debug
dev
*.egg-info
__pycache__
lib
results
MIT License
Copyright (c) 2025 stepfun-ai
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Step-Video-TI2V
## Paper
`Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model`
- https://arxiv.org/abs/2503.11251
## Model Architecture
Step-Video-TI2V is trained on top of Step-Video-T2V and introduces two key additions: an image condition and a motion condition. These enable video generation from a given image while letting users adjust how dynamic the output video is.
<div align=center>
<img src="./assets/model.png"/>
</div>
## Algorithm
To use the image condition as the first frame of the generated video, Step-Video-TI2V
encodes it into a latent representation with Step-Video-T2V's Video-VAE and concatenates it along the channel dimension of the video latent.
In addition, a motion-score condition lets users control how dynamic the video generated from the image condition is. A minimal sketch of the channel-dimension concatenation idea follows.
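The sketch below only illustrates the channel-dimension concatenation described above; the tensor names, shapes, and the zero-padding of later frames are assumptions for illustration, not the actual Step-Video-TI2V implementation.
```
import torch

def concat_image_condition(video_latent: torch.Tensor,
                           first_frame_latent: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: video_latent [B, C, T, H, W], first_frame_latent [B, C, 1, H, W]."""
    image_cond = torch.zeros_like(video_latent)
    image_cond[:, :, :1] = first_frame_latent             # image condition occupies the first-frame slot
    return torch.cat([video_latent, image_cond], dim=1)   # -> [B, 2C, T, H, W]
```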
## Environment Setup
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the image pulled above
docker run -it --name TI2V_test --shm-size=1024G --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/Step-Video-TI2V_pytorch:/home/Step-Video-TI2V_pytorch <your IMAGE ID> /bin/bash
cd /home/Step-Video-TI2V_pytorch
pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
# Note: adjust the xfuser package path in fix.sh to match where the package is installed on your system
sh fix.sh
```
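Optionally, confirm that the DCUs are visible inside the container before installing; the scripts in this repository address the devices through PyTorch's CUDA/HIP interface, so a quick check (illustrative only) looks like this:
```
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```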
### Dockerfile (Option 2)
```
cd /home/Step-Video-TI2V_pytorch/docker
docker build --no-cache -t step-video-ti2v:latest .
docker run -it --name TI2V_test --shm-size=1024G --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/Step-Video-TI2V_pytorch:/home/Step-Video-TI2V_pytorch step-video-ti2v:latest /bin/bash
cd /home/Step-Video-TI2V_pytorch
pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
# Note: adjust the xfuser package path in fix.sh to match where the package is installed on your system
sh fix.sh
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the 光合 (SourceFind) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk25.04
python: python3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
flash-attn: 2.6.1
```
`Tips: the versions of the DTK driver, Python, torch, and the other DCU-related tools above must match exactly; a quick version check is shown after this list.`
2. Install the remaining standard dependencies according to requirements.txt:
```
cd /home/Step-Video-TI2V_pytorch
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
# Note: adjust the xfuser package path in fix.sh to match where the package is installed on your system
sh fix.sh
```
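To verify that the installed versions match the list above, a quick check such as the following can help (illustrative only; it assumes the packages are importable under these names):
```
python -c "import torch, torchvision, triton, flash_attn; print(torch.__version__, torchvision.__version__, triton.__version__, flash_attn.__version__)"
```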
## Dataset
`N/A`
## Training
`N/A`
## Inference
Expected directory layout for the pretrained weights:
```
/home/Step-Video-TI2V_pytorch
└── stepfun-ai/stepvideo-ti2v
```
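For reference, `api/call_remote_server.py` looks for the VAE, text encoder, and CLIP weights in the `vae/`, `step_llm/`, and `hunyuan_clip/` subdirectories of the model directory (its `--vae_dir`/`--llm_dir`/`--clip_dir` defaults), so a fully downloaded checkpoint is expected to contain roughly:
```
stepfun-ai/stepvideo-ti2v
├── vae/            # Video-VAE weights (vae_v2.safetensors)
├── step_llm/       # STEP1 text encoder
├── hunyuan_clip/   # Hunyuan CLIP text encoder
└── ...             # remaining pipeline components from the Hugging Face repo
```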
### Single node, multiple DCUs
```
# Adjust TORCH_CUDA_ARCH_LIST to match your DCU architecture
export TORCH_CUDA_ARCH_LIST="8.0"
# Replace where_you_download_dir with the path to your downloaded model
HIP_VISIBLE_DEVICES=0 python api/call_remote_server.py --model_dir where_you_download_dir &
# To avoid running out of device memory, run the server and the client on different card IDs; the other parameters in run.sh can also be adjusted to your hardware
export HIP_VISIBLE_DEVICES=1
sh run.sh
```
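The first command starts a Flask service (`api/call_remote_server.py` in this repository) that exposes the text encoder and the Video-VAE as `/caption-api`, `/vae-api`, and `/vae-encode-api` endpoints, exchanging pickled tensors over HTTP; `run.sh` then launches the sampling client against it. A minimal sketch of a manual query, assuming the default port 8080 (the pipeline normally wraps these calls for you):
```
import pickle
import requests

# Query the caption endpoint started by api/call_remote_server.py (default port 8080).
payload = pickle.dumps({"prompts": ["笑起来"]})
resp = requests.get("http://127.0.0.1:8080/caption-api", data=payload, timeout=600)
embeddings = pickle.loads(resp.content)   # dict with 'y', 'y_mask', 'clip_embedding'
print({k: v.shape for k, v in embeddings.items()})
```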
For more details, see the upstream project's [`README_orgin`](./README_orgin.md).
## Result
Example of a generated video:
![infer result](./assets/笑起来-2025-05-13.mp4)
### Accuracy
`N/A`
## Application Scenarios
### Algorithm Category
`Video generation`
### Key Industries
`Film and TV, e-commerce, education, broadcast media`
## Pretrained Weights
The Hugging Face weights can be downloaded from:
- [stepfun-ai/stepvideo-ti2v](https://huggingface.co/stepfun-ai/stepvideo-ti2v)
`Note: downloading through a mirror endpoint is recommended: export HF_ENDPOINT=https://hf-mirror.com`
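A minimal download sketch using the mirror above (requires the `huggingface_hub` package; the `local_dir` path is only an example):
```
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"   # set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download(repo_id="stepfun-ai/stepvideo-ti2v",
                  local_dir="./stepfun-ai/stepvideo-ti2v")
```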
## Source Repository and Issue Reporting
- https://developer.sourcefind.cn/codes/modelzoo/step-video-ti2v_pytorch
## References
- https://github.com/stepfun-ai/Step-Video-TI2V
<p align="center">
<img src="assets/logo.png" height=100>
</p>
<div align="center">
<a href="https://yuewen.cn/videos"><img src="https://img.shields.io/static/v1?label=Step-Video&message=Web&color=green"></a> &ensp;
<a href="https://arxiv.org/abs/2503.11251"><img src="https://img.shields.io/static/v1?label=Tech Report&message=Arxiv&color=red"></a> &ensp;
<a href="https://x.com/StepFun_ai"><img src="https://img.shields.io/static/v1?label=X.com&message=Web&color=blue"></a> &ensp;
</div>
<div align="center">
<a href="https://huggingface.co/stepfun-ai/stepvideo-ti2v"><img src="https://img.shields.io/static/v1?label=Step-Video-TI2V&message=HuggingFace&color=yellow"></a> &ensp;
</div>
## 🔥🔥🔥 News!!
* Mar 17, 2025: 👋 We release the inference code and model weights of Step-Video-TI2V. [Download](https://huggingface.co/stepfun-ai/stepvideo-ti2v)
* Mar 17, 2025: 👋 We release a new TI2V benchmark [Step-Video-TI2V-Eval](https://github.com/stepfun-ai/Step-Video-TI2V/tree/main/benchmark/Step-Video-TI2V-Eval)
* Mar 17, 2025: 👋 Step-Video-TI2V has been integrated into [ComfyUI-Stepvideo-ti2v](https://github.com/stepfun-ai/ComfyUI-StepVideo). Enjoy!
* Mar 17, 2025: 🎉 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2503.11251)
## Motion Control
<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
<tr>
<th style="width: 33%;">战马跳跃</th>
<th style="width: 33%;">战马蹲下</th>
<th style="width: 33%;">战马向前奔跑,然后转身</th>
</tr>
<tr>
<td><video src="https://github.com/user-attachments/assets/e664f45c-b8cd-4f89-9858-eaaef54aa0f6" width="30%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/eb2d09b0-cc37-4f27-85c7-a31b6840fa69" width="30%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/d17eba41-82f6-4ee2-8a99-3f21af112af0" width="30%" controls autoplay loop muted></video></td>
</tr>
</table>
## Motion Dynamics Control
<table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
<tr>
<th style="width: 33%;">两名男子在互相拳击,镜头环绕两人拍摄。(motion_score: 2)</th>
<th style="width: 33%;">两名男子在互相拳击,镜头环绕两人拍摄。(motion_score: 5)</th>
<th style="width: 33%;">两名男子在互相拳击,镜头环绕两人拍摄。(motion_score: 20)</th>
</tr>
<tr>
<td><video src="https://github.com/user-attachments/assets/31c48385-fe83-4961-bd42-7bd2b1edeb19" width="33%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/913a407e-55ca-4a33-bafe-bd5e38eec5f5" width="33%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/119a3673-014f-4772-b846-718307a4a412" width="33%" controls autoplay loop muted></video></td>
</tr>
</table>
🎯 Tips:
The default motion_score = 5 is suitable for general use. If you need more stability, set motion_score = 2, though it may lack dynamism in certain movements. For greater movement flexibility, you can use motion_score = 10 or motion_score = 20 to enable more intense actions. Feel free to customize the motion_score based on your creative needs to fit different use cases.
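For example, keeping the prompt and image fixed and changing only `--motion_score` in the inference command from Section 4.2 below reproduces this trade-off (all other values are illustrative):
```bash
# Stable, less dynamic
torchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url --ulysses_degree $parallel --prompt "两名男子在互相拳击,镜头环绕两人拍摄。" --first_image_path ./assets/demo.png --motion_score 2
# Default, general use:   --motion_score 5
# More intense motion:    --motion_score 10   # or 20
```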
## Camera Control
<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
<tr>
<th style="width: 33%;">镜头环绕女孩,女孩在跳舞</th>
<th style="width: 33%;">镜头缓慢推进,女孩在跳舞</th>
<th style="width: 33%;">镜头拉远,女孩在跳舞</th>
</tr>
<tr>
<td><video src="https://github.com/user-attachments/assets/257847bc-5967-45ba-a649-505859476aad" height="30%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/d310502a-4f7e-4a78-882f-95c46b4dfe67" height="30%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/f6426fc7-2a18-474c-9766-fc8ae8d8d40d" height="30%" controls autoplay loop muted></video></td>
</tr>
</table>
### Supported Camera Movements | 支持的运镜方式
| Camera Movement | 运镜方式 |
|--------------------------------|--------------------|
| **Fixed Camera** | 固定镜头 |
| **Pan Up/Down/Left/Right** | 镜头上/下/左/右移 |
| **Tilt Up/Down/Left/Right** | 镜头上/下/左/右摇 |
| **Zoom In/Out** | 镜头放大/缩小 |
| **Dolly In/Out** | 镜头推进/拉远 |
| **Camera Rotation** | 镜头旋转 |
| **Tracking Shot** | 镜头跟随 |
| **Orbit Shot** | 镜头环绕 |
| **Rack Focus** | 焦点转移 |
🔧 Motion Score Considerations:
motion_score = 5 or 10 offers smoother and more accurate motion than motion_score = 2, with motion_score = 10 providing the best responsiveness and camera tracking. Choosing the suitable setting enhances motion precision and fluidity.
## Anime-Style Generation
<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
<tr>
<th style="width: 33%;">女生向前行走,背景是虚化模糊的效果</th>
<th style="width: 33%;">女人眨眼,然后对着镜头做飞吻的动作。</th>
<th style="width: 10%;">狸猫战士双手缓缓上扬,雷电从手中向四周扩散,<br>身后灵兽影像的双眼闪烁强光,</br>张开巨口发出低吼</th>
</tr>
<tr>
<td><video src="https://github.com/user-attachments/assets/80be13a1-ea65-45c5-b7f4-c2488acbf2a3" height="33%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/67038b85-19d4-4313-b386-f578b75dcad7" height="33%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/73ffd269-a5c8-4255-8809-161501273bfd" height="33%" controls autoplay loop muted></video></td>
</tr>
</table>
Step-Video-TI2V excels in anime-style generation, enabling you to explore various anime-style images and create customized videos to match your preferences.
## Table of Contents
1. [Introduction](#1-introduction)
2. [Model Summary](#2-model-summary)
3. [Model Download](#3-model-download)
4. [Model Usage](#4-model-usage)
5. [Comparisons](#5-comparisons)
6. [Online Engine](#6-online-engine)
7. [Citation](#7-citation)
## 1. Introduction
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames
based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V
with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the
image-to-video generation task.
## 2. Model Summary
Step-Video-TI2V is trained based on Step-Video-T2V. To incorporate the image condition as the first frame of the generated video, we encode it into latent representations using Step-Video-T2V’s Video-VAE and concatenate them along the channel dimension of the video latent. Additionally, we introduce a motion score condition, enabling users to control the dynamic level of the video generated from the image condition.
<p align="center">
<img width="80%" src="assets/model.png">
</p>
## 3. Model Download
| Models | 🤗 Huggingface | 🤖 Modelscope | 🎛️ ComfyUI |
|:------------------:|:--------------:|:-------------:|:-----------------:|
| Step-Video-TI2V | [Download](https://huggingface.co/stepfun-ai/stepvideo-ti2v) | [Download](https://modelscope.cn/models/stepfun-ai/stepvideo-ti2v) | [Link](https://github.com/stepfun-ai/ComfyUI-StepVideo) |
## 4. Model Usage
### 📜 4.1 Dependencies and Installation
```bash
git clone https://github.com/stepfun-ai/Step-Video-TI2V.git
conda create -n stepvideo python=3.10
conda activate stepvideo
cd Step-Video-TI2V
pip install -e .
```
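A quick sanity check that the installation succeeded (assuming the package is importable as `stepvideo`, as the imports in `run_parallel.py` suggest):
```bash
python -c "import stepvideo, torch; print(torch.__version__, torch.cuda.is_available())"
```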
### 🚀 4.2. Inference Scripts
```bash
python api/call_remote_server.py --model_dir where_you_download_dir & ## We assume you have more than 4 GPUs available. This command will return the URL for both the caption API and the VAE API. Please use the returned URL in the following command.
parallel=4  # or parallel=1 / parallel=8; a single GPU can also generate results, although it will take longer
url='127.0.0.1'
model_dir=where_you_download_dir
torchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url --ulysses_degree $parallel --prompt "笑起来" --first_image_path ./assets/demo.png --infer_steps 50 --cfg_scale 9.0 --time_shift 13.0 --motion_score 5.0
```
We list some more useful configurations for easy usage:
| Argument | Default | Description |
|:----------------------:|:---------:|:-----------------------------------------:|
| `--model_dir` | None | The model checkpoint for video generation |
| `--prompt` | “笑起来” | The text prompt for I2V generation |
| `--first_image_path`   | ./assets/demo.png | The reference image path for the I2V task |
| `--infer_steps`         | 50        | The number of sampling steps              |
| `--cfg_scale`           | 9.0       | Embedded classifier-free guidance scale   |
| `--time_shift`          | 7.0       | Shift factor for flow-matching schedulers |
| `--motion_score`        | 5.0       | Score controlling the motion level of the video |
| `--seed`                | None      | The random seed for video generation; if None, a random seed is used |
| `--use-cpu-offload` | False | Use CPU offload for the model load to save more memory, necessary for high-res video generation |
| `--save-path` | ./results | Path to save the generated video |
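Putting these arguments together, an example invocation might look like the following (flag spellings as listed in the table above; values are illustrative):
```bash
torchrun --nproc_per_node 4 run_parallel.py \
    --model_dir where_you_download_dir --vae_url 127.0.0.1 --caption_url 127.0.0.1 \
    --ulysses_degree 4 --prompt "笑起来" --first_image_path ./assets/demo.png \
    --infer_steps 50 --cfg_scale 9.0 --time_shift 13.0 --motion_score 5.0 \
    --seed 42 --save-path ./results
```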
## 5. Comparisons
To evaluate the performance of Step-Video-TI2V, we leverage [VBench-I2V](https://arxiv.org/html/2411.13503v1) to systematically compare Step-Video-TI2V with recently released leading open-source models. The detailed results, presented in the table below, highlight our model’s superior performance over these models. We present two results for Step-Video-TI2V, with the motion score set to 5 and 10, respectively. As expected, this mechanism effectively balances the motion dynamics and stability (or consistency) of the generated videos. Additionally, we submitted our results to the [VBench-I2V leaderboard](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard), where Step-Video-TI2V achieved the top-ranking position.
We also introduce a new benchmark dataset, [Step-Video-TI2V-Eval](https://github.com/stepfun-ai/Step-Video-TI2V/tree/main/benchmark/Step-Video-TI2V-Eval), specifically designed for the TI2V task to support future research and evaluation. The dataset includes 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios.
<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
<tr>
<th style="width: 20%;">Scores</th>
<th style="width: 20%;">Step-Video-TI2V (motion=10)</th>
<th style="width: 20%;">Step-Video-TI2V (motion=5)</th>
<th style="width: 20%;">OSTopA</th>
<th style="width: 20%;">OSTopB</th>
</tr>
<tr>
<td><strong>Total Score</strong></td>
<td style="background-color: lightgreen;"><strong>87.98</strong></td>
<td>87.80</td>
<td>87.49</td>
<td>86.77</td>
</tr>
<tr>
<td><strong>I2V Score</strong></td>
<td>95.11</td>
<td style="background-color: lightgreen;"><strong>95.50</strong></td>
<td>94.63</td>
<td>93.25</td>
</tr>
<tr>
<td>Video-Text Camera Motion</td>
<td>48.15</td>
<td style="background-color: lightgreen;"><strong>49.22</strong></td>
<td>29.58</td>
<td>46.45</td>
</tr>
<tr>
<td>Video-Image Subject Consistency</td>
<td>97.44</td>
<td style="background-color: lightgreen;"><strong>97.85</strong></td>
<td>97.73</td>
<td>95.88</td>
</tr>
<tr>
<td>Video-Image Background Consistency</td>
<td>98.45</td>
<td>98.63</td>
<td style="background-color: lightgreen;"><strong>98.83</strong></td>
<td>96.47</td>
</tr>
<tr>
<td><strong>Quality Score</strong></td>
<td style="background-color: lightgreen;"><strong>80.86</strong></td>
<td>80.11</td>
<td>80.36</td>
<td>80.28</td>
</tr>
<tr>
<td>Subject Consistency</td>
<td>95.62</td>
<td style="background-color: lightgreen;"><strong>96.02</strong></td>
<td>94.52</td>
<td style="background-color: lightgreen;"><strong>96.28</strong></td>
</tr>
<tr>
<td>Background Consistency</td>
<td>96.92</td>
<td>97.06</td>
<td>96.47</td>
<td style="background-color: lightgreen;"><strong>97.38</strong></td>
</tr>
<tr>
<td>Motion Smoothness</td>
<td>99.08</td>
<td style="background-color: lightgreen;"><strong>99.24</strong></td>
<td>98.09</td>
<td>99.10</td>
</tr>
<tr>
<td>Dynamic Degree</td>
<td>48.78</td>
<td>36.58</td>
<td style="background-color: lightgreen;"><strong>53.41</strong></td>
<td>38.13</td>
</tr>
<tr>
<td>Aesthetic Quality</td>
<td>61.74</td>
<td style="background-color: lightgreen;"><strong>62.29</strong></td>
<td>61.04</td>
<td>61.82</td>
</tr>
<tr>
<td>Imaging Quality</td>
<td>70.17</td>
<td>70.43</td>
<td style="background-color: lightgreen;"><strong>71.12</strong></td>
<td>70.82</td>
</tr>
</table>
![figure1](assets/vbench.png "figure1")
## 6. Online Engine
The online version of Step-Video-TI2V is available on [跃问视频](https://yuewen.cn/videos), where you can also explore some impressive examples.
## 7. Citation
```
@misc{huang2025step,
title={Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model},
author={Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen
Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao
Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun,
Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong,
Jiaxin He, Jianchang Wu, Jianlong Yuan, Jie Wu, Jiashuai Liu, Junjing Guo,
Kaijun Tan, Liangyu Chen, Qiaohui Chen, Ran Sun, Shanshan Yuan, Shengming
Yin, Sitong Liu, Wei Chen, Yaqi Dai, Yuchu Luo, Zheng Ge, Zhisheng Guan,
Xiaoniu Song, Yu Zhou, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou,
Xiangyu Zhang, Yi Xiu, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang},
year={2025},
eprint={2503.11251},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.11251},
}
```
```
@misc{ma2025stepvideot2vtechnicalreportpractice,
title={Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model},
author={Guoqing Ma and Haoyang Huang and Kun Yan and Liangyu Chen and Nan Duan and Shengming Yin and Changyi Wan and Ranchen Ming and Xiaoniu Song and Xing Chen and Yu Zhou and Deshan Sun and Deyu Zhou and Jian Zhou and Kaijun Tan and Kang An and Mei Chen and Wei Ji and Qiling Wu and Wen Sun and Xin Han and Yanan Wei and Zheng Ge and Aojie Li and Bin Wang and Bizhu Huang and Bo Wang and Brian Li and Changxing Miao and Chen Xu and Chenfei Wu and Chenguang Yu and Dapeng Shi and Dingyuan Hu and Enle Liu and Gang Yu and Ge Yang and Guanzhe Huang and Gulin Yan and Haiyang Feng and Hao Nie and Haonan Jia and Hanpeng Hu and Hanqi Chen and Haolong Yan and Heng Wang and Hongcheng Guo and Huilin Xiong and Huixin Xiong and Jiahao Gong and Jianchang Wu and Jiaoren Wu and Jie Wu and Jie Yang and Jiashuai Liu and Jiashuo Li and Jingyang Zhang and Junjing Guo and Junzhe Lin and Kaixiang Li and Lei Liu and Lei Xia and Liang Zhao and Liguo Tan and Liwen Huang and Liying Shi and Ming Li and Mingliang Li and Muhua Cheng and Na Wang and Qiaohui Chen and Qinglin He and Qiuyan Liang and Quan Sun and Ran Sun and Rui Wang and Shaoliang Pang and Shiliang Yang and Sitong Liu and Siqi Liu and Shuli Gao and Tiancheng Cao and Tianyu Wang and Weipeng Ming and Wenqing He and Xu Zhao and Xuelin Zhang and Xianfang Zeng and Xiaojia Liu and Xuan Yang and Yaqi Dai and Yanbo Yu and Yang Li and Yineng Deng and Yingming Wang and Yilei Wang and Yuanwei Lu and Yu Chen and Yu Luo and Yuchu Luo and Yuhe Yin and Yuheng Feng and Yuxiang Yang and Zecheng Tang and Zekai Zhang and Zidong Yang and Binxing Jiao and Jiansheng Chen and Jing Li and Shuchang Zhou and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu and Heung-Yeung Shum and Daxin Jiang},
year={2025},
eprint={2502.10248},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.10248},
}
```
import torch
import os
from flask import Flask, Response, jsonify, request, Blueprint
from flask_restful import Api, Resource
import pickle
import argparse
import threading

device = f'cuda:{torch.cuda.device_count()-1}'
torch.cuda.set_device(device)
dtype = torch.bfloat16


def parsed_args():
    parser = argparse.ArgumentParser(description="StepVideo API Functions")
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--clip_dir', type=str, default='hunyuan_clip')
    parser.add_argument('--llm_dir', type=str, default='step_llm')
    parser.add_argument('--vae_dir', type=str, default='vae')
    parser.add_argument('--port', type=str, default='8080')
    args = parser.parse_args()
    return args


class StepVaePipeline(Resource):
    def __init__(self, vae_dir, version=2):
        self.vae = self.build_vae(vae_dir, version)
        self.scale_factor = 1.0

    def build_vae(self, vae_dir, version=2):
        from stepvideo.vae.vae import AutoencoderKL
        (model_name, z_channels) = ("vae_v2.safetensors", 64) if version == 2 else ("vae.safetensors", 16)
        model_path = os.path.join(vae_dir, model_name)
        model = AutoencoderKL(
            z_channels=z_channels,
            model_path=model_path,
            version=version,
        ).to(dtype).to(device).eval()
        print("Initialized vae...")
        return model

    def decode(self, samples, *args, **kwargs):
        with torch.no_grad():
            try:
                dtype = next(self.vae.parameters()).dtype
                device = next(self.vae.parameters()).device
                samples = self.vae.decode(samples.to(dtype).to(device) / self.scale_factor)
                if hasattr(samples, 'sample'):
                    samples = samples.sample
                return samples
            except Exception as err:
                print(f"vae decode error: {err}")
                torch.cuda.empty_cache()
                return None

    def encode(self, videos, *args, **kwargs):
        with torch.no_grad():
            try:
                dtype = next(self.vae.parameters()).dtype
                device = next(self.vae.parameters()).device
                latents = self.vae.encode(videos.to(dtype).to(device)) * self.scale_factor
                if hasattr(latents, 'sample'):
                    latents = latents.sample
                return latents
            except Exception as err:
                print(f"vae encode error: {err}")
                torch.cuda.empty_cache()
                return None


# A single module-level lock serializes GPU access across the API endpoints.
lock = threading.Lock()


class VAEapi(Resource):
    def __init__(self, vae_pipeline):
        self.vae_pipeline = vae_pipeline

    def get(self):
        with lock:
            try:
                feature = pickle.loads(request.get_data())
                feature['api'] = 'vae'
                feature = {k: v for k, v in feature.items() if v is not None}
                video_latents = self.vae_pipeline.decode(**feature)
                response = pickle.dumps(video_latents)
            except Exception as e:
                print("Caught Exception: ", e)
                return Response(e)
            return Response(response)


class VAEEncodeapi(Resource):
    def __init__(self, vae_pipeline):
        self.vae_pipeline = vae_pipeline

    def get(self):
        with lock:
            try:
                feature = pickle.loads(request.get_data())
                feature['api'] = 'vae-encode'
                feature = {k: v for k, v in feature.items() if v is not None}
                video_latents = self.vae_pipeline.encode(**feature)
                response = pickle.dumps(video_latents)
            except Exception as e:
                print("Caught Exception: ", e)
                return Response(e)
            return Response(response)


class CaptionPipeline(Resource):
    def __init__(self, llm_dir, clip_dir):
        self.text_encoder = self.build_llm(llm_dir)
        self.clip = self.build_clip(clip_dir)

    def build_llm(self, model_dir):
        from stepvideo.text_encoder.stepllm import STEP1TextEncoder
        text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
        print("Initialized text encoder...")
        return text_encoder

    def build_clip(self, model_dir):
        from stepvideo.text_encoder.clip import HunyuanClip
        clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
        print("Initialized clip encoder...")
        return clip

    def embedding(self, prompts, *args, **kwargs):
        with torch.no_grad():
            try:
                y, y_mask = self.text_encoder(prompts)
                clip_embedding, _ = self.clip(prompts)
                len_clip = clip_embedding.shape[1]
                y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)  ## pad attention_mask with clip's length
                data = {
                    'y': y.detach().cpu(),
                    'y_mask': y_mask.detach().cpu(),
                    'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
                }
                return data
            except Exception as err:
                print(f"{err}")
                return None


class Captionapi(Resource):
    def __init__(self, caption_pipeline):
        self.caption_pipeline = caption_pipeline

    def get(self):
        with lock:
            try:
                feature = pickle.loads(request.get_data())
                feature['api'] = 'caption'
                feature = {k: v for k, v in feature.items() if v is not None}
                embeddings = self.caption_pipeline.embedding(**feature)
                response = pickle.dumps(embeddings)
            except Exception as e:
                print("Caught Exception: ", e)
                return Response(e)
            return Response(response)


class RemoteServer(object):
    def __init__(self, args) -> None:
        self.app = Flask(__name__)
        root = Blueprint("root", __name__)
        self.app.register_blueprint(root)
        api = Api(self.app)

        self.vae_pipeline = StepVaePipeline(
            vae_dir=os.path.join(args.model_dir, args.vae_dir)
        )
        api.add_resource(
            VAEapi,
            "/vae-api",
            resource_class_args=[self.vae_pipeline],
        )
        api.add_resource(
            VAEEncodeapi,
            "/vae-encode-api",
            resource_class_args=[self.vae_pipeline],
        )

        self.caption_pipeline = CaptionPipeline(
            llm_dir=os.path.join(args.model_dir, args.llm_dir),
            clip_dir=os.path.join(args.model_dir, args.clip_dir)
        )
        api.add_resource(
            Captionapi,
            "/caption-api",
            resource_class_args=[self.caption_pipeline],
        )

    def run(self, host="0.0.0.0", port=8080):
        self.app.run(host, port=port, threaded=True, debug=False)


if __name__ == "__main__":
    args = parsed_args()
    flask_server = RemoteServer(args)
    flask_server.run(host="0.0.0.0", port=args.port)
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
#!/bin/bash
cp modified/config.py /usr/local/lib/python3.10/dist-packages/xfuser/config/
cp modified/envs.py /usr/local/lib/python3.10/dist-packages/xfuser/
# Unique model identifier
modelCode=1547
# Model name
modelName=Step-Video-TI2V_pytorch
# Model description
modelDescription=Step-Video-TI2V is a text-driven image-to-video generation model with 30B parameters that can generate videos of up to 102 frames from text and image inputs.
# Application scenarios
appScenario=Inference, video generation, film and TV, e-commerce, education, broadcast media
# Framework type
frameType=Pytorch
import os
import torch
import torch.distributed as dist
from packaging import version
from dataclasses import dataclass, fields
from xfuser.logger import init_logger
import xfuser.envs as envs
# from xfuser.envs import CUDA_VERSION, TORCH_VERSION, PACKAGES_CHECKER
from xfuser.envs import TORCH_VERSION, PACKAGES_CHECKER
logger = init_logger(__name__)
from typing import Union, Optional, List
env_info = PACKAGES_CHECKER.get_packages_info()
HAS_LONG_CTX_ATTN = env_info["has_long_ctx_attn"]
HAS_FLASH_ATTN = env_info["has_flash_attn"]
def check_packages():
import diffusers
if not version.parse(diffusers.__version__) > version.parse("0.30.2"):
raise RuntimeError(
"This project requires diffusers version > 0.30.2. Currently, you can not install a correct version of diffusers by pip install."
"Please install it from source code!"
)
def check_env():
# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/cudagraph.html
#if CUDA_VERSION < version.parse("11.3"):
# raise RuntimeError("NCCL CUDA Graph support requires CUDA 11.3 or above")
if TORCH_VERSION < version.parse("2.2.0"):
# https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
raise RuntimeError(
"CUDAGraph with NCCL support requires PyTorch 2.2.0 or above. "
"If it is not released yet, please install nightly built PyTorch "
"with `pip3 install --pre torch torchvision torchaudio --index-url "
"https://download.pytorch.org/whl/nightly/cu121`"
)
@dataclass
class ModelConfig:
model: str
download_dir: Optional[str] = None
trust_remote_code: bool = False
@dataclass
class RuntimeConfig:
warmup_steps: int = 1
dtype: torch.dtype = torch.float16
use_cuda_graph: bool = False
use_parallel_vae: bool = False
use_profiler: bool = False
use_torch_compile: bool = False
use_onediff: bool = False
use_fp8_t5_encoder: bool = False
def __post_init__(self):
check_packages()
if self.use_cuda_graph:
check_env()
@dataclass
class FastAttnConfig:
use_fast_attn: bool = False
n_step: int = 20
n_calib: int = 8
threshold: float = 0.5
window_size: int = 64
coco_path: Optional[str] = None
use_cache: bool = False
def __post_init__(self):
assert self.n_calib > 0, "n_calib must be greater than 0"
assert self.threshold > 0.0, "threshold must be greater than 0"
@dataclass
class DataParallelConfig:
dp_degree: int = 1
use_cfg_parallel: bool = False
world_size: int = 1
def __post_init__(self):
assert self.dp_degree >= 1, "dp_degree must greater than or equal to 1"
# set classifier_free_guidance_degree parallel for split batch
if self.use_cfg_parallel:
self.cfg_degree = 2
else:
self.cfg_degree = 1
assert self.dp_degree * self.cfg_degree <= self.world_size, (
"dp_degree * cfg_degree must be less than or equal to "
"world_size because of classifier free guidance"
)
assert (
self.world_size % (self.dp_degree * self.cfg_degree) == 0
), "world_size must be divisible by dp_degree * cfg_degree"
@dataclass
class SequenceParallelConfig:
ulysses_degree: Optional[int] = None
ring_degree: Optional[int] = None
world_size: int = 1
def __post_init__(self):
if self.ulysses_degree is None:
self.ulysses_degree = 1
logger.info(
f"Ulysses degree not set, " f"using default value {self.ulysses_degree}"
)
if self.ring_degree is None:
self.ring_degree = 1
logger.info(
f"Ring degree not set, " f"using default value {self.ring_degree}"
)
self.sp_degree = self.ulysses_degree * self.ring_degree
if not HAS_LONG_CTX_ATTN and self.sp_degree > 1:
raise ImportError(
f"Sequence Parallel kit 'yunchang' not found but "
f"sp_degree is {self.sp_degree}, please set it "
f"to 1 or install 'yunchang' to use it"
)
@dataclass
class TensorParallelConfig:
tp_degree: int = 1
split_scheme: Optional[str] = "row"
world_size: int = 1
def __post_init__(self):
assert self.tp_degree >= 1, "tp_degree must be greater than or equal to 1"
assert (
self.tp_degree <= self.world_size
), "tp_degree must be less than or equal to world_size"
@dataclass
class PipeFusionParallelConfig:
pp_degree: int = 1
num_pipeline_patch: Optional[int] = None
attn_layer_num_for_pp: Optional[List[int]] = (None,)
world_size: int = 1
def __post_init__(self):
assert (
self.pp_degree is not None and self.pp_degree >= 1
), "pipefusion_degree must be set and greater than 1 to use pipefusion"
assert (
self.pp_degree <= self.world_size
), "pipefusion_degree must be less than or equal to world_size"
if self.num_pipeline_patch is None:
self.num_pipeline_patch = self.pp_degree
logger.info(
f"Pipeline patch number not set, "
f"using default value {self.pp_degree}"
)
if self.attn_layer_num_for_pp is not None:
logger.info(
f"attn_layer_num_for_pp set, splitting attention layers"
f"to {self.attn_layer_num_for_pp}"
)
assert len(self.attn_layer_num_for_pp) == self.pp_degree, (
"attn_layer_num_for_pp must have the same "
"length as pp_degree if not None"
)
if self.pp_degree == 1 and self.num_pipeline_patch > 1:
logger.warning(
f"Pipefusion degree is 1, pipeline will not be used,"
f"num_pipeline_patch will be ignored"
)
self.num_pipeline_patch = 1
@dataclass
class ParallelConfig:
dp_config: DataParallelConfig
sp_config: SequenceParallelConfig
pp_config: PipeFusionParallelConfig
tp_config: TensorParallelConfig
world_size: int = 1 # FIXME: remove this
worker_cls: str = "xfuser.ray.worker.worker.Worker"
def __post_init__(self):
assert self.tp_config is not None, "tp_config must be set"
assert self.dp_config is not None, "dp_config must be set"
assert self.sp_config is not None, "sp_config must be set"
assert self.pp_config is not None, "pp_config must be set"
parallel_world_size = (
self.dp_config.dp_degree
* self.dp_config.cfg_degree
* self.sp_config.sp_degree
* self.tp_config.tp_degree
* self.pp_config.pp_degree
)
world_size = self.world_size
assert parallel_world_size == world_size, (
f"parallel_world_size {parallel_world_size} "
f"must be equal to world_size {self.world_size}"
)
assert (
world_size % (self.dp_config.dp_degree * self.dp_config.cfg_degree) == 0
), "world_size must be divisible by dp_degree * cfg_degree"
assert (
world_size % self.pp_config.pp_degree == 0
), "world_size must be divisible by pp_degree"
assert (
world_size % self.sp_config.sp_degree == 0
), "world_size must be divisible by sp_degree"
assert (
world_size % self.tp_config.tp_degree == 0
), "world_size must be divisible by tp_degree"
self.dp_degree = self.dp_config.dp_degree
self.cfg_degree = self.dp_config.cfg_degree
self.sp_degree = self.sp_config.sp_degree
self.pp_degree = self.pp_config.pp_degree
self.tp_degree = self.tp_config.tp_degree
self.ulysses_degree = self.sp_config.ulysses_degree
self.ring_degree = self.sp_config.ring_degree
@dataclass(frozen=True)
class EngineConfig:
model_config: ModelConfig
runtime_config: RuntimeConfig
parallel_config: ParallelConfig
fast_attn_config: FastAttnConfig
def __post_init__(self):
world_size = self.parallel_config.world_size
if self.fast_attn_config.use_fast_attn:
assert self.parallel_config.dp_degree == world_size, f"world_size must be equal to dp_degree when using DiTFastAttn"
def to_dict(self):
"""Return the configs as a dictionary, for use in **kwargs."""
return dict((field.name, getattr(self, field.name)) for field in fields(self))
@dataclass
class InputConfig:
height: int = 1024
width: int = 1024
num_frames: int = 49
use_resolution_binning: bool = (True,)
batch_size: Optional[int] = None
img_file_path: Optional[str] = None
prompt: Union[str, List[str]] = ""
negative_prompt: Union[str, List[str]] = ""
num_inference_steps: int = 20
max_sequence_length: int = 256
seed: int = 42
output_type: str = "pil"
def __post_init__(self):
if isinstance(self.prompt, list):
assert (
len(self.prompt) == len(self.negative_prompt)
or len(self.negative_prompt) == 0
), "prompts and negative_prompts must have the same quantities"
self.batch_size = self.batch_size or len(self.prompt)
else:
self.batch_size = self.batch_size or 1
assert self.output_type in [
"pil",
"latent",
"pt",
], "output_pil must be either 'pil' or 'latent'"
import os
import torch
import diffusers
from typing import TYPE_CHECKING, Any, Callable, Dict, Optional
from packaging import version
from xfuser.logger import init_logger
logger = init_logger(__name__)
if TYPE_CHECKING:
MASTER_ADDR: str = ""
MASTER_PORT: Optional[int] = None
CUDA_HOME: Optional[str] = None
LOCAL_RANK: int = 0
CUDA_VISIBLE_DEVICES: Optional[str] = None
XDIT_LOGGING_LEVEL: str = "INFO"
CUDA_VERSION: version.Version
TORCH_VERSION: version.Version
environment_variables: Dict[str, Callable[[], Any]] = {
# ================== Runtime Env Vars ==================
# used in distributed environment to determine the master address
"MASTER_ADDR": lambda: os.getenv("MASTER_ADDR", ""),
# used in distributed environment to manually set the communication port
"MASTER_PORT": lambda: (
int(os.getenv("MASTER_PORT", "0")) if "MASTER_PORT" in os.environ else None
),
# path to cudatoolkit home directory, under which should be bin, include,
# and lib directories.
"CUDA_HOME": lambda: os.environ.get("CUDA_HOME", None),
# local rank of the process in the distributed setting, used to determine
# the GPU device id
"LOCAL_RANK": lambda: int(os.environ.get("LOCAL_RANK", "0")),
# used to control the visible devices in the distributed setting
"CUDA_VISIBLE_DEVICES": lambda: os.environ.get("CUDA_VISIBLE_DEVICES", None),
# this is used for configuring the default logging level
"XDIT_LOGGING_LEVEL": lambda: os.getenv("XDIT_LOGGING_LEVEL", "INFO"),
}
variables: Dict[str, Callable[[], Any]] = {
# ================== Other Vars ==================
# used in version checking
# "CUDA_VERSION": lambda: version.parse(torch.version.cuda),
"CUDA_VERSION": "gfx928",
"TORCH_VERSION": lambda: version.parse(
version.parse(torch.__version__).base_version
),
}
class PackagesEnvChecker:
_instance = None
def __new__(cls):
if cls._instance is None:
cls._instance = super(PackagesEnvChecker, cls).__new__(cls)
cls._instance.initialize()
return cls._instance
def initialize(self):
self.packages_info = {
"has_flash_attn": self.check_flash_attn(),
"has_long_ctx_attn": self.check_long_ctx_attn(),
"diffusers_version": self.check_diffusers_version(),
}
def check_flash_attn(self):
try:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gpu_name = torch.cuda.get_device_name(device)
if "Turing" in gpu_name or "Tesla" in gpu_name or "T4" in gpu_name:
return False
else:
from flash_attn import flash_attn_func
from flash_attn import __version__
if version.parse(__version__) < version.parse("2.6.0"):
raise ImportError(f"install flash_attn >= 2.6.0")
return True
except ImportError:
logger.warning(
f'Flash Attention library "flash_attn" not found, '
f"using pytorch attention implementation"
)
return False
def check_long_ctx_attn(self):
try:
from yunchang import (
set_seq_parallel_pg,
ring_flash_attn_func,
UlyssesAttention,
LongContextAttention,
LongContextAttentionQKVPacked,
)
return True
except ImportError:
logger.warning(
f'Ring Flash Attention library "yunchang" not found, '
f"using pytorch attention implementation"
)
return False
def check_diffusers_version(self):
if version.parse(
version.parse(diffusers.__version__).base_version
) < version.parse("0.30.0"):
raise RuntimeError(
f"Diffusers version: {version.parse(version.parse(diffusers.__version__).base_version)} is not supported,"
f"please upgrade to version > 0.30.0"
)
return version.parse(version.parse(diffusers.__version__).base_version)
def get_packages_info(self):
return self.packages_info
PACKAGES_CHECKER = PackagesEnvChecker()
def __getattr__(name):
# lazy evaluation of environment variables
if name in environment_variables:
return environment_variables[name]()
if name in variables:
return variables[name]()
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
def __dir__():
return list(environment_variables.keys())
parallel=1 # or parallel=8; a single GPU can also generate results, although it will take longer
url='127.0.0.1'
model_dir=/home/luopl1/stepfun-ai/stepvideo-ti2v/
torchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url --ulysses_degree $parallel --prompt "笑起来" --first_image_path ./assets/demo.png --infer_steps 20 --width 400 --height 400 --cfg_scale 9.0 --time_shift 13.0 --motion_score 5.0
from stepvideo.diffusion.video_pipeline import StepVideoPipeline
import torch.distributed as dist
import torch
from stepvideo.config import parse_args
from stepvideo.parallel import initialize_parall_group, get_parallel_group
from stepvideo.utils import setup_seed


if __name__ == "__main__":
    args = parse_args()

    # Set up the sequence-parallel process group (Ulysses/Ring attention degrees).
    initialize_parall_group(ring_degree=args.ring_degree, ulysses_degree=args.ulysses_degree)

    local_rank = get_parallel_group().local_rank
    device = torch.device(f"cuda:{local_rank}")

    setup_seed(args.seed)

    # Load the pipeline on CPU first, then move only the transformer to the device;
    # text encoding and VAE decoding are delegated to the remote server.
    pipeline = StepVideoPipeline.from_pretrained(args.model_dir).to(dtype=torch.bfloat16, device="cpu")
    pipeline.transformer = pipeline.transformer.to(device)
    pipeline.setup_pipeline(args)

    prompt = args.prompt
    videos = pipeline(
        prompt=prompt,
        first_image=args.first_image_path,
        num_frames=args.num_frames,
        height=args.height,
        width=args.width,
        num_inference_steps=args.infer_steps,
        guidance_scale=args.cfg_scale,
        time_shift=args.time_shift,
        pos_magic=args.pos_magic,
        neg_magic=args.neg_magic,
        output_file_name=args.output_file_name or prompt[:50],
        motion_score=args.motion_score,
    )

    dist.destroy_process_group()