"Initial commit"

Commit 09d54a38 authored May 08, 2025 by luopl
20 changed files
--- a/.gitignore
+++ b/.gitignore
+tests
+debug
+dev
+*.egg-info
+__pycache__
+lib
+results
\ No newline at end of file
--- a/LICENSE
+++ b/LICENSE
+MIT License
+Copyright (c) 2025 stepfun-ai
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/README.md
+++ b/README.md
+# Step-Video-T2V
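+## Paper
+`
+Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
+`
+- https://arxiv.org/abs/2502.10248
+## Model Architecture
+The overall architecture of Step-Video-T2V is shown below. It is a state-of-the-art text-to-video pre-trained model with 30 billion parameters, capable of generating videos of up to 204 frames.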
+<div align=center>
+    <img src="./assets/model_architecture.png"/>
+</div>
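+## Algorithm
+Step-Video-T2V is a diffusion Transformer (DiT) based model trained with Flow Matching. Its design is as follows:
+- A deep-compression variational autoencoder, Video-VAE, designed for video generation tasks, achieves 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality.
+- User prompts are encoded with two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained with Flow Matching and used to denoise input noise into latent frames.
+- A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos.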
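To make the compression ratios above concrete, the following sketch computes the nominal latent grid for a given video size. It is idealised arithmetic only: rounding up with `math.ceil` and the helper name `latent_shape` are assumptions for illustration, and the real Video-VAE may chunk frames differently.

```python
import math


def latent_shape(frames, height, width, spatial=16, temporal=8):
    """Nominal latent grid (frames, height, width) under the stated ratios.

    Assumption: each axis is simply divided by its compression ratio and
    rounded up; this is not taken from the Video-VAE implementation.
    """
    return (
        math.ceil(frames / temporal),
        math.ceil(height / spatial),
        math.ceil(width / spatial),
    )


shape = latent_shape(204, 544, 992)
pixel_positions = 204 * 544 * 992
latent_positions = shape[0] * shape[1] * shape[2]
print(shape, round(pixel_positions / latent_positions))
```

The ratio printed at the end is close to 16 × 16 × 8 = 2048 positions per latent cell, which is why the DiT can attend over a full video with 3D attention.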
+## Environment Setup
+### Docker (Method 1)
+```
+docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
+# Replace <your IMAGE ID> with the ID of the Docker image pulled above
+docker run -it --name T2V_test --shm-size=1024G  --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/Step-Video-T2V_pytorch:/home/Step-Video-T2V_pytorch <your IMAGE ID> /bin/bash
+cd /home/Step-Video-T2V_pytorch
+pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
+pip uninstall asyncio
+sh fix.sh
+```
+### Dockerfile (Method 2)
+```
+cd /home/Step-Video-T2V_pytorch/docker
+docker build --no-cache -t step-video-t2v:latest .
+docker run -it --name T2V_test --shm-size=1024G  --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/Step-Video-T2V_pytorch:/home/Step-Video-T2V_pytorch step-video-t2v:latest /bin/bash
+cd /home/Step-Video-T2V_pytorch
+pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
+pip uninstall asyncio
+sh fix.sh
+```
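+### Anaconda (Method 3)
+1. The special deep-learning libraries that this project requires for DCU cards can be downloaded and installed from the Guanghe developer community (光合开发者社区):
+- https://developer.hpccube.com/tool/
+```
+DTK driver: dtk24.04.3
+python: python3.10
+torch: 2.3.0
+torchvision: 0.18.1
+triton: 2.1.0
+flash-attn: 2.6.1
+```
+`Tips: the versions of the DTK driver, Python, torch, and the other DCU-related tools above must match one another exactly.`
+2. Install the remaining ordinary libraries according to requirements.txt:
+```
+cd /home/Step-Video-T2V_pytorch
+pip install -e . -i https://mirrors.aliyun.com/pypi/simple
+pip uninstall asyncio
+sh fix.sh
+```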
+## Dataset
+`None`
+## Training
+`None`
+## Inference
+Directory layout of the pre-trained weights:
+```
+/home/Step-Video-T2V_pytorch
+    └── stepfun-ai/stepvideo-t2v
+```
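+### Single Node, Multiple Cards
+```
+# Adjust TORCH_CUDA_ARCH_LIST to match your DCU architecture
+export TORCH_CUDA_ARCH_LIST="8.0"
+# Replace where_you_download_dir with the path to your model
+HIP_VISIBLE_DEVICES=0 python api/call_remote_server.py --model_dir where_you_download_dir &
+# To avoid running out of device memory, run the server and the client on different cards where possible; the other parameters in run.sh can also be adjusted to your hardware
+export HIP_VISIBLE_DEVICES=1,2
+sh run.sh
+```
+For more information, see [`README_orgin`](./README_orgin.md) from the upstream project.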
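+## Result
+Example of a generated video:
+![infer result](./assets/一名宇航员在月球上.mp4)
+### Accuracy
+`None`
+## Application Scenarios
+### Algorithm Category
+`Video generation`
+### Key Application Industries
+`Film and television, e-commerce, education, broadcast media`
+## Pre-trained Weights
+The weights can be downloaded from Hugging Face:
+- [stepfun-ai/stepvideo-t2v](https://huggingface.co/stepfun-ai/stepvideo-t2v)
+`Note: downloading through a mirror is recommended: export HF_ENDPOINT=https://hf-mirror.com`
+## Source Repository and Issue Feedback
+- http://developer.sourcefind.cn/codes/modelzoo/Step-Video-T2V_pytorch.git
+## References
+- https://github.com/stepfun-ai/Step-Video-T2V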
--- a/README_orgin.md
+++ b/README_orgin.md
+<p align="center">
+  <img src="assets/logo.png"  height=100>
+</p>
+<div align="center">
+  <a href="https://yuewen.cn/videos"><img src="https://img.shields.io/static/v1?label=Step-Video&message=Web&color=green"></a> &ensp;
+  <a href="https://arxiv.org/abs/2502.10248"><img src="https://img.shields.io/static/v1?label=Tech Report&message=Arxiv&color=red"></a> &ensp;
+  <a href="https://x.com/StepFun_ai"><img src="https://img.shields.io/static/v1?label=X.com&message=Web&color=blue"></a> &ensp;
+</div>
+<div align="center">
+  <a href="https://huggingface.co/stepfun-ai/stepvideo-t2v"><img src="https://img.shields.io/static/v1?label=Step-Video-T2V&message=HuggingFace&color=yellow"></a> &ensp;
+  <a href="https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo"><img src="https://img.shields.io/static/v1?label=Step-Video-T2V-Turbo&message=HuggingFace&color=yellow"></a> &ensp;
+</div>
+## 🔥🔥🔥 News!!
+* Mar 17, 2025: 👋 We release the [Step-Video-TI2V](https://github.com/stepfun-ai/Step-Video-Ti2V), an image-to-video model based on Step-Video-T2V.
+* Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V. [Download](https://huggingface.co/stepfun-ai/stepvideo-t2v)
+* Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V-Turbo. [Download](https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo)
+* Feb 17, 2025: 🎉 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2502.10248)
+## Video Demos
+<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
+  <tr>
+    <td><video src="https://github.com/user-attachments/assets/9274b351-595d-41fb-aba3-f58e6e91603a" width="100%" controls autoplay loop muted></video></td>
+    <td><video src="https://github.com/user-attachments/assets/2f6b3ad5-e93b-436b-98bc-4701182d8652" width="100%" controls autoplay loop muted></video></td>
+    <td><video src="https://github.com/user-attachments/assets/67d20ee7-ad78-4b8f-80f6-3fdb00fb52d8" width="100%" controls autoplay loop muted></video></td>
+  </tr>
+  <tr>
+    <td><video src="https://github.com/user-attachments/assets/9abce409-105d-4a8a-ad13-104a98cc8a0b" width="100%" controls autoplay loop muted></video></td>
+    <td><video src="https://github.com/user-attachments/assets/8d1e1a47-048a-49ce-85f6-9d013f2d8e89" width="100%" controls autoplay loop muted></video></td>
+    <td><video src="https://github.com/user-attachments/assets/32cf4bd1-ec1f-4f77-a488-cd0284aa81bb" width="100%" controls autoplay loop muted></video></td>
+  </tr>
+  <tr>
+    <td><video src="https://github.com/user-attachments/assets/f95a7a49-032a-44ea-a10f-553d4e5d21c6" width="100%" controls autoplay loop muted></video></td>
+    <td><video src="https://github.com/user-attachments/assets/3534072e-87d9-4128-a87f-28fcb5d951e0" width="100%" controls autoplay loop muted></video></td>
+    <td><video src="https://github.com/user-attachments/assets/6d893dad-556d-4527-a882-666cba3d10e9" width="100%" controls autoplay loop muted></video></td>
+  </tr>
+</table>
+## Table of Contents
+1. [Introduction](#1-introduction)
+2. [Model Summary](#2-model-summary)
+3. [Model Download](#3-model-download)
+4. [Model Usage](#4-model-usage)
+5. [Benchmark](#5-benchmark)
+6. [Online Engine](#6-online-engine)
+7. [Citation](#7-citation)
+8. [Acknowledgement](#8-acknowledgement)
+## 1. Introduction
+We present **Step-Video-T2V**, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames. To enhance both training and inference efficiency, we propose a deep compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, **Step-Video-T2V-Eval**, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.
+## 2. Model Summary
+In Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded using two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.
+<p align="center">
+  <img width="80%" src="assets/model_architecture.png">
+</p>
+### 2.1. Video-VAE
+A deep compression Variational Autoencoder (Video-VAE) is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. This compression not only accelerates training and inference but also aligns with the diffusion process's preference for condensed representations.
+<p align="center">
+  <img width="70%" src="assets/dcvae.png">
+</p>
+### 2.2. DiT w/ 3D Full Attention
+Step-Video-T2V is built on the DiT architecture, which has 48 layers, each containing 48 attention heads, with each head’s dimension set to 128. AdaLN-Single is leveraged to incorporate the timestep condition, while QK-Norm in the self-attention mechanism is introduced to ensure training stability. Additionally, 3D RoPE is employed, playing a critical role in handling sequences of varying video lengths and resolutions.
+<p align="center">
+  <img width="80%" src="assets/dit.png">
+</p>
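The layer and head counts quoted in section 2.2 fix the hidden width of the DiT, and a back-of-the-envelope count shows how they relate to the 30-billion-parameter figure. The block composition used here (self-attention, cross-attention, and an MLP with 4x expansion) is an assumption for illustration, not something confirmed by the text; biases, norms, and embeddings are ignored.

```python
NUM_LAYERS = 48
NUM_HEADS = 48
HEAD_DIM = 128

# Hidden width follows directly from the stated head configuration.
hidden_size = NUM_HEADS * HEAD_DIM


def estimate_block_params(d, mlp_ratio=4):
    """Rough weight count for one transformer block of width d.

    Assumed layout (hypothetical): Q/K/V/output projections for
    self-attention and for cross-attention, plus a two-layer MLP.
    """
    self_attn = 4 * d * d
    cross_attn = 4 * d * d
    mlp = 2 * mlp_ratio * d * d
    return self_attn + cross_attn + mlp


total_params = NUM_LAYERS * estimate_block_params(hidden_size)
print(hidden_size, f"{total_params / 1e9:.1f}B")
```

Under these assumptions the estimate lands in the same range as the stated model size, which suggests that most of the parameters sit in the DiT blocks.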
+### 2.3. Video-DPO
+In Step-Video-T2V, we incorporate human feedback through Direct Preference Optimization (DPO) to further enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring that the generated content aligns more closely with human expectations. The overall DPO pipeline is shown below, highlighting its critical role in improving both the consistency and quality of the video generation process.
+<p align="center">
+  <img width="100%" src="assets/dpo_pipeline.png">
+</p>
+## 3. Model Download
+| Models   | 🤗Huggingface    |  🤖Modelscope |
+|:-------:|:-------:|:-------:|
+| Step-Video-T2V | [download](https://huggingface.co/stepfun-ai/stepvideo-t2v) | [download](https://www.modelscope.cn/models/stepfun-ai/stepvideo-t2v) |
+| Step-Video-T2V-Turbo (Inference Step Distillation) | [download](https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo) | [download](https://www.modelscope.cn/models/stepfun-ai/stepvideo-t2v-turbo) |
+## 4. Model Usage
+### 📜 4.1  Requirements
+The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:
+|     Model    |  height × width × frames |  Peak GPU Memory | 50 steps w/ flash-attn | 50 steps w/o flash-attn |
+|:------------:|:------------:|:------------:|:------------:|:------------:|
+| Step-Video-T2V   |        768px × 768px × 204f      |  78.55 GB | 860 s | 1437 s |
+| Step-Video-T2V   |        544px × 992px × 204f      |  77.64 GB | 743 s | 1232 s |
+| Step-Video-T2V   |        544px × 992px × 136f      |  72.48 GB | 408 s | 605 s |
+* An NVIDIA GPU with CUDA support is required.
+  * The model is tested on four GPUs.
+  * **Recommended**: We recommend using GPUs with 80GB of memory for better generation quality.
+* Tested operating system: Linux
+* The self-attention in the text encoder (step_llm) only supports CUDA capabilities sm_80, sm_86, and sm_90.
+### 🔧 4.2 Dependencies and Installation
+- Python >= 3.10.0 (we recommend using [Anaconda](https://www.anaconda.com/download/#linux) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html))
+- [PyTorch >= 2.3-cu121](https://pytorch.org/)
+- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
+- [FFmpeg](https://www.ffmpeg.org/) 
+```bash
+git clone https://github.com/stepfun-ai/Step-Video-T2V.git
+conda create -n stepvideo python=3.10
+conda activate stepvideo
+cd Step-Video-T2V
+pip install -e .
+pip install flash-attn --no-build-isolation  ## flash-attn is optional
+```
+###  🚀 4.3 Inference Scripts
+#### Multi-GPU Parallel Deployment
+- We employed a decoupling strategy for the text encoder, VAE decoding, and DiT to optimize GPU resource utilization by DiT. As a result, a dedicated GPU is needed to handle the API services for the text encoder's embeddings and VAE decoding.
+```bash
+python api/call_remote_server.py --model_dir where_you_download_dir &  ## We assume you have more than 4 GPUs available. This command will return the URL for both the caption API and the VAE API. Please use the returned URL in the following command.
+parallel=4  # or parallel=8
+url='127.0.0.1'
+model_dir=where_you_download_dir
+tp_degree=2
+ulysses_degree=2
+# make sure tp_degree x ulysses_degree = parallel
+torchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url  --ulysses_degree $ulysses_degree --tensor_parallel_degree $tp_degree --prompt "一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光" --infer_steps 50  --cfg_scale 9.0 --time_shift 13.0
+```
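The constraint noted in the command above, `tp_degree x ulysses_degree = parallel`, is easy to get wrong when changing the GPU count. A minimal sketch of a pre-flight check follows; the helper names are hypothetical, and it does not verify whether the model itself supports a given split (for example, divisibility of the attention head count).

```python
def check_parallel_layout(parallel, tp_degree, ulysses_degree):
    """Raise ValueError unless tp_degree * ulysses_degree == parallel."""
    if min(parallel, tp_degree, ulysses_degree) < 1:
        raise ValueError("all degrees must be positive integers")
    if tp_degree * ulysses_degree != parallel:
        raise ValueError(
            f"tp_degree ({tp_degree}) x ulysses_degree ({ulysses_degree}) "
            f"must equal parallel ({parallel})"
        )
    return True


def valid_layouts(parallel):
    """List every (tp_degree, ulysses_degree) pair whose product is parallel."""
    return [(tp, parallel // tp) for tp in range(1, parallel + 1) if parallel % tp == 0]


check_parallel_layout(parallel=4, tp_degree=2, ulysses_degree=2)
print(valid_layouts(8))
```

Running such a check before `torchrun` fails fast with a readable message instead of an error deep inside process-group setup.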
+#### Single-GPU Inference and Quantization
+- The open-source project DiffSynth-Studio by ModelScope offers single-GPU inference and quantization support, which can significantly reduce the VRAM required. Please refer to [their examples](https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/stepvideo) for more information.
+###  🚀 4.4 Best-Practice Inference Settings
+Step-Video-T2V exhibits robust performance in inference settings, consistently generating high-fidelity and dynamic videos. However, our experiments reveal that variations in inference hyperparameters can have a substantial effect on the trade-off between video fidelity and dynamics. To achieve optimal results, we recommend the following best practices for tuning inference parameters:
+| Models   | infer_steps   | cfg_scale  | time_shift | num_frames |
+|:-------:|:-------:|:-------:|:-------:|:-------:|
+| Step-Video-T2V | 30-50 | 9.0 |  13.0 | 204 |
+| Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |
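The recommended settings in the table above can be kept in a small lookup so that launch scripts do not hard-code them. This is only a sketch: `InferencePreset`, `PRESETS`, and `cli_args` are hypothetical names, and the flags emitted are limited to those that appear in the `run_parallel.py` command shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferencePreset:
    infer_steps: tuple  # (min, max) recommended step range
    cfg_scale: float
    time_shift: float
    num_frames: int


# Values copied from the table above.
PRESETS = {
    "Step-Video-T2V": InferencePreset((30, 50), 9.0, 13.0, 204),
    "Step-Video-T2V-Turbo": InferencePreset((10, 15), 5.0, 17.0, 204),
}


def cli_args(model, infer_steps=None):
    """Build command-line flags for a model, defaulting to the upper step bound."""
    preset = PRESETS[model]
    low, high = preset.infer_steps
    steps = high if infer_steps is None else infer_steps
    if not low <= steps <= high:
        raise ValueError(f"infer_steps for {model} should be in [{low}, {high}]")
    return [
        "--infer_steps", str(steps),
        "--cfg_scale", str(preset.cfg_scale),
        "--time_shift", str(preset.time_shift),
    ]


print(" ".join(cli_args("Step-Video-T2V", infer_steps=30)))
```

Rejecting step counts outside the recommended range is a design choice of this sketch; the model itself may well run with other values.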
+For more performance results, please refer to the [benchmark metrics](https://github.com/xdit-project/xDiT/blob/main/docs/performance/stepvideo.md) from the xDiT team.
+## 5. Benchmark
+We are releasing [Step-Video-T2V Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval) as a new benchmark, featuring 128 Chinese prompts sourced from real users. This benchmark is designed to evaluate the quality of generated videos across 11 distinct categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style.
+## 6. Online Engine
+The online version of Step-Video-T2V is available on [跃问视频](https://yuewen.cn/videos), where you can also explore some impressive examples.
+## 7. Citation
+```
+@misc{ma2025stepvideot2vtechnicalreportpractice,
+      title={Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model}, 
+      author={Guoqing Ma and Haoyang Huang and Kun Yan and Liangyu Chen and Nan Duan and Shengming Yin and Changyi Wan and Ranchen Ming and Xiaoniu Song and Xing Chen and Yu Zhou and Deshan Sun and Deyu Zhou and Jian Zhou and Kaijun Tan and Kang An and Mei Chen and Wei Ji and Qiling Wu and Wen Sun and Xin Han and Yanan Wei and Zheng Ge and Aojie Li and Bin Wang and Bizhu Huang and Bo Wang and Brian Li and Changxing Miao and Chen Xu and Chenfei Wu and Chenguang Yu and Dapeng Shi and Dingyuan Hu and Enle Liu and Gang Yu and Ge Yang and Guanzhe Huang and Gulin Yan and Haiyang Feng and Hao Nie and Haonan Jia and Hanpeng Hu and Hanqi Chen and Haolong Yan and Heng Wang and Hongcheng Guo and Huilin Xiong and Huixin Xiong and Jiahao Gong and Jianchang Wu and Jiaoren Wu and Jie Wu and Jie Yang and Jiashuai Liu and Jiashuo Li and Jingyang Zhang and Junjing Guo and Junzhe Lin and Kaixiang Li and Lei Liu and Lei Xia and Liang Zhao and Liguo Tan and Liwen Huang and Liying Shi and Ming Li and Mingliang Li and Muhua Cheng and Na Wang and Qiaohui Chen and Qinglin He and Qiuyan Liang and Quan Sun and Ran Sun and Rui Wang and Shaoliang Pang and Shiliang Yang and Sitong Liu and Siqi Liu and Shuli Gao and Tiancheng Cao and Tianyu Wang and Weipeng Ming and Wenqing He and Xu Zhao and Xuelin Zhang and Xianfang Zeng and Xiaojia Liu and Xuan Yang and Yaqi Dai and Yanbo Yu and Yang Li and Yineng Deng and Yingming Wang and Yilei Wang and Yuanwei Lu and Yu Chen and Yu Luo and Yuchu Luo and Yuhe Yin and Yuheng Feng and Yuxiang Yang and Zecheng Tang and Zekai Zhang and Zidong Yang and Binxing Jiao and Jiansheng Chen and Jing Li and Shuchang Zhou and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu and Heung-Yeung Shum and Daxin Jiang},
+      year={2025},
+      eprint={2502.10248},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2502.10248}, 
+}
+```
+## 8. Acknowledgement
+- We would like to express our sincere thanks to the [xDiT](https://github.com/xdit-project/xDiT) team for their invaluable support and parallelization strategy. 
+- Our code will be integrated into the official repository of [Huggingface/Diffusers](https://github.com/huggingface/diffusers).
+- We thank the [FastVideo](https://github.com/hao-ai-lab/FastVideo) team for their continued collaboration and look forward to launching inference acceleration solutions together in the near future.
--- a/api/call_remote_server.py
+++ b/api/call_remote_server.py
+import torch
+import os
+from flask import Flask, Response, jsonify, request, Blueprint
+from flask_restful import Api, Resource
+import pickle
+import argparse
+import threading
+device = f'cuda:{torch.cuda.device_count()-1}'
+torch.cuda.set_device(device)
+dtype = torch.bfloat16
+def parsed_args():
+    parser = argparse.ArgumentParser(description="StepVideo API Functions")
+    parser.add_argument('--model_dir', type=str)
+    parser.add_argument('--clip_dir', type=str, default='hunyuan_clip')
+    parser.add_argument('--llm_dir', type=str, default='step_llm')
+    parser.add_argument('--vae_dir', type=str, default='vae')
+    parser.add_argument('--port', type=int, default=8080)
+    args = parser.parse_args()
+    return args
+class StepVaePipeline(Resource):
+    def __init__(self, vae_dir, version=2):
+        self.vae = self.build_vae(vae_dir, version)
+        self.scale_factor = 1.0
+    def build_vae(self, vae_dir, version=2):
+        from stepvideo.vae.vae import AutoencoderKL
+        (model_name, z_channels) = ("vae_v2.safetensors", 64) if version == 2 else ("vae.safetensors", 16)
+        model_path = os.path.join(vae_dir, model_name) 
+        model = AutoencoderKL(
+            z_channels=z_channels,
+            model_path=model_path,
+            version=version,
+        ).to(dtype).to(device).eval()
+        print("Initialized vae...")
+        return model
+    def decode(self, samples, *args, **kwargs):
+        with torch.no_grad():
+            try:
+                dtype = next(self.vae.parameters()).dtype
+                device = next(self.vae.parameters()).device
+                samples = self.vae.decode(samples.to(dtype).to(device) / self.scale_factor)
+                if hasattr(samples,'sample'):
+                    samples = samples.sample
+                return samples
+            except Exception:
+                torch.cuda.empty_cache()
+                return None
+lock = threading.Lock()
+class VAEapi(Resource):
+    def __init__(self, vae_pipeline):
+        self.vae_pipeline = vae_pipeline
+    def get(self):
+        with lock:
+            try:
+                feature = pickle.loads(request.get_data())
+                feature['api'] = 'vae'
+                feature = {k:v for k, v in feature.items() if v is not None}
+                video_latents = self.vae_pipeline.decode(**feature)
+                response = pickle.dumps(video_latents)
+            except Exception as e:
+                print("Caught Exception: ", e)
+                # Response needs str/bytes; passing the exception object itself fails
+                return Response(str(e), status=500)
+            return Response(response)
+class CaptionPipeline(Resource):
+    def __init__(self, llm_dir, clip_dir):
+        self.text_encoder = self.build_llm(llm_dir)
+        self.clip = self.build_clip(clip_dir)
+    def build_llm(self, model_dir):
+        from stepvideo.text_encoder.stepllm import STEP1TextEncoder
+        text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
+        print("Initialized text encoder...")
+        return text_encoder
+    def build_clip(self, model_dir):
+        from stepvideo.text_encoder.clip import HunyuanClip
+        clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
+        print("Initialized clip encoder...")
+        return clip
+    def embedding(self, prompts, *args, **kwargs):
+        with torch.no_grad():
+            try:
+                y, y_mask = self.text_encoder(prompts)
+                clip_embedding, _ = self.clip(prompts)
+                len_clip = clip_embedding.shape[1]
+                y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)   ## pad attention_mask with clip's length 
+                data = {
+                    'y': y.detach().cpu(),
+                    'y_mask': y_mask.detach().cpu(),
+                    'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
+                }
+                return data
+            except Exception as err:
+                print(f"{err}")
+                return None
+class Captionapi(Resource):
+    def __init__(self, caption_pipeline):
+        self.caption_pipeline = caption_pipeline
+    def get(self):
+        with lock:
+            try:
+                feature = pickle.loads(request.get_data())
+                feature['api'] = 'caption'
+                feature = {k:v for k, v in feature.items() if v is not None}
+                embeddings = self.caption_pipeline.embedding(**feature)
+                response = pickle.dumps(embeddings)
+            except Exception as e:
+                print("Caught Exception: ", e)
+                # Response needs str/bytes; passing the exception object itself fails
+                return Response(str(e), status=500)
+            return Response(response)
+class RemoteServer(object):
+    def __init__(self, args) -> None:
+        self.app = Flask(__name__)
+        root = Blueprint("root", __name__)
+        self.app.register_blueprint(root)
+        api = Api(self.app)
+        self.vae_pipeline = StepVaePipeline(
+            vae_dir=os.path.join(args.model_dir, args.vae_dir)
+        )
+        api.add_resource(
+            VAEapi,
+            "/vae-api",
+            resource_class_args=[self.vae_pipeline],
+        )
+        self.caption_pipeline = CaptionPipeline(
+            llm_dir=os.path.join(args.model_dir, args.llm_dir), 
+            clip_dir=os.path.join(args.model_dir, args.clip_dir)
+        )
+        api.add_resource(
+            Captionapi,
+            "/caption-api",
+            resource_class_args=[self.caption_pipeline],
+        )
+    def run(self, host="0.0.0.0", port=8080):
+        self.app.run(host, port=port, threaded=True, debug=False)
+if __name__ == "__main__":
+    args = parsed_args()
+    flask_server = RemoteServer(args)
+    flask_server.run(host="0.0.0.0", port=args.port)
\ No newline at end of file
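Both API handlers in `api/call_remote_server.py` follow the same pattern: unpickle the request body, tag it, drop `None` values, and forward the rest as keyword arguments. The sketch below reproduces that flow with a stub in place of the real pipeline so it runs without a GPU; `StubPipeline` and `handle` are hypothetical names. Note that `pickle.loads` can execute arbitrary code when given untrusted input, so a server like this is presumably meant for trusted local use only.

```python
import pickle


class StubPipeline:
    """Stand-in for CaptionPipeline; only mirrors the embedding() signature."""

    def embedding(self, prompts, *args, **kwargs):
        # Extra keys such as 'api' are absorbed by **kwargs, as in the original.
        return {"n_prompts": len(prompts), "extra": sorted(kwargs)}


def handle(raw_body, pipeline):
    """Mimic Captionapi.get(): unpickle, tag, drop None values, dispatch."""
    feature = pickle.loads(raw_body)  # unsafe on untrusted data
    feature["api"] = "caption"
    feature = {k: v for k, v in feature.items() if v is not None}
    return pickle.dumps(pipeline.embedding(**feature))


request_body = pickle.dumps({"prompts": ["a cat on the moon"], "seed": None})
reply = pickle.loads(handle(request_body, StubPipeline()))
print(reply)
```

Filtering out `None` values lets the client send optional fields without overriding the defaults of the pipeline method.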
--- a/assets/Step-Video-T2V.pdf
+++ b/assets/Step-Video-T2V.pdf
--- a/assets/dcvae.png
+++ b/assets/dcvae.png
--- a/assets/dit.png
+++ b/assets/dit.png
--- a/assets/dpo_pipeline.png
+++ b/assets/dpo_pipeline.png
--- a/assets/logo.png
+++ b/assets/logo.png
--- a/assets/model_architecture.png
+++ b/assets/model_architecture.png
--- a/assets/一名宇航员在月球上.mp4
+++ b/assets/一名宇航员在月球上.mp4
--- a/benchmark/Step-Video Prompt Guildlines.pdf
+++ b/benchmark/Step-Video Prompt Guildlines.pdf
--- a/benchmark/Step-Video 提示词指南.pdf
+++ b/benchmark/Step-Video 提示词指南.pdf
--- a/benchmark/Step-Video-T2V-Eval
+++ b/benchmark/Step-Video-T2V-Eval
+一名足球运动员在球场上带球奔跑，快速传球，然后射门得分。镜头聚焦于运动员的脚下动作和射门瞬间。
+篮球运动员运球过人，跳起投篮，篮球飞入篮筐，运动员庆祝进球。画面只专注于这一个投篮动作。
+网球选手准备发球，力量十足地将球击出。画面聚焦在发球的瞬间，球飞越球网，落向对手场地。
+一位跑步者在跑道上快速奔跑，镜头跟随他的步伐，展示腿部肌肉的紧张和跑步的速度感。
+羽毛球运动员在场地中央，跳起扣杀，羽毛球快速飞向对手的场地，聚焦在扣杀的瞬间。
+乒乓球选手专注挥拍，打出一个快速旋转的乒乓球，球在桌上弹跳并越过球网。
+健身爱好者在健身房举哑铃，展示力量训练的一个简单动作，聚焦于哑铃起落的过程。
+高尔夫选手站在球场上，挥杆击球，高尔夫球飞向远处的目标，镜头跟随球的飞行。
+滑板爱好者在平坦的道路上滑行，做一个简单的翻板动作，聚焦滑板腾空并旋转的瞬间。
+滑雪者在雪山坡道上高速滑行，聚焦滑雪板与雪面的接触，展示滑雪的流畅与速度感。
+冲浪者在大海中驾驭一波浪花，稳稳站在冲浪板上，镜头聚焦冲浪者与海浪的互动。
+一个运动员助跑起跳，落到沙坑里。
+展示厨师用大火炒菜的场景，食材有青菜、红椒、肉片等。特写炒勺翻炒时食材的色泽与油光，锅边飘出热气。慢镜头展示淋酱油或调味料的瞬间，最后定格在一道色香味俱全的成品热炒盘。
+厨师细致地在餐盘上摆盘，精致的法式烤鸭或牛排，佐以红酒酱汁，背景是柔和的烛光。展示刀叉切开肉质的细节，慢镜头特写美食的纹理和酱汁流淌。结束于成品菜整齐摆放在白色瓷盘中。
+展示厨师拉伸新鲜制作的意大利面条，随后将面条投入沸水中煮沸，慢镜头捕捉泡沫。接着是意面与番茄酱拌匀的过程，最后是撒上帕玛森芝士的画面。最终定格在热气腾腾的意大利面成品。
+展现寿司师傅切割新鲜三文鱼片的精准手法，之后将鱼片放在醋饭上轻压成型。慢镜头展示手握寿司的过程，最后将寿司排列整齐在木制餐盘上，旁边摆放一碟酱油和几块腌姜。
+展示厨师翻动锅中的咖喱，黄油鸡块在浓郁酱汁中沸腾。特写印度烤饼（Naan）放入炭炉烘烤的过程，最后将咖喱和烤饼放在盘中，撒上香菜。慢镜头展示热腾腾的咖喱香味四溢。
+展示厨师在平底锅上烹制玉米饼，随后将切好的牛肉、鳄梨、番茄丁等食材放入饼中，慢镜头展示浇上酸奶酱的瞬间。最后是多个塔可摆在木盘上，配上墨西哥辣椒。
+展现厨师在大锅中翻炒泰式炒河粉，捕捉食材在高温下翻滚的瞬间。随后挤入青柠汁和撒入碎花生，最后是热气腾腾的炒河粉在盘中呈现。特写展示薄荷叶和香茅装饰的细节。
+展示厨师在烤箱中烤制新鲜的海鲈鱼，鱼皮逐渐变得酥脆。接着是特写淋上橄榄油和撒上迷迭香的画面，最后定格在一盘鱼与烤蔬菜的组合上，背景柔和自然。
+展示厨师将鹰嘴豆泥舀入盘中，慢镜头展示撒入橄榄油和撒上烟熏辣椒粉的过程。接着展示烤肉串在炭火上翻烤的细节，最后定格在摆满各种中东小食的餐桌上。
+展示厨师在烤架上翻动汉堡肉饼，慢镜头捕捉芝士在肉饼上慢慢融化的瞬间。随后展示将番茄、生菜和酱料加入汉堡中，最后定格在堆满薯条和汉堡的餐盘上。
+厨师切割土耳其烤肉，然后把肉片装盘。
+镜头从一片青翠的山谷慢慢升起，远处连绵起伏的山脉逐渐展现在眼前，云雾缭绕在山峰之间，阳光透过云层洒在山坡上，给大地染上一层金色光辉，山间的小溪静静流淌，鸟儿在天空中飞翔。
+从海平面缓缓推向远处，广阔的蓝色海洋在阳光下闪烁着粼粼波光。远方，海天相接处，白帆点点，海风轻拂，海浪轻轻拍打着海岸的礁石，海鸥在低空盘旋，天空一片湛蓝，仿佛与海洋融为一体。
+镜头穿过茂密的森林树冠，阳光透过高大的树木缝隙洒在地面上，绿叶随着微风轻轻摇曳。地面上覆盖着厚厚的苔藓和落叶，偶尔一只松鼠迅速从一棵树跳到另一棵，整个画面充满了森林的静谧与生命的气息。
+镜头从远处逐渐推近到一座巨大的瀑布，水流从悬崖顶端倾泻而下，形成雾气缭绕的白色水幕。瀑布下方的水潭宁静而清澈，水流缓缓向前汇入一条小河，四周绿树环绕，整个画面充满自然的力量与宁静。
+镜头缓慢移动，展现陆家嘴天际线的壮丽景象，金茂大厦、上海中心大厦和东方明珠塔在晨曦或傍晚时分的余晖中熠熠生辉。黄浦江静静流淌，江上的船只缓缓驶过，镜头平稳推进，捕捉到城市与自然和谐交融的瞬间。
+镜头平滑扫过广阔的红场，展示标志性的圣瓦西里大教堂和克里姆林宫城墙。背景中，红场的鹅卵石广场在晨曦或夕阳的映衬下显得古老而庄严，人们慢慢走过，整个场景充满历史与现代交织的气息。
+从远处缓慢推近，金门大桥红色的雄伟桥身在晨雾或夕阳下格外显眼。桥下的海水缓缓流动，偶尔有帆船经过，镜头平稳地从桥底滑过，展现其宏伟的结构，直至镜头缓慢拉远，将桥梁与海湾景观尽收眼底。
+镜头从维多利亚港的水面缓慢上升，远处的高楼大厦和山脉依次展现。渡轮慢慢驶过，水面波光粼粼，远处的天际线清晰可见，九龙与香港岛的对比在镜头中自然显现，整个画面流畅而富有层次感。
+从地面仰视，镜头慢慢上升，上海环球金融中心的玻璃外墙在阳光下闪闪发光，随着镜头逐渐升高，周围的高楼和城市景观尽收眼底，最后镜头俯瞰整个上海天际线，展现现代都市的恢弘与繁华。
+镜头从塞纳河的水面缓缓向前推移，河边的古老建筑和桥梁在镜头中逐渐展开，河上的游船悠闲地驶过。岸边的行人漫步，秋天的树叶在微风中轻轻摇曳，整条河流在日落余晖的映衬下显得宁静而浪漫。
+从东京六本木繁华的街道平缓推进，霓虹灯光逐渐点亮，街道两旁高耸的建筑物展现出现代都市的活力。镜头沿着人行道缓缓移动，捕捉到行人、咖啡馆和商店的热闹景象，最后镜头推向六本木之丘，远处的东京塔在夜空中闪烁。
+天安门城楼下一队护卫队员在升旗。
+一只狮子在广袤的草原上缓慢前行，阳光洒在它的鬃毛上。它停下脚步，盯着前方的猎物，随后突然跃起，飞奔向猎物。
+在一片绿色的公园里，一只狗欢快地奔跑着，追逐着一只飞碟。它快速跳起，咬住飞碟，兴奋地跑回来，尾巴开心地摇摆着。
+一只鸟在郁郁葱葱的森林中自由飞翔，双翅展开穿梭在树枝之间。它停在一根树枝上，唱起动听的鸣叫声，树叶在微风中轻轻摇曳。
+在碧蓝的海水中，一群海豚欢快地游动。一只海豚跃出水面，在阳光下划出优美的弧线，溅起的水花在阳光下闪闪发光。
+在非洲的草原上，一只大象缓慢地走着，用长长的鼻子卷起一束草丛。它迈步走到一处小水坑，用鼻子喷洒水花，洒落在它的背上。
+在深蓝的海底，一只鲨鱼优雅地游动，身体在水中轻轻摆动。它突然加速，朝着一群鱼群冲过去，鱼群四散逃开。
+在一片茂密的丛林中，一条蛇安静地在地面上蜿蜒前进。它缓缓靠近一棵树，轻巧地盘绕在树枝上，观察着周围的动静。
+在开满鲜花的花园里，一只蝴蝶轻盈地飞舞，它在不同的花朵上翩翩起舞，最后停在一朵鲜艳的花瓣上，轻轻扇动着翅膀。
+在澳大利亚的荒野中，一只袋鼠用强壮的后腿跳跃着，尘土在它脚下飞扬。它停下脚步，转头张望，看到远处的另一只袋鼠，随后继续跳跃前进。
+在冰雪覆盖的南极，一群企鹅排成一列在冰面上滑行。一只企鹅跳入冰冷的水中，快速游动，划过冰下清澈的海水。
+在开阔的草原上，一只猎豹突然爆发出惊人的速度，迅速追逐前方的猎物。它身体紧贴地面，流线型的身影在阳光下闪动。
+一只小猫在家中拨弄小球玩，追着小球到处跑。
+一位穿着圣诞装的小女孩在屋内装饰圣诞树，她将最后一个星星挂上顶端，然后笑着走向壁炉前的礼物堆，兴奋地拆开礼物，拿出一只红色毛绒玩具，紧紧抱住。
+一个小男孩穿着红色新衣，在家门口挂起灯笼，接着放下灯笼，转身奔向厨房，与家人一起包饺子。最后，他拿起一个煮好的饺子，开心地吃下。
+一个孩子穿着南瓜造型的万圣节服装，在夜晚手提南瓜灯走过街道。他敲开一户人家的门，主人微笑着递给他糖果，他高兴地接过，放进小南瓜篮子里。
+一家人围坐在餐桌前，父亲将烤火鸡放在餐桌中央，大家举起饮料相互碰杯致谢。然后，孩子们争先恐后地拿起盘子，开心地夹取火鸡肉。
+一家三口在月光下的庭院中坐着，母亲切开月饼递给孩子。孩子拿起一块月饼，仰头看着明亮的圆月，笑着咬下一口。
+一对情侣在公园的长椅上坐着，男生从背后拿出一束红玫瑰，递给女生。女生惊喜地接过，随后两人一起开心地笑着，互相靠近。
+在阳光明媚的户外，一群人拿着水桶和水枪互相泼水。一个男孩从桶里舀水洒向朋友，朋友笑着用水枪回击，大家的衣服都湿透了，但笑容灿烂。
+在大街上，一群小学生挥舞着小国旗，齐声唱着爱国歌曲。一个孩子举起国旗，面向镜头笑得灿烂，其他孩子围绕在旁边一起欢呼。
+一个小女孩在草地上搜寻彩蛋，她蹲下来发现一个彩蛋，开心地捡起并放进篮子里。然后，她笑着继续寻找，镜头跟随她的动作。
+一家人围坐在餐桌前，桌上摆满丰富的美食。家长带领大家一起祈祷，之后互相微笑点头，一同享用餐食。
+划手们奋力划着龙舟，在江面上进行比赛。
+在一个充满活力的户外运动场上，人们进行跑步和瑜伽锻炼，旁边的桌子上摆满了色彩缤纷的健康美食。运动结束后，参与者围坐在一起享受营养餐点，形成运动与美食的和谐结合。
+一个现代化的研究实验室设置在茂密的森林中，窗外是自然景观。实验人员使用高科技设备分析自然资源，实验室里的屏幕显示着森林数据的实时变化，展现自然与科技的共生。
+画家在一张巨大的画布上作画，颜料仿佛在空中漂浮。镜头逐渐转向画作的细节，这些细节与显微镜下的细胞结构惊人地相似，展现艺术与科学的交融。
+一个时装秀舞台，模特身着可持续材料制成的衣物在T台上行走。背景是自然风光，随着镜头移动，衣物的材质和设计灵感与大自然的元素互相呼应，体现时尚与环保的紧密联系。
+一个宁静的房间里，音乐治疗师轻柔地弹奏钢琴，病患安静地坐在沙发上闭目冥想。音乐的旋律与人的心跳同步，灯光变得柔和，房间氛围逐渐呈现出放松与治愈的效果。
+现代智能家居环境，智能灯光系统随着屋主的步伐自动调节。镜头从客厅的角落缓慢移动到厨房，智能冰箱和语音助手自动响应，整个过程中没有场景切换，展示智能科技如何无缝融入日常生活。
+一位旅行者在咖啡馆里工作，桌上摆放着笔记本电脑，背景是世界各地的标志性建筑物，窗外的风景不断变化。这个场景展现了旅行与远程工作的无缝融合。
+一位作家在书房里写作，他的想象化作游戏角色的冒险情境。文字渐渐变成三维世界，书中的故事与互动游戏场景逐步重叠，形成文学与游戏的交汇。
+电影院里观众戴着VR头盔，随着屏幕中的电影情节发展，他们的座椅也同步运动。观众不仅在观看，还在与场景互动，沉浸式电影体验与现实环境融为一体。
+镜头从一片有机农田的土壤中慢慢升起，展示雨水收集系统和自然灌溉技术的应用，背景是远处的温室大棚。镜头平稳移动，展示农民通过智能手机控制灌溉和监测作物生长。
+一座繁华的城市与茂密的森林并存。城市中的摩天大楼使用太阳能面板，绿色能源供应着经济活动，展现经济发展与环境保护的平衡与共存。
+自动化农业机器在农田中播种和收割，无人机在空中监控作物的生长情况。
+天空中漂浮着破碎的城市废墟，建筑在失重的状态下缓慢旋转，植物从裂缝中生长出来，而城市的居民生活在这些漂浮的碎片上，仿佛重力已不再存在。
+无尽的沙漠中，地面像镜子一样清晰地倒映着星空，仿佛天地逆转。每当主角走过时，脚下的星星会如水纹般散开，营造出一种梦幻的效果。
+天空中漂浮着巨大、半透明的生物，像鲸鱼一样在云层中游动，它们的身影在地面上投下巨大的阴影，每一次呼吸都会改变天气和大气的状态。
+一片广阔的湖泊上，瀑布从湖底向上飞流直上，逆着重力喷向天空，最终消失在无尽的云层中。主角必须攀登这奇异的倒置瀑布，寻找隐藏在瀑布尽头的秘密。
+一个全是镜面结构的星球，所有的地形、建筑、甚至天空都由光滑的镜面组成，反射着无尽的空间。主角在这片迷宫中迷失，映像与现实难以区分，虚实交错让他们的心智受到挑战。
+蜘蛛侠在纽约市高楼间灵活穿梭，镜头紧跟着他的一举一动。蛛丝不断射出，附着在摩天大楼上，他轻盈地在空中荡过。
+钢铁侠从地面快速升空，燃烧的推进器在他身后喷射出强烈的蓝色火焰。镜头紧随他穿越云层，直冲高空，远处的地平线慢慢拉开，城市的灯光逐渐变得渺小。
+闪电侠在城市的街道上以光速奔跑，地面在他脚下迅速模糊拉长，周围的世界仿佛停止了运动。他的每一次加速都伴随着闪电的迸发，建筑物和车辆在他的身边变成模糊的影子。
+奇异博士站在街道中央，挥动手中的魔法符印，周围的建筑开始像万花筒一样变形、旋转。
+纳美人骑上了翼兽，腾空而起，镜头随着他飞行，掠过发光的森林、悬浮的山峰和瀑布，巨大的星球悬挂在天边。
+擎天柱身上的机械部件重新组装，从机器人变成一辆卡车。
+在明亮办公室内，一名工作者专注地在电脑前敲打键盘，偶尔抬头思考，轻轻皱眉，旁边的同事在小声交谈。
+教室里，学生认真听讲，笔记本上快速写字，偶尔抬头看向黑板，微笑点头。阳光透过窗户洒进教室，环境宁静而专注。
+画室中，艺术家专注地用画笔在画布上涂抹颜色，偶尔停下来后退一步，凝视自己的作品，眉头微皱思索着下一步。
+教师在教室中温和地向学生解释问题，学生们点头回应，有的在笔记本上写下关键点，教室气氛轻松而专注。
+医生在诊室里认真查看患者的病历，偶尔对患者微笑表示安慰，轻轻点头，耐心地解释治疗方案，患者安心聆听。
+科学家在实验室中调整仪器，凝视数据屏幕，偶尔和旁边的同事讨论，轻轻点头表示认可，整个过程严谨而专注。
+工程师站在工地前，手持平板电脑查看设计图，偶尔转身指向正在建造的建筑，与同事讨论，表情认真而自信。
+农田中，农民弯腰检查作物的生长情况，轻轻拨弄土壤，脸上流露出满足的微笑，阳光洒在农田上，场景平静祥和。
+军人在训练场上，专注进行体能训练，汗水滑落，偶尔与战友互动，相互鼓励，眼神坚定，动作沉稳有力。
+警察在街头巡逻，眼神警惕，偶尔微笑向路人点头示意，保持警觉与友善的互动，展示社区守护者的职责感。
+运动员在体育馆内训练，专注于力量训练和速度跑步，汗水浸透衣服，教练在一旁鼓励，眼中充满奋斗的决心。
+一栋居民楼着火了，一个消防员登上云梯把阳台上的小女孩抱了下来。
+巴斯光年站在镜头前，做出一个充满自信的英雄姿势，胸前闪耀着他的徽章。他随即按下手腕上的“雷射”按钮，模仿激光发射的声音，接着他展开翅膀，脸上露出英勇的微笑。
+擎天柱缓慢地变形，从卡车形态转换为机器人。转换完成后，他拔出能量斧，做出挥斧动作，表情坚定，眼睛发出蓝色光芒，最后以双手抱胸的姿势站立，展示领袖风范。
+辛巴在草地上缓慢走动，脸上露出自信而坚毅的神情。随后他仰头向天空发出一声吼叫，镜头特写他的脸庞，展现他作为狮王的威严。
+瓦力蹲在地上捡起一块垃圾，仔细观察后把它放进体内。接着他眨了眨大眼睛，做出有点害羞的表情，然后快速展开轮子，滑动起来，开心地挥舞着手臂。
+史瑞克坐在沼泽旁，手里拿着一只虫子，犹豫片刻后放入口中，脸上露出调皮的表情。他站起身，双手叉腰大笑，整个场景充满他的怪趣幽默感。
+苏利文在怪兽训练场上，准备展示力量。他弯腰抓住一块巨石，脸上带着努力的表情，用力将石头抛向空中，接着他得意地拍了拍手掌，脸上露出满意的笑容。
+阿宝拿着他的功夫棒，试图做出高难度的功夫动作。几次失败后，他笨拙地摔倒在地，揉了揉头，随后站起来大笑，做出一个搞笑的武术姿势，展现他的幽默和乐观。
+闪电麦昆在赛道上快速驶过，轮胎与地面摩擦发出刺耳的声音。镜头特写他的表情，充满自信与兴奋。然后他快速急刹，漂移转弯，喷出一股沙尘，最终得意地微笑。
+希德手里拿着一个坚果，做出非常滑稽的动作来摆弄它，结果不小心把坚果掉到地上。希德耸耸肩，随后做出一个搞怪的表情，抬起手继续向前走，边走边跳舞。
+小黄人拿着一根香蕉，满脸喜悦。他快速剥开香蕉，准备吃时却不小心滑了一跤。随后他在地上翻滚，手舞足蹈，依然兴高采烈地咯咯笑着。
+阿凡达角色在潘多拉的丛林中快速奔跑，灵巧地跳过岩石和树藤。镜头特写他那双明亮的眼睛，他缓缓停下，深深地呼吸，脸上露出一种与自然相连的宁静表情。
+尼莫在海底的植物中游动，钻进了一个大大的海葵。
+一个平静的早晨，镜头从街道的一端缓慢地移动，捕捉行人、车辆的自然运动。行人穿过街道，咖啡店外有人交谈，所有细节都在同一个连续镜头中，毫无剪辑痕迹，观众仿佛亲历其中。（场景：城市街道）
+一个孩子接球的瞬间被放慢，足球飞向空中，阳光穿过树叶，捕捉到球的每一个旋转和孩子脸上的兴奋表情。镜头慢慢追踪足球和孩子的动作，强化了那一刻的紧张感。（场景：公园里踢足球）
+镜头在每个锻炼动作间快速切换。举重、跑步机上的汗水飞溅、快速的俯卧撑，交替出现的短促镜头让观众感受到强烈的节奏和能量。画面在锻炼器材、运动鞋特写和人物表情之间不断切换，强化了紧张感。（场景：健身房锻炼）
+镜头从山腰缓缓升起，越过树顶，俯瞰山脉和远处的湖泊。随着无人机平稳飞行，整个自然风光一览无余，没有突然的视角转换，给观众带来宽广的视野和宁静的感觉。（场景：山顶的风景）
+镜头从画廊的入口处缓慢推向墙上的一幅画，随着画面逐渐靠近，画的细节和质感开始显现。观众逐渐被吸引到画的世界中，仿佛走进了作品本身。（场景：画廊中的一幅画）
+镜头从远处的草地上开始，逐渐向远山变焦。随着变焦的推进，山的轮廓变得越来越清晰，仿佛观众自己在向山靠近，创造一种逐渐揭示的视觉张力。（场景：郊外的一座远山）
+人物跳入泳池的瞬间，水花四溅，突然定格在空中。观众可以仔细欣赏每滴水珠和人物的姿势，定格画面让这一刻永恒。（场景：人物跳入水中的瞬间）
+广场上的人群和交通快速流动，日落时天空的色彩从明亮到橙红再到深蓝。时间流逝，城市从白天进入夜晚，城市的灯光点亮，整个过程无缝连接，呈现出城市的活力。（场景：繁忙的城市广场）
+观众以第一人称视角走在森林的小径上，树枝从身边擦过，地上的落叶发出沙沙的声音。视角自然摇晃，仿佛观众自己正在行走，带来沉浸式的体验。（场景：穿越森林的小路）
+镜头对焦在前景中正在交谈的两个人，背景中的窗外风景也同样清晰可见。观众不仅能专注于对话，还能注意到窗外行人和路过的车辆，形成丰富的画面层次。（场景：咖啡馆里的对话）
+树叶挡住大部分画面，可以隐隐约约地看见一个女孩在花园里赏花。
+一辆经典的50年代轿车行驶在老式小镇街道上，夕阳照耀，橘黄色的光洒在复古的霓虹灯招牌和砖瓦建筑上，街边行人穿着复古服饰，背景有微微闪烁的霓虹灯。
+一个充满高科技元素的未来城市，摩天大楼由金属和玻璃构成，充满了流线型设计。空中有无声的飞行汽车穿梭，地面上行人穿着简约的光滑服饰，街道上的广告屏幕播放着数字化的图像。画面颜色冷峻，带有金属光泽。
+柔和的光线洒在湖面上，行人漫步在湖边的小路上，湖面反射出摇曳的树影。空气中飘着轻柔的风，花朵和树叶随着微风轻轻摆动。画面中充满着明亮的色彩和朦胧的边缘，给人以温暖的感受。
+空旷的沙漠中，一个人正在无重力地漂浮，沙漠中的时钟融化着滴落，天空中漂浮着巨大的眼睛和漂浮的楼梯。行人飘浮而过，时间和空间交错，带来梦幻般的体验。
+黄昏时分，天空燃烧着橙红色的云朵，路人在昏暗的街道上快速行走，神情凝重。建筑物和树木的轮廓被夸张拉长，整个场景充满情绪化的张力，暗示着不安和焦虑的氛围。
+工业化城市的一角，工人们在精确的格子结构中操作机械，整个场景被几何形状的建筑和桥梁分割。人物的动作简洁有力，伴随着机器的节奏感，场景呈现出工业与几何的完美结合。
+一条装饰着复杂花卉图案的走廊，人物穿梭于曲线优美的建筑中。每一步伴随着背景中精致的植物和自然图案的变化，动与静之间流畅地过渡，场景带有柔和而华丽的视觉效果。
+一个荒诞的空间中，各种无意义的物品悬浮在空中，人物在怪异的姿势下行走，时而停顿，时而突然转向。背景中的物体杂乱无章，镜头捕捉了整个场景的混乱和无序，打破了常规逻辑。
+街道两旁的墙上布满了涂鸦和壁画，行人走过时，涂鸦中的图案仿佛在他们背后活了过来。镜头跟随他们的步伐，墙上的艺术作品不断变化，呈现出一种街头与艺术交融的动态画面。
+一个空旷的白色房间中，只有一个人缓缓走动，周围一切都极为简洁，没有多余的装饰。每一步都在回响，镜头专注于简单的线条和干净的背景，极简的动作与场景完美契合。
+一片色彩斑斓的森林中，鸟儿和动物以不合常理的亮丽色彩在奔跑和飞翔，树叶与天空的颜色对比强烈。一个人穿过森林，背景中的色彩不断增强，形成了大胆、粗放的视觉冲击。
+一个邋遢的男人在一个超级无敌乱的屋子里吃泡面。
\ No newline at end of file
--- a/benchmark/evaluation.py
+++ b/benchmark/evaluation.py
+from stepvideo.diffusion.video_pipeline import StepVideoPipeline
+import torch.distributed as dist
+import torch
+from stepvideo.config import parse_args
+from stepvideo.utils import setup_seed
+from stepvideo.parallel import initialize_parall_group, get_parallel_group
+def load_bmk_prompt(path):
+    prompts = []
+    with open(path, 'r', encoding='utf-8') as file:
+        for line in file:
+            prompts.append(line.strip()) 
+    return prompts
+if __name__ == "__main__":
+    args = parse_args()
+    initialize_parall_group(ring_degree=args.ring_degree, ulysses_degree=args.ulysses_degree)
+    local_rank = get_parallel_group().local_rank
+    device = torch.device(f"cuda:{local_rank}")
+    setup_seed(args.seed)
+    pipeline = StepVideoPipeline.from_pretrained(args.model_dir).to(dtype=torch.bfloat16, device=device)
+    pipeline.setup_api(
+        vae_url = args.vae_url,
+        caption_url = args.caption_url,
+    )
+    prompts = load_bmk_prompt('benchmark/Step-Video-T2V-Eval')
+    for prompt in prompts:
+        videos = pipeline(
+            prompt=prompt, 
+            num_frames=args.num_frames, 
+            height=args.height, 
+            width=args.width,
+            num_inference_steps = args.infer_steps,
+            guidance_scale=args.cfg_scale,
+            time_shift=args.time_shift,
+            pos_magic=args.pos_magic,
+            neg_magic=args.neg_magic,
+            output_file_name=prompt[:50]
+        )
+    dist.destroy_process_group()
\ No newline at end of file
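`benchmark/evaluation.py` reads one prompt per line and uses the first 50 characters of each prompt as the output file name. A slightly more defensive variant is sketched below, assuming that blank lines should be skipped and that characters unsafe in file names should be replaced; `load_prompts` and `safe_file_name` are hypothetical helpers, not part of the repository.

```python
import os
import re
import tempfile


def load_prompts(path):
    """Read one prompt per line, ignoring blank lines."""
    with open(path, "r", encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]


def safe_file_name(prompt, max_len=50):
    """Truncate a prompt and replace path-unsafe characters with underscores."""
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", prompt[:max_len]).strip("_")
    return name or "video"


# Demonstration with a temporary prompt file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "prompts.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("a cat / a dog\n\n一名宇航员在月球上\n")
    demo_prompts = load_prompts(demo_path)

print([safe_file_name(p) for p in demo_prompts])
```

Non-ASCII prompts such as the Chinese benchmark entries pass through unchanged apart from truncation.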
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
+FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
\ No newline at end of file
--- a/fix.sh
+++ b/fix.sh
+#!/bin/bash
+cp modified/config.py /usr/local/lib/python3.10/site-packages/xfuser/config/
+cp modified/envs.py /usr/local/lib/python3.10/site-packages/xfuser/
\ No newline at end of file
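`fix.sh` copies the patched files into a hard-coded `site-packages` path, which only works when `xfuser` is installed under `/usr/local/lib/python3.10`. One possible alternative, sketched here under the assumption that the target layout is the same as in `fix.sh`, is to look up the installed package directory at run time; `package_dir` and `plan_copies` are hypothetical helpers.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory of an installed package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


def plan_copies(pkg_dir, source_dir="modified"):
    """Map the patched files to their destinations, mirroring fix.sh."""
    return [
        (os.path.join(source_dir, "config.py"), os.path.join(pkg_dir, "config", "config.py")),
        (os.path.join(source_dir, "envs.py"), os.path.join(pkg_dir, "envs.py")),
    ]


xfuser_dir = package_dir("xfuser")
if xfuser_dir is None:
    print("xfuser is not installed; nothing to patch")
else:
    for src, dst in plan_copies(xfuser_dir):
        print(f"would copy {src} -> {dst}")
```

The sketch only prints the planned copies; performing them (for example with `shutil.copy`) is left out so the example has no side effects.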
--- a/icon.png
+++ b/icon.png
--- a/model.properties
+++ b/model.properties
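+# Unique model identifier
+modelCode=1535
+# Model name
+modelName=Step-Video-T2V_pytorch
+# Model description
+modelDescription=Step-Video-T2V is a state-of-the-art text-to-video pre-trained model with 30 billion parameters, capable of generating videos of up to 204 frames.
+# Application scenarios
+appScenario=推理,视频生成,影视,电商,教育,广媒
+# Framework type
+frameType=Pytorch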