tests
debug
dev
*.egg-info
__pycache__
lib
results
MIT License
Copyright (c) 2025 stepfun-ai
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Step-Video-T2V
## Paper
`Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model`
- https://arxiv.org/abs/2502.10248
## Model Architecture
The overall architecture of Step-Video-T2V is shown in the figure below. It is a state-of-the-art text-to-video pre-trained model with 30 billion parameters, capable of generating videos up to 204 frames long.
<div align=center>
<img src="./assets/model_architecture.png"/>
</div>
## Algorithm
Step-Video-T2V is a diffusion Transformer (DiT) based model trained with flow matching. Its design is as follows:
- A deep-compression variational autoencoder, Video-VAE, designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining excellent video reconstruction quality.
- User prompts are encoded with two bilingual text encoders to handle both English and Chinese.
- A DiT with 3D full attention is trained with flow matching and is used to denoise the input noise into latent frames (a loss sketch is given after this list).
- A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos.
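As a rough illustration of the flow-matching objective mentioned above, the sketch below shows a generic rectified-flow style loss. This is not the project's actual training code; the tensor shapes, the conditioning interface, and the velocity-prediction convention are assumptions.
```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, x0, text_emb):
    """Generic flow-matching loss sketch for a DiT denoiser.

    x0: clean video latents, e.g. (B, C, F, H, W); text_emb: text conditioning.
    The model learns the constant velocity (x0 - noise) along the straight
    path x_t = (1 - t) * noise + t * x0.
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)     # one timestep per sample
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))          # broadcast over latent dims
    x_t = (1 - t_) * noise + t_ * x0                  # point on the straight path
    target = x0 - noise                               # velocity of that path
    pred = dit(x_t, t, text_emb)                      # DiT predicts the velocity
    return F.mse_loss(pred, target)
```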
## Environment Setup
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
# Replace <your IMAGE ID> with the ID of the image pulled above
docker run -it --name T2V_test --shm-size=1024G --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/Step-Video-T2V_pytorch:/home/Step-Video-T2V_pytorch <your IMAGE ID> /bin/bash
cd /home/Step-Video-T2V_pytorch
pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
pip uninstall asyncio
sh fix.sh
```
### Dockerfile (Option 2)
```
cd /home/Step-Video-T2V_pytorch/docker
docker build --no-cache -t step-video-t2v:latest .
docker run -it --name T2V_test --shm-size=1024G --device=/dev/kfd --device=/dev/dri/ --privileged --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 --ipc=host --network host --group-add video -v /opt/hyhal:/opt/hyhal:ro -v $PWD/Step-Video-T2V_pytorch:/home/Step-Video-T2V_pytorch step-video-t2v:latest /bin/bash
cd /home/Step-Video-T2V_pytorch
pip install -e . -i https://mirrors.aliyun.com/pypi/simple/
pip uninstall asyncio
sh fix.sh
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the 光合开发者社区 (developer community):
- https://developer.hpccube.com/tool/
```
DTK driver: dtk24.04.3
python: python3.10
torch: 2.3.0
torchvision: 0.18.1
triton: 2.1.0
flash-attn: 2.6.1
```
`Tip: the DTK driver, python, torch, and other DCU-related tool versions above must correspond to each other exactly.`
2. Install the remaining, non-DCU-specific libraries according to requirements.txt
```
cd /home/Step-Video-T2V_pytorch
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip uninstall asyncio
sh fix.sh
```
## Dataset
`None`
## Training
`None`
## Inference
Pretrained weight directory structure:
```
/home/Step-Video-T2V_pytorch
└── stepfun-ai/stepvideo-t2v
```
### Single-Node Multi-GPU
```
# Adjust TORCH_CUDA_ARCH_LIST to match your DCU architecture
export TORCH_CUDA_ARCH_LIST="8.0"
# Replace where_you_download_dir with the path to your downloaded model
HIP_VISIBLE_DEVICES=0 python api/call_remote_server.py --model_dir where_you_download_dir &
# To avoid running out of VRAM, put the server and the client on different GPU IDs; the other parameters in run.sh can also be adjusted to your hardware
export HIP_VISIBLE_DEVICES=1,2
sh run.sh
```
For more details, see [`README_orgin`](./README_orgin.md) from the upstream project.
## Result
Example of a generated video:
![infer result](./assets/一名宇航员在月球上.mp4)
### Accuracy
`None`
## Application Scenarios
### Algorithm Category
`Video generation`
### Key Application Industries
`Film and TV, e-commerce, education, broadcast media`
## Pretrained Weights
The Hugging Face weights can be downloaded from:
- [stepfun-ai/stepvideo-t2v](https://huggingface.co/stepfun-ai/stepvideo-t2v)
`Note: downloading via a mirror is recommended: export HF_ENDPOINT=https://hf-mirror.com`
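For example, the weights can be fetched with `huggingface_hub` (a minimal sketch; the local directory is an assumption chosen to match the layout shown in the Inference section):
```python
import os
# Set the mirror endpoint before importing huggingface_hub (optional, see the note above).
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="stepfun-ai/stepvideo-t2v",
    local_dir="stepfun-ai/stepvideo-t2v",  # matches the pretrained weight directory structure above
)
```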
## Source Code Repository and Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/Step-Video-T2V_pytorch.git
## References
- https://github.com/stepfun-ai/Step-Video-T2V
<p align="center">
<img src="assets/logo.png" height=100>
</p>
<div align="center">
<a href="https://yuewen.cn/videos"><img src="https://img.shields.io/static/v1?label=Step-Video&message=Web&color=green"></a> &ensp;
<a href="https://arxiv.org/abs/2502.10248"><img src="https://img.shields.io/static/v1?label=Tech Report&message=Arxiv&color=red"></a> &ensp;
<a href="https://x.com/StepFun_ai"><img src="https://img.shields.io/static/v1?label=X.com&message=Web&color=blue"></a> &ensp;
</div>
<div align="center">
<a href="https://huggingface.co/stepfun-ai/stepvideo-t2v"><img src="https://img.shields.io/static/v1?label=Step-Video-T2V&message=HuggingFace&color=yellow"></a> &ensp;
<a href="https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo"><img src="https://img.shields.io/static/v1?label=Step-Video-T2V-Turbo&message=HuggingFace&color=yellow"></a> &ensp;
</div>
## 🔥🔥🔥 News!!
* Mar 17, 2025: 👋 We release [Step-Video-TI2V](https://github.com/stepfun-ai/Step-Video-Ti2V), an image-to-video model based on Step-Video-T2V.
* Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V. [Download](https://huggingface.co/stepfun-ai/stepvideo-t2v)
* Feb 17, 2025: 👋 We release the inference code and model weights of Step-Video-T2V-Turbo. [Download](https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo)
* Feb 17, 2025: 🎉 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2502.10248)
## Video Demos
<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
<tr>
<td><video src="https://github.com/user-attachments/assets/9274b351-595d-41fb-aba3-f58e6e91603a" width="100%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/2f6b3ad5-e93b-436b-98bc-4701182d8652" width="100%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/67d20ee7-ad78-4b8f-80f6-3fdb00fb52d8" width="100%" controls autoplay loop muted></video></td>
</tr>
<tr>
<td><video src="https://github.com/user-attachments/assets/9abce409-105d-4a8a-ad13-104a98cc8a0b" width="100%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/8d1e1a47-048a-49ce-85f6-9d013f2d8e89" width="100%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/32cf4bd1-ec1f-4f77-a488-cd0284aa81bb" width="100%" controls autoplay loop muted></video></td>
</tr>
<tr>
<td><video src="https://github.com/user-attachments/assets/f95a7a49-032a-44ea-a10f-553d4e5d21c6" width="100%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/3534072e-87d9-4128-a87f-28fcb5d951e0" width="100%" controls autoplay loop muted></video></td>
<td><video src="https://github.com/user-attachments/assets/6d893dad-556d-4527-a882-666cba3d10e9" width="100%" controls autoplay loop muted></video></td>
</tr>
</table>
## Table of Contents
1. [Introduction](#1-introduction)
2. [Model Summary](#2-model-summary)
3. [Model Download](#3-model-download)
4. [Model Usage](#4-model-usage)
5. [Benchmark](#5-benchmark)
6. [Online Engine](#6-online-engine)
7. [Citation](#7-citation)
8. [Acknowledgement](#8-acknowledgement)
## 1. Introduction
We present **Step-Video-T2V**, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames. To enhance both training and inference efficiency, we propose a deep compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, **Step-Video-T2V-Eval**, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.
## 2. Model Summary
In Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded using two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.
<p align="center">
<img width="80%" src="assets/model_architecture.png">
</p>
### 2.1. Video-VAE
A deep compression Variational Autoencoder (VideoVAE) is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. This compression not only accelerates training and inference but also aligns with the diffusion process's preference for condensed representations.
<p align="center">
<img width="70%" src="assets/dcvae.png">
</p>
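As a concrete example of these compression ratios, the latent grid for the resolutions used later in this README can be worked out as below (a rough sketch; how the temporal axis handles the first frame is an assumption, so the latent frame count is approximate):
```python
def latent_shape(height, width, frames, spatial=16, temporal=8):
    """Approximate latent grid after 16x16 spatial and 8x temporal compression."""
    return frames // temporal, height // spatial, width // spatial

print(latent_shape(544, 992, 204))  # roughly (25, 34, 62) latent frames x height x width
print(latent_shape(768, 768, 204))  # roughly (25, 48, 48)
```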
### 2.2. DiT w/ 3D Full Attention
Step-Video-T2V is built on the DiT architecture, which has 48 layers, each containing 48 attention heads, with each head’s dimension set to 128. AdaLN-Single is leveraged to incorporate the timestep condition, while QK-Norm in the self-attention mechanism is introduced to ensure training stability. Additionally, 3D RoPE is employed, playing a critical role in handling sequences of varying video lengths and resolutions.
<p align="center">
<img width="80%" src="assets/dit.png">
</p>
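These hyperparameters allow a back-of-the-envelope parameter estimate (a rough sketch only; the 4x MLP expansion and a cross-attention block per layer are assumptions, and embedding/AdaLN parameters are ignored):
```python
layers, heads, head_dim = 48, 48, 128
d = heads * head_dim                  # hidden size = 6144

self_attn = 4 * d * d                 # Q, K, V and output projections
cross_attn = 4 * d * d                # assumed cross-attention for text conditioning
mlp = 2 * 4 * d * d                   # assumed 4x expansion: up and down projections

total = layers * (self_attn + cross_attn + mlp)
print(f"hidden size {d}, approx. {total / 1e9:.1f}B parameters")  # ~29B, on the order of the reported 30B
```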
### 2.3. Video-DPO
In Step-Video-T2V, we incorporate human feedback through Direct Preference Optimization (DPO) to further enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring that the generated content aligns more closely with human expectations. The overall DPO pipeline is shown below, highlighting its critical role in improving both the consistency and quality of the video generation process.
<p align="center">
<img width="100%" src="assets/dpo_pipeline.png">
</p>
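For reference, the standard DPO preference loss that Video-DPO builds on can be sketched as follows (a generic formulation, not the report's exact video-adapted objective; the log-likelihood inputs and the β value are assumptions):
```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss over (preferred, rejected) pairs.

    logp_*: log-likelihood of the preferred / rejected sample under the model
    being fine-tuned; ref_logp_*: the same under a frozen reference model.
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()
```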
## 3. Model Download
| Models | 🤗Huggingface | 🤖Modelscope |
|:-------:|:-------:|:-------:|
| Step-Video-T2V | [download](https://huggingface.co/stepfun-ai/stepvideo-t2v) | [download](https://www.modelscope.cn/models/stepfun-ai/stepvideo-t2v)
| Step-Video-T2V-Turbo (Inference Step Distillation) | [download](https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo) | [download](https://www.modelscope.cn/models/stepfun-ai/stepvideo-t2v-turbo)
## 4. Model Usage
### 📜 4.1 Requirements
The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:
| Model | height/width/frame | Peak GPU Memory | 50 steps w flash-attn | 50 steps w/o flash-attn |
|:------------:|:------------:|:------------:|:------------:|:------------:|
| Step-Video-T2V | 768px768px204f | 78.55 GB | 860 s | 1437 s |
| Step-Video-T2V | 544px992px204f | 77.64 GB | 743 s | 1232 s |
| Step-Video-T2V | 544px992px136f | 72.48 GB | 408 s | 605 s |
* An NVIDIA GPU with CUDA support is required.
* The model is tested on four GPUs.
* **Recommended**: We recommend using GPUs with 80GB of memory for better generation quality.
* Tested operating system: Linux
* The self-attention in the text encoder (step_llm) only supports CUDA compute capabilities sm_80, sm_86, and sm_90; a quick check is sketched below.
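A quick sanity check of the installed GPUs against these requirements might look like this (a small helper sketch, not part of the repository):
```python
import torch

def check_gpus(min_mem_gb=80, supported_cc=((8, 0), (8, 6), (9, 0))):
    """Report each GPU's memory and compute capability against the stated requirements."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available; an NVIDIA GPU with CUDA support is required.")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        mem_gb = props.total_memory / 1024**3
        cc = (props.major, props.minor)
        print(f"cuda:{i} {props.name}: {mem_gb:.1f} GB, sm_{props.major}{props.minor}")
        if mem_gb < min_mem_gb:
            print(f"  warning: below the recommended {min_mem_gb} GB of memory")
        if cc not in supported_cc:
            print("  warning: step_llm self-attention requires sm_80 / sm_86 / sm_90")

check_gpus()
```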
### 🔧 4.2 Dependencies and Installation
- Python >= 3.10.0 (we recommend [Anaconda](https://www.anaconda.com/download/#linux) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html))
- [PyTorch >= 2.3-cu121](https://pytorch.org/)
- [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
- [FFmpeg](https://www.ffmpeg.org/)
```bash
git clone https://github.com/stepfun-ai/Step-Video-T2V.git
conda create -n stepvideo python=3.10
conda activate stepvideo
cd Step-Video-T2V
pip install -e .
pip install flash-attn --no-build-isolation ## flash-attn is optional
```
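After installation, a quick import check can confirm the environment (a minimal sketch; flash-attn is optional, and inference falls back to slower attention without it, as the timing table in 4.1 shows):
```python
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # optional acceleration
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")
```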
### 🚀 4.3 Inference Scripts
#### Multi-GPU Parallel Deployment
- We employed a decoupling strategy for the text encoder, VAE decoding, and DiT to optimize GPU resource utilization by DiT. As a result, a dedicated GPU is needed to handle the API services for the text encoder's embeddings and VAE decoding.
```bash
python api/call_remote_server.py --model_dir where_you_download_dir & ## We assume you have more than 4 GPUs available. This command will return the URL for both the caption API and the VAE API. Please use the returned URL in the following command.
parallel=4 # or parallel=8
url='127.0.0.1'
model_dir=where_you_download_dir
tp_degree=2
ulysses_degree=2
# make sure tp_degree x ulysses_degree = parallel
torchrun --nproc_per_node $parallel run_parallel.py --model_dir $model_dir --vae_url $url --caption_url $url --ulysses_degree $ulysses_degree --tensor_parallel_degree $tp_degree --prompt "一名宇航员在月球上发现一块石碑,上面印有“stepfun”字样,闪闪发光" --infer_steps 50 --cfg_scale 9.0 --time_shift 13.0
```
#### Single-GPU Inference and Quantization
- The open-source project DiffSynth-Studio by ModelScope offers single-GPU inference and quantization support, which can significantly reduce the VRAM required. Please refer to [their examples](https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/stepvideo) for more information.
### 🚀 4.4 Best-of-Practice Inference settings
Step-Video-T2V exhibits robust performance in inference settings, consistently generating high-fidelity and dynamic videos. However, our experiments reveal that variations in inference hyperparameters can have a substantial effect on the trade-off between video fidelity and dynamics. To achieve optimal results, we recommend the following best practices for tuning inference parameters:
| Models | infer_steps | cfg_scale | time_shift | num_frames |
|:-------:|:-------:|:-------:|:-------:|:-------:|
| Step-Video-T2V | 30-50 | 9.0 | 13.0 | 204
| Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |
For more performance results, please refer to the [benchmark metrics](https://github.com/xdit-project/xDiT/blob/main/docs/performance/stepvideo.md) from the xDiT team.
## 5. Benchmark
We are releasing [Step-Video-T2V Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval) as a new benchmark, featuring 128 Chinese prompts sourced from real users. This benchmark is designed to evaluate the quality of generated videos across 11 distinct categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style.
## 6. Online Engine
The online version of Step-Video-T2V is available on [跃问视频](https://yuewen.cn/videos), where you can also explore some impressive examples.
## 7. Citation
```
@misc{ma2025stepvideot2vtechnicalreportpractice,
title={Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model},
author={Guoqing Ma and Haoyang Huang and Kun Yan and Liangyu Chen and Nan Duan and Shengming Yin and Changyi Wan and Ranchen Ming and Xiaoniu Song and Xing Chen and Yu Zhou and Deshan Sun and Deyu Zhou and Jian Zhou and Kaijun Tan and Kang An and Mei Chen and Wei Ji and Qiling Wu and Wen Sun and Xin Han and Yanan Wei and Zheng Ge and Aojie Li and Bin Wang and Bizhu Huang and Bo Wang and Brian Li and Changxing Miao and Chen Xu and Chenfei Wu and Chenguang Yu and Dapeng Shi and Dingyuan Hu and Enle Liu and Gang Yu and Ge Yang and Guanzhe Huang and Gulin Yan and Haiyang Feng and Hao Nie and Haonan Jia and Hanpeng Hu and Hanqi Chen and Haolong Yan and Heng Wang and Hongcheng Guo and Huilin Xiong and Huixin Xiong and Jiahao Gong and Jianchang Wu and Jiaoren Wu and Jie Wu and Jie Yang and Jiashuai Liu and Jiashuo Li and Jingyang Zhang and Junjing Guo and Junzhe Lin and Kaixiang Li and Lei Liu and Lei Xia and Liang Zhao and Liguo Tan and Liwen Huang and Liying Shi and Ming Li and Mingliang Li and Muhua Cheng and Na Wang and Qiaohui Chen and Qinglin He and Qiuyan Liang and Quan Sun and Ran Sun and Rui Wang and Shaoliang Pang and Shiliang Yang and Sitong Liu and Siqi Liu and Shuli Gao and Tiancheng Cao and Tianyu Wang and Weipeng Ming and Wenqing He and Xu Zhao and Xuelin Zhang and Xianfang Zeng and Xiaojia Liu and Xuan Yang and Yaqi Dai and Yanbo Yu and Yang Li and Yineng Deng and Yingming Wang and Yilei Wang and Yuanwei Lu and Yu Chen and Yu Luo and Yuchu Luo and Yuhe Yin and Yuheng Feng and Yuxiang Yang and Zecheng Tang and Zekai Zhang and Zidong Yang and Binxing Jiao and Jiansheng Chen and Jing Li and Shuchang Zhou and Xiangyu Zhang and Xinhao Zhang and Yibo Zhu and Heung-Yeung Shum and Daxin Jiang},
year={2025},
eprint={2502.10248},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.10248},
}
```
## 8. Acknowledgement
- We would like to express our sincere thanks to the [xDiT](https://github.com/xdit-project/xDiT) team for their invaluable support and parallelization strategy.
- Our code will be integrated into the official repository of [Huggingface/Diffusers](https://github.com/huggingface/diffusers).
- We thank the [FastVideo](https://github.com/hao-ai-lab/FastVideo) team for their continued collaboration and look forward to launching inference acceleration solutions together in the near future.
import torch
import os
from flask import Flask, Response, jsonify, request, Blueprint
from flask_restful import Api, Resource
import pickle
import argparse
import threading

# The API server binds to the last visible GPU, leaving the remaining GPUs free for DiT inference.
device = f'cuda:{torch.cuda.device_count()-1}'
torch.cuda.set_device(device)
dtype = torch.bfloat16


def parsed_args():
    parser = argparse.ArgumentParser(description="StepVideo API Functions")
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--clip_dir', type=str, default='hunyuan_clip')
    parser.add_argument('--llm_dir', type=str, default='step_llm')
    parser.add_argument('--vae_dir', type=str, default='vae')
    parser.add_argument('--port', type=int, default=8080)
    args = parser.parse_args()
    return args
class StepVaePipeline(Resource):
    def __init__(self, vae_dir, version=2):
        self.vae = self.build_vae(vae_dir, version)
        self.scale_factor = 1.0

    def build_vae(self, vae_dir, version=2):
        from stepvideo.vae.vae import AutoencoderKL
        (model_name, z_channels) = ("vae_v2.safetensors", 64) if version == 2 else ("vae.safetensors", 16)
        model_path = os.path.join(vae_dir, model_name)
        model = AutoencoderKL(
            z_channels=z_channels,
            model_path=model_path,
            version=version,
        ).to(dtype).to(device).eval()
        print("Initialized vae...")
        return model

    def decode(self, samples, *args, **kwargs):
        with torch.no_grad():
            try:
                dtype = next(self.vae.parameters()).dtype
                device = next(self.vae.parameters()).device
                samples = self.vae.decode(samples.to(dtype).to(device) / self.scale_factor)
                if hasattr(samples, 'sample'):
                    samples = samples.sample
                return samples
            except Exception:
                # Decoding can run out of memory for long or high-resolution videos; free the cache and signal failure.
                torch.cuda.empty_cache()
                return None
lock = threading.Lock()

class VAEapi(Resource):
    def __init__(self, vae_pipeline):
        self.vae_pipeline = vae_pipeline

    def get(self):
        # Serialize requests so that only one decode runs on the GPU at a time.
        with lock:
            try:
                feature = pickle.loads(request.get_data())
                feature['api'] = 'vae'
                feature = {k: v for k, v in feature.items() if v is not None}
                video_latents = self.vae_pipeline.decode(**feature)
                response = pickle.dumps(video_latents)
            except Exception as e:
                print("Caught Exception: ", e)
                return Response(str(e))
            return Response(response)
class CaptionPipeline(Resource):
    def __init__(self, llm_dir, clip_dir):
        self.text_encoder = self.build_llm(llm_dir)
        self.clip = self.build_clip(clip_dir)

    def build_llm(self, model_dir):
        from stepvideo.text_encoder.stepllm import STEP1TextEncoder
        text_encoder = STEP1TextEncoder(model_dir, max_length=320).to(dtype).to(device).eval()
        print("Initialized text encoder...")
        return text_encoder

    def build_clip(self, model_dir):
        from stepvideo.text_encoder.clip import HunyuanClip
        clip = HunyuanClip(model_dir, max_length=77).to(device).eval()
        print("Initialized clip encoder...")
        return clip

    def embedding(self, prompts, *args, **kwargs):
        with torch.no_grad():
            try:
                y, y_mask = self.text_encoder(prompts)
                clip_embedding, _ = self.clip(prompts)
                len_clip = clip_embedding.shape[1]
                y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)  ## pad attention_mask with clip's length
                data = {
                    'y': y.detach().cpu(),
                    'y_mask': y_mask.detach().cpu(),
                    'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
                }
                return data
            except Exception as err:
                print(f"{err}")
                return None
class Captionapi(Resource):
    def __init__(self, caption_pipeline):
        self.caption_pipeline = caption_pipeline

    def get(self):
        # Share the same lock as the VAE endpoint so GPU work is serialized.
        with lock:
            try:
                feature = pickle.loads(request.get_data())
                feature['api'] = 'caption'
                feature = {k: v for k, v in feature.items() if v is not None}
                embeddings = self.caption_pipeline.embedding(**feature)
                response = pickle.dumps(embeddings)
            except Exception as e:
                print("Caught Exception: ", e)
                return Response(str(e))
            return Response(response)
class RemoteServer(object):
    def __init__(self, args) -> None:
        self.app = Flask(__name__)
        root = Blueprint("root", __name__)
        self.app.register_blueprint(root)
        api = Api(self.app)

        self.vae_pipeline = StepVaePipeline(
            vae_dir=os.path.join(args.model_dir, args.vae_dir)
        )
        api.add_resource(
            VAEapi,
            "/vae-api",
            resource_class_args=[self.vae_pipeline],
        )

        self.caption_pipeline = CaptionPipeline(
            llm_dir=os.path.join(args.model_dir, args.llm_dir),
            clip_dir=os.path.join(args.model_dir, args.clip_dir)
        )
        api.add_resource(
            Captionapi,
            "/caption-api",
            resource_class_args=[self.caption_pipeline],
        )

    def run(self, host="0.0.0.0", port=8080):
        self.app.run(host, port=port, threaded=True, debug=False)


if __name__ == "__main__":
    args = parsed_args()
    flask_server = RemoteServer(args)
    flask_server.run(host="0.0.0.0", port=args.port)
from stepvideo.diffusion.video_pipeline import StepVideoPipeline
import torch.distributed as dist
import torch
from stepvideo.config import parse_args
from stepvideo.utils import setup_seed
from stepvideo.parallel import initialize_parall_group, get_parallel_group


def load_bmk_prompt(path):
    prompts = []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            prompts.append(line.strip())
    return prompts


if __name__ == "__main__":
    args = parse_args()

    initialize_parall_group(ring_degree=args.ring_degree, ulysses_degree=args.ulysses_degree)

    local_rank = get_parallel_group().local_rank
    device = torch.device(f"cuda:{local_rank}")

    setup_seed(args.seed)

    pipeline = StepVideoPipeline.from_pretrained(args.model_dir).to(dtype=torch.bfloat16, device=device)
    pipeline.setup_api(
        vae_url=args.vae_url,
        caption_url=args.caption_url,
    )

    prompts = load_bmk_prompt('benchmark/Step-Video-T2V-Eval')

    for prompt in prompts:
        videos = pipeline(
            prompt=prompt,
            num_frames=args.num_frames,
            height=args.height,
            width=args.width,
            num_inference_steps=args.infer_steps,
            guidance_scale=args.cfg_scale,
            time_shift=args.time_shift,
            pos_magic=args.pos_magic,
            neg_magic=args.neg_magic,
            output_file_name=prompt[:50]
        )

    dist.destroy_process_group()
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
#!/bin/bash
# Overwrite the installed xfuser config/envs with the modified versions shipped in this repo
cp modified/config.py /usr/local/lib/python3.10/site-packages/xfuser/config/
cp modified/envs.py /usr/local/lib/python3.10/site-packages/xfuser/
# Unique model identifier
modelCode=1535
# Model name
modelName=Step-Video-T2V_pytorch
# Model description
modelDescription=Step-Video-T2V is a state-of-the-art text-to-video pre-trained model with 30 billion parameters, capable of generating videos up to 204 frames long.
# Application scenarios
appScenario=Inference, video generation, film and TV, e-commerce, education, broadcast media
# Framework type
frameType=Pytorch