".github/vscode:/vscode.git/clone" did not exist on "39fd89308c0bbe26311db583cf9729f81ffa9a94"
Commit 463544a1 authored by luopl's avatar luopl
Browse files

Initial commit

parents
Pipeline #2694 failed with stages
in 0 seconds
# Step1X-Edit
## Paper
`
Step1X-Edit: A Practical Framework for General Image Editing
`
- https://arxiv.org/abs/2504.17761
## Model Architecture
Step1X-Edit combines a multimodal large language model (MLLM) with a diffusion image decoder: it processes the reference image and the user's editing instruction and extracts a latent embedding from which the target image is obtained.
<div align=center>
<img src="./assets/frame_work.png"/>
</div>
## Algorithm
The core of Step1X-Edit is the combination of a multimodal large language model (MLLM) with a diffusion Transformer (DiT) architecture. Specifically:
- The input editing instruction and reference image are first processed by the MLLM (e.g. Qwen-VL), producing token embeddings directly aligned with the editing task.
- The extracted embeddings are then fed into a lightweight connector module (a token refiner), which re-structures them into a more compact text-feature representation.
- In addition, the mean of all of Qwen's output embeddings is projected through a linear layer to produce a global visual-guidance vector, strengthening the model's semantic understanding (a sketch follows this list).
- During training, a joint learning setup optimizes the connector module and the downstream DiT together, with initial weights taken from pretrained Qwen and DiT text-to-image models.
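The connector described above can be summarized in a short sketch (illustrative only: the module names, depths, and dimensions below are assumptions, not the released implementation):
```python
import torch
import torch.nn as nn

class ConnectorSketch(nn.Module):
    """Hypothetical sketch of the MLLM-to-DiT connector described above."""

    def __init__(self, mllm_dim=3584, dit_txt_dim=4096, guidance_dim=768):
        super().__init__()
        # Token refiner: re-structures the MLLM token embeddings into a
        # more compact text-feature representation for the DiT.
        self.token_refiner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=mllm_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        self.to_dit_text = nn.Linear(mllm_dim, dit_txt_dim)
        # Linear projection that turns the mean of all MLLM output
        # embeddings into a global visual-guidance vector.
        self.global_proj = nn.Linear(mllm_dim, guidance_dim)

    def forward(self, mllm_tokens: torch.Tensor):
        # mllm_tokens: (batch, seq_len, mllm_dim), e.g. from Qwen2.5-VL
        txt = self.to_dit_text(self.token_refiner(mllm_tokens))
        global_vec = self.global_proj(mllm_tokens.mean(dim=1))
        return txt, global_vec  # conditioning inputs for the DiT
```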
## Environment Setup
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-ubuntu22.04-dtk24.04.3-py3.10
# Replace <your IMAGE ID> with the image ID of the docker image pulled above
docker run -it --shm-size=64G -v $PWD/Step1X-Edit_pytorch:/home/Step1X-Edit_pytorch -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name step1x_edit <your IMAGE ID> bash
cd /home/Step1X-Edit_pytorch
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
wget --content-disposition 'https://download.sourcefind.cn:65024/directlink/4/triton/DAS1.3/triton-2.1.0+das.opt1.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl'
pip install triton-2.1.0+das.opt1.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl
```
### Dockerfile (Method 2)
```
cd /home/Step1X-Edit_pytorch/docker
docker build --no-cache -t step1x_edit:latest .
docker run --shm-size=64G --name step1x_edit -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/Step1X-Edit_pytorch:/home/Step1X-Edit_pytorch -it step1x_edit bash
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
wget --content-disposition 'https://download.sourcefind.cn:65024/directlink/4/triton/DAS1.3/triton-2.1.0+das.opt1.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl'
pip install triton-2.1.0+das.opt1.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl
```
### Anaconda (Method 3)
1. The DCU-specific deep-learning libraries required by this project can be downloaded from the Photosynthesis Developer Community (光合开发者社区):
- https://developer.hpccube.com/tool/
```
DTK驱动:dtk24.04.3
python:python3.10
torch:2.3.0
torchvision:0.18.1
triton:2.1.0
flash-attn:2.6.1
```
`Tips: the DTK driver, python, torch, and other DCU-related tool versions listed above must correspond to one another exactly.`
2. Install the other, non-specialized libraries according to requirements.txt:
```
cd /home/Step1X-Edit_pytorch
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
wget --content-disposition 'https://download.sourcefind.cn:65024/directlink/4/triton/DAS1.3/triton-2.1.0+das.opt1.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl'
pip install triton-2.1.0+das.opt1.dtk24043-cp310-cp310-manylinux_2_28_x86_64.whl
```
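After installation, the following minimal check (the expected versions come from the table above; DAS builds may append local version suffixes) confirms that the environment lines up:
```python
# Sanity-check that installed versions match the table above.
import torch
import torchvision
import triton

print("torch:", torch.__version__)              # expect 2.3.0 (+das/dtk suffix)
print("torchvision:", torchvision.__version__)  # expect 0.18.1
print("triton:", triton.__version__)            # expect 2.1.0 (+das.opt1 suffix)
# On DCU, the ROCm/HIP backend is exposed through the torch.cuda namespace.
print("device available:", torch.cuda.is_available())
```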
## Dataset
`None`
## Training
`None`
## Inference
Directory structure for the pretrained weights:
```
/home/Step1X-Edit_pytorch
├── Qwen/Qwen2.5-VL-7B-Instruct
└── meimeilook/Step1X-Edit-FP8
```
### Single Node, Single GPU
```
bash scripts/run_examples.sh
```
For more details, see the upstream project's [`README_orgin`](./README_orgin.md).
## Results
Example image-editing results:
Input:
- `prompt:给这个女生的脖子上戴一个带有红宝石的吊坠。` ("Put a pendant with a ruby around this girl's neck.")
<div align=center>
<img src="./examples/0000.jpg"/>
</div>
- `prompt:让她哭。` ("Make her cry.")
<div align=center>
<img src="./examples/0001.png"/>
</div>
Output:
<div align=center>
<img src="./assets/0000.jpg"/>
</div>
<div align=center>
<img src="./assets/0001.png"/>
</div>
### Accuracy
`None`
## Application Scenarios
### Algorithm Category
`Multimodal`
### Key Application Industries
`Painting, animation, media, manufacturing, broadcast media, home furnishing, education`
## Pretrained Weights
The Hugging Face weights can be downloaded from:
- [meimeilook/Step1X-Edit-FP8](https://huggingface.co/meimeilook/Step1X-Edit-FP8)
- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
`Note: downloading via a mirror is recommended: export HF_ENDPOINT=https://hf-mirror.com`
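For example, both weight repositories can be fetched through the mirror with `huggingface_hub` (a sketch; the `local_dir` values mirror the directory structure shown in the Inference section above):
```python
import os

# Point huggingface_hub at the mirror before importing it.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen2.5-VL-7B-Instruct",
                  local_dir="Qwen/Qwen2.5-VL-7B-Instruct")
snapshot_download(repo_id="meimeilook/Step1X-Edit-FP8",
                  local_dir="meimeilook/Step1X-Edit-FP8")
```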
## Source Repository & Issue Feedback
- http://developer.sourcefind.cn/codes/modelzoo/Step1X-Edit_pytorch.git
## References
- https://github.com/stepfun-ai/Step1X-Edit
<div align="center">
<img src="assets/logo.png" height=100>
</div>
<div align="center">
<a href="https://step1x-edit.github.io/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Web&color=green"></a> &ensp;
<a href="https://arxiv.org/abs/2504.17761"><img src="https://img.shields.io/static/v1?label=Tech%20Report&message=Arxiv&color=red"></a> &ensp;
<a href="https://www.modelscope.cn/models/stepfun-ai/Step1X-Edit"><img src="https://img.shields.io/static/v1?label=Model&message=ModelScope&color=blue"></a> &ensp;
<a href="https://discord.gg/j3qzuAyn"><img src="https://img.shields.io/static/v1?label=Discord%20Channel&message=Discord&color=purple"></a> &ensp;
<a href="https://huggingface.co/stepfun-ai/Step1X-Edit"><img src="https://img.shields.io/static/v1?label=Model&message=HuggingFace&color=yellow"></a> &ensp;
<a href="https://huggingface.co/spaces/stepfun-ai/Step1X-Edit"><img src="https://img.shields.io/static/v1?label=Online%20Demo&message=HuggingFace&color=yellow"></a> &ensp;
<a href="https://huggingface.co/datasets/stepfun-ai/GEdit-Bench"><img src="https://img.shields.io/static/v1?label=GEdit-Bench&message=HuggingFace&color=yellow"></a> &ensp;
[![Run on Replicate](https://replicate.com/zsxkib/step1x-edit/badge)](https://replicate.com/zsxkib/step1x-edit) &ensp;
</div>
## 🔥🔥🔥 News!!
* Apr 30, 2025: 🎉 The Step1X-Edit ComfyUI plugin is now available, thanks to community contributions! [quank123wip/ComfyUI-Step1X-Edit](https://github.com/quank123wip/ComfyUI-Step1X-Edit) & [raykindle/ComfyUI_Step1X-Edit](https://github.com/raykindle/ComfyUI_Step1X-Edit).
* Apr 27, 2025: 🎉 With community support, we updated the inference code and model weights of Step1X-Edit-FP8. [meimeilook/Step1X-Edit-FP8](https://huggingface.co/meimeilook/Step1X-Edit-FP8) & [rkfg/Step1X-Edit-FP8](https://huggingface.co/rkfg/Step1X-Edit-FP8).
* Apr 26, 2025: 🎉 Step1X-Edit is now live — you can try editing images directly in the online demo! [Online Demo](https://huggingface.co/spaces/stepfun-ai/Step1X-Edit)
* Apr 25, 2025: 👋 We release the evaluation code and benchmark data of Step1X-Edit. [Download GEdit-Bench](https://huggingface.co/datasets/stepfun-ai/GEdit-Bench)
* Apr 25, 2025: 👋 We release the inference code and model weights of Step1X-Edit. [ModelScope](https://www.modelscope.cn/models/stepfun-ai/Step1X-Edit) & [HuggingFace](https://huggingface.co/stepfun-ai/Step1X-Edit) models.
* Apr 25, 2025: 👋 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2504.17761)
<!-- ## Image Edit Demos -->
<div align="center">
<img width="720" alt="demo" src="assets/image_edit_demo.gif">
<p><b>Step1X-Edit:</b> a unified image editing model that performs impressively on various genuine user instructions.</p>
</div>
## 🧩 Community Contributions
If you develop or use Step1X-Edit in your projects, please let us know 🎉.
- FP8 model weights: [meimeilook/Step1X-Edit-FP8](https://huggingface.co/meimeilook/Step1X-Edit-FP8) by [meimeilook](https://huggingface.co/meimeilook); [rkfg/Step1X-Edit-FP8](https://huggingface.co/rkfg/Step1X-Edit-FP8) by [rkfg](https://huggingface.co/rkfg)
- Step1X-Edit ComfyUI Plugin: [quank123wip/ComfyUI-Step1X-Edit](https://github.com/quank123wip/ComfyUI-Step1X-Edit) by [quank123wip](https://github.com/quank123wip); [raykindle/ComfyUI_Step1X-Edit](https://github.com/raykindle/ComfyUI_Step1X-Edit) by [raykindle](https://github.com/raykindle)
## 📑 Open-source Plan
- [x] Inference & Checkpoints
- [x] Online demo (Gradio)
- [ ] Fine-tuning scripts
- [ ] Diffusers
- [ ] Multi-GPU sequence-parallel inference
- [x] FP8 quantized weights
- [x] ComfyUI
## 1. Introduction
We introduce a state-of-the-art image editing model, **Step1X-Edit**, which aims to deliver performance comparable to closed-source models such as GPT-4o and Gemini 2 Flash.
More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we built a data generation pipeline to produce a high-quality dataset.
For evaluation, we developed GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making a significant contribution to the field of image editing.
For more details, please refer to our [technical report](https://arxiv.org/abs/2504.17761).
## 2. Model Usage
### 2.1 Requirements
The following table shows the requirements for running the Step1X-Edit model (batch size = 1, w/o CFG distillation) to edit images:
| Model | Peak GPU Memory (512 / 768 / 1024) | 28 steps w/ flash-attn (512 / 768 / 1024) |
|:------------:|:------------:|:------------:|
| Step1X-Edit | 42.5GB / 46.5GB / 49.8GB | 5s / 11s / 22s |
| Step1X-Edit-FP8 | 31GB / 31.5GB / 34GB | 6.8s / 13.5s / 25s |
| Step1X-Edit + offload | 25.9GB / 27.3GB / 29.1GB | 49.6s / 54.1s / 63.2s |
| Step1X-Edit-FP8 + offload | 18GB / 18GB / 18GB | 35s / 40s / 51s |
* The model was tested on a single H800 GPU.
* We recommend using GPUs with 80GB of memory for better generation quality and efficiency.
* The Step1X-Edit-FP8 model we tested comes from [meimeilook/Step1X-Edit-FP8](https://huggingface.co/meimeilook/Step1X-Edit-FP8).
### 2.2 Dependencies and Installation
Use python >= 3.10.0 and install [torch](https://pytorch.org/get-started/locally/) >= 2.2 with the CUDA toolkit and the corresponding torchvision. We tested our model with torch==2.3.1 and torch==2.5.1 under cuda-12.1.
Install requirements:
``` bash
pip install -r requirements.txt
```
Install [`flash-attn`](https://github.com/Dao-AILab/flash-attention); here we provide a script to help you find the pre-built wheel suitable for your system.
```bash
python scripts/get_flash_attn.py
```
The script will generate a wheel name like `flash_attn-2.7.2.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl`, which can be found on [the release page of flash-attn](https://github.com/Dao-AILab/flash-attention/releases).
Then you can download the corresponding pre-built wheel and install it following the instructions in [`flash-attn`](https://github.com/Dao-AILab/flash-attention).
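If the helper script is unavailable, the wheel name can also be assembled by hand. A rough sketch of the naming convention, inferred from the example name above (not an official resolver):
```python
import sys
import torch

# Compose a flash-attn wheel name like the example above. This is a
# best-effort sketch of the release naming convention, not an official tool.
flash_ver = "2.7.2.post1"  # the flash-attn release you want
cuda_major = (torch.version.cuda or "12.1").split(".")[0]       # e.g. "12"
torch_ver = ".".join(torch.__version__.split(".")[:2])          # e.g. "2.5"
cxx11abi = str(torch._C._GLIBCXX_USE_CXX11_ABI).upper()         # "TRUE"/"FALSE"
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"  # e.g. "cp310"

wheel = (f"flash_attn-{flash_ver}+cu{cuda_major}torch{torch_ver}"
         f"cxx11abi{cxx11abi}-{py_tag}-{py_tag}-linux_x86_64.whl")
print(wheel)
```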
### 2.3 Inference Scripts
After downloading the [model weights](https://huggingface.co/stepfun-ai/Step1X-Edit), you can use the following scripts to edit images:
```
bash scripts/run_examples.sh
```
The default script runs the inference code with non-quantized weights. If you want to reduce GPU memory usage, you can either 1) download the FP8 weights and set the `--quantized` flag in the script, or 2) set the `--offload` flag in the script to offload some modules to the CPU.
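The idea behind `--offload` can be sketched in a few lines (a generic illustration of the CPU-offload pattern, not this repository's actual implementation):
```python
import torch

def run_offloaded(module: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Generic CPU-offload pattern: keep the module on the CPU, move it to
    the GPU only for its forward pass, then move it back to free VRAM."""
    module.to("cuda")
    try:
        with torch.no_grad():
            out = module(x.to("cuda"))
    finally:
        module.to("cpu")  # release GPU memory for the next module
        torch.cuda.empty_cache()
    return out
```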
This default script runs the inference code on example inputs. The results will look like:
<div align="center">
<img width="1080" alt="results" src="assets/results_show.png">
</div>
## 3. Benchmark
We release [GEdit-Bench](https://huggingface.co/datasets/stepfun-ai/GEdit-Bench), a new benchmark grounded in real-world usage. Carefully curated to reflect actual user editing needs and a wide range of editing scenarios, it enables more authentic and comprehensive evaluation of image editing models.
The evaluation process and related code can be found in [GEdit-Bench/EVAL.md](GEdit-Bench/EVAL.md). Partial results of the benchmark are shown below:
<div align="center">
<img width="1080" alt="results" src="assets/eval_res_en.png">
</div>
## 4. Citation
```
@article{liu2025step1x-edit,
title={Step1X-Edit: A Practical Framework for General Image Editing},
author={Shiyu Liu and Yucheng Han and Peng Xing and Fukun Yin and Rui Wang and Wei Cheng and Jiaqi Liao and Yingming Wang and Honghao Fu and Chunrui Han and Guopeng Li and Yuang Peng and Quan Sun and Jingwei Wu and Yan Cai and Zheng Ge and Ranchen Ming and Lei Xia and Xianfang Zeng and Yibo Zhu and Binxing Jiao and Xiangyu Zhang and Gang Yu and Daxin Jiang},
journal={arXiv preprint arXiv:2504.17761},
year={2025}
}
```
## 5. Acknowledgement
We would like to express our sincere thanks to the contributors of [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Qwen](https://github.com/QwenLM/Qwen2.5), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) teams, for their open research and exploration.
## 6. Disclaimer
The results produced by this image editing model are entirely determined by user input and actions. The development team and this open-source project are not responsible for any outcomes or consequences arising from its use.
## 7. LICENSE
Step1X-Edit is licensed under the Apache License 2.0. You can find the license files in the respective GitHub and Hugging Face repositories.