Commit 109f0842 authored by chenzk

v1.0
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# OmniGen2
Introduces a reflection mechanism, achieves top results across multimodal generation benchmarks, and unlocks a Doraemon-style "Anywhere Door" for AI image generation with a single click.
## Paper
`OmniGen2: Exploration to Advanced Multimodal Generation`
- https://arxiv.org/pdf/2506.18871
## Model Architecture
OmniGen2 uses a multimodal large language model (MLLM) Transformer as its foundation to process text and image inputs. Text generation uses an autoregressive language head, while image generation is handled by a dedicated diffusion module. The Transformer backbone is initialized from Qwen2.5-VL-3B.
<div align=center>
<img src="./doc/OmniGen2.png"/>
</div>
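To make the dual-pathway design concrete, here is a minimal, purely illustrative PyTorch sketch of the routing described above; the class and attribute names (`OmniGen2Sketch`, `diffusion_head`, and so on) are assumptions for exposition and do not reflect the project's actual implementation.
```python
import torch.nn as nn

class OmniGen2Sketch(nn.Module):
    """Illustrative only: a shared multimodal backbone feeding two decoding paths."""

    def __init__(self, backbone: nn.Module, lm_head: nn.Module, diffusion_head: nn.Module):
        super().__init__()
        self.backbone = backbone              # multimodal Transformer (Qwen2.5-VL-3B-initialized)
        self.lm_head = lm_head                # autoregressive head for text tokens
        self.diffusion_head = diffusion_head  # dedicated diffusion module for image synthesis

    def forward(self, tokens, images=None, generate_image=False):
        hidden = self.backbone(tokens, images)   # shared multimodal hidden states
        if generate_image:
            # Image path: condition the diffusion module on the hidden states.
            return self.diffusion_head(hidden)
        # Text path: next-token logits from the language head.
        return self.lm_head(hidden)
```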
## Algorithm
During training, most of the MLLM's parameters remain frozen to preserve its multimodal understanding ability; only the newly introduced special token "<|img|>" is updated. The diffusion module is trained from scratch, focusing first on text-to-image (T2I) generation and then adopting a mixed-task training strategy to cover multiple objectives. In the reflection training stage, all model parameters are unfrozen, allowing the model to generate reflective text descriptions and iteratively refine its image outputs.
<div align=center>
<img src="./doc/Reflection.png"/>
</div>
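As a rough illustration of this recipe, the sketch below freezes a Hugging Face-style causal LM except for the embedding row of the new special token; the helper name and the gradient-masking approach are assumptions for illustration, not the project's actual training code.
```python
import torch

def freeze_mllm_except_img_token(model, tokenizer, special_token="<|img|>"):
    """Freeze the MLLM backbone; keep only the new special-token embedding trainable."""
    img_token_id = tokenizer.convert_tokens_to_ids(special_token)

    # Freeze every parameter of the language backbone.
    for param in model.parameters():
        param.requires_grad = False

    # Re-enable gradients on the embedding matrix, then mask them so that only the
    # row belonging to "<|img|>" actually receives updates.
    embeddings = model.get_input_embeddings()
    embeddings.weight.requires_grad = True

    def keep_only_img_row(grad):
        mask = torch.zeros_like(grad)
        mask[img_token_id] = 1.0
        return grad * mask

    embeddings.weight.register_hook(keep_only_img_row)
    return model
```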
## Environment Setup
```
mv OmniGen2_pytorch OmniGen2
```
### Hardware Requirements
DCU model: K100AI; number of nodes: 1; number of cards: 1.
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04.1-py3.10
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is e50d644287fd
docker run -it --shm-size=64G -v $PWD/OmniGen2:/home/OmniGen2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name og2 <your IMAGE ID> bash
cd /home/OmniGen2
pip install -r requirements.txt # requirements.txt
```
### Dockerfile (Option 2)
```
cd /home/OmniGen2/docker
docker build --no-cache -t og2:latest .
docker run --shm-size=64G --name og2 -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../OmniGen2:/home/OmniGen2 -it og2 bash
# If the pip installs in the Dockerfile make the build take too long, you can comment them out and install the Python packages after starting the container: pip install -r requirements.txt
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the 光合 (SourceFind) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver: 25.04.1
python: 3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
onnxruntime: 1.19.2
```
The DCU models supported by each deep learning library can be checked here: [DAS downloads](https://das.sourcefind.cn:55011/portal/#/home)
`Tip: the DTK driver, Python, torch, and other DCU-related tool versions listed above must correspond exactly to one another.`
2. Install the remaining, non-DCU-specific libraries from requirements.txt:
```
cd /home/OmniGen2
pip install -r requirements.txt # requirements.txt
```
## Dataset
`None`
## Training
`None`
## Inference
Pretrained weight directory layout:
```
/home/OmniGen2/
└── OmniGen2/OmniGen2
```
### Single Node, Single Card
```
export HIP_VISIBLE_DEVICES=0
cd /home/OmniGen2
# Visual Understanding
bash example_understanding.sh # inference scripts for the other tasks are listed in run_example.sh
```
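For orientation, here is a hedged, minimal Python sketch of loading the local weights shown above, assuming a diffusers-style `from_pretrained` interface on the project's `OmniGen2Pipeline` class; the exact call signature may differ, and the shipped `example_*.sh` / `run_example.sh` scripts remain the authoritative entry points.
```python
# Hypothetical minimal text-to-image call; argument names may differ from the real pipeline.
import torch
from omnigen2.pipelines.omnigen2.pipeline_omnigen2 import OmniGen2Pipeline

pipe = OmniGen2Pipeline.from_pretrained(
    "/home/OmniGen2/OmniGen2/OmniGen2",  # local weight directory from the layout above
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")  # DCU builds of PyTorch expose the device via the CUDA/HIP API

image = pipe(
    prompt="A plush toy bear sitting on green grass, wearing a blue bow",
    negative_prompt="blurry, low quality, watermark",
).images[0]
image.save("t2i_example.png")
```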
For more details, refer to the upstream project's [`README_origin`](./README_origin.md).
## Result
`Input:`
```
instruction: "Please describe this image briefly."
input_image_path: example_images/02.jpg
```
<div align=center>
<img src="./doc/02.png"/>
</div>
`Output:`
```
Text: The image shows a plush toy bear sitting on a grassy surface. The bear has a brown body with white paws and a white muzzle. It is wearing a blue bow on its head and a white bib with the text "Get Well" written on it. The background consists of green grass with some clover leaves visible.
```
Other official demonstration examples:
<div align=center>
<img src="./doc/replace.png"/>
</div>
### Accuracy
DCU results are consistent with GPU results; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Multimodal`
### Key Application Industries
`Manufacturing, media, finance, energy, healthcare, smart home, education`
## Pretrained Weights
Hugging Face download: [OmniGen2/OmniGen2](https://huggingface.co/OmniGen2/OmniGen2)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/OmniGen2_pytorch.git
## References
- https://github.com/VectorSpaceLab/OmniGen2.git
<p align="center">
<img src="assets/brand.png" width="65%">
</p>
<p align="center">
<a href="https://vectorspacelab.github.io/OmniGen2"><img src="https://img.shields.io/badge/Project%20Page-OmniGen2-yellow" alt="project page"></a>
<a href="https://arxiv.org/abs/2506.18871"><img src="https://img.shields.io/badge/arXiv%20paper-2506.18871-b31b1b.svg" alt="arxiv"></a>
<a href="https://github.com/VectorSpaceLab/OmniGen2?tab=readme-ov-file#-gradio-demo"><img src="https://img.shields.io/badge/Online%20Demo-🤗-blue" alt="demo"></a>
<a href="https://huggingface.co/spaces/OmniGen2/OmniGen2"><img src="https://img.shields.io/badge/HF%20Spaces-🤗-lightblue" alt="demo"></a>
<a href="https://huggingface.co/OmniGen2/OmniGen2"><img src="https://img.shields.io/badge/Model-🤗-yellow" alt="model"></a>
<a href="https://huggingface.co/datasets/OmniGen2/OmniContext"><img src="https://img.shields.io/badge/Benchmark-🤗-yellow" alt="model"></a>
<a href="https://huggingface.co/datasets/OmniGen2/X2I2"><img src="https://img.shields.io/badge/Dataset-🤗-yellow" alt="model"></a>
</p>
<h4 align="center">
<p>
<a href=#-news>News</a> |
<a href=#-quick-start>Quick Start</a> |
<a href=#-usage-tips>Usage Tips</a> |
<a href=#-limitations-and-suggestions>Limitations</a> |
<a href=#-gradio-demo>Online Demos</a> |
<a href=#%EF%B8%8F-citing-us>Citation</a>
</p>
</h4>
## 🔥 News
- **2025-07-05**: Training datasets [X2I2](https://huggingface.co/datasets/OmniGen2/X2I2) are available.
- **2025-07-03**: OmniGen2 now supports [TeaCache](https://github.com/ali-vilab/TeaCache) and [TaylorSeer](https://github.com/Shenyi-Z/TaylorSeer) for faster inference; see [Usage Tips](#-usage-tips) for details. Thanks to @legitnull for the great [TeaCache-PR](https://github.com/VectorSpaceLab/OmniGen2/pull/52) and [TaylorSeer-PR](https://github.com/VectorSpaceLab/OmniGen2/pull/76).
- **2025-07-01**: OmniGen2 is supported by [ComfyUI official](https://comfyanonymous.github.io/ComfyUI_examples/omnigen), thanks!
- **2025-06-30**: Training code is available; see [fine-tuning](docs/FINETUNE.md) for details.
- **2025-06-28**: We release [OmniContext](https://huggingface.co/datasets/OmniGen2/OmniContext) benchmark. The evaluation codes are in [omnicontext](https://github.com/VectorSpaceLab/OmniGen2/tree/main/omnicontext).
- **2025-06-24**: [Technical Report](https://arxiv.org/abs/2506.18871) is available.
- **2025-06-23**: We’ve updated our code and HF model—OmniGen2 now runs *without* `flash-attn`. Users can still install it for optimal performance.
- **2025-06-20**: Updated [resource requirements](#-resources-requirement), adding CPU offload support for devices with limited VRAM.
- **2025-06-16**: [Gradio](https://github.com/VectorSpaceLab/OmniGen2?tab=readme-ov-file#-gradio-demo) and [Jupyter](https://github.com/VectorSpaceLab/OmniGen2/blob/main/example.ipynb) demos are available. Online Gradio Demo: [Demo1](https://9c4426d27c3b9ecbed.gradio.live); [Chat-Demo1](https://0351497834a4d7226c.gradio.live); see more demo links in the [gradio section](https://github.com/VectorSpaceLab/OmniGen2?tab=readme-ov-file#-gradio-demo).
- **2025-06-16**: We release **OmniGen2**, a multimodal generation model; model weights can be accessed on [huggingface](https://huggingface.co/OmniGen2/OmniGen2) and [modelscope](https://www.modelscope.cn/models/OmniGen2/OmniGen2).
## Introduction
**OmniGen2** is a powerful and efficient generative model. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. OmniGen2 has competitive performance across four primary capabilities:
- **Visual Understanding**: Inherits the robust ability to interpret and analyze image content from its Qwen2.5-VL foundation.
- **Text-to-Image Generation**: Creates high-fidelity and aesthetically pleasing images from textual prompts.
- **Instruction-guided Image Editing**: Executes complex, instruction-based image modifications with high precision, achieving state-of-the-art performance among open-source models.
- **In-context Generation**: A versatile capability to process and flexibly combine diverse inputs—including humans, reference objects, and scenes—to produce novel and coherent visual outputs.
**Training code and datasets have been released; see the News section above.**
Some good cases of OmniGen2:
<p align="center">
<img src="assets/teaser.jpg" width="95%">
<br>
<em>Demonstrations.</em>
</p>
<p align="center">
<img src="assets/examples_edit.png" width="95%">
<br>
<em> Good demonstrations of OmniGen2's image editing capabilities.</em>
</p>
<p align="center">
<img src="assets/examples_subject.png" width="95%">
<br>
<em> Good demonstrations of OmniGen2's in-context generation capabilities.</em>
</p>
## 📌 TODO
- [x] Technical report.
- [x] Support CPU offload and improve inference efficiency.
- [x] In-context generation benchmark: **OmniContext**.
- [ ] Integration with diffusers.
- [x] Training datasets.
- [ ] Training data construction pipeline.
- [ ] ComfyUI Demo (**community support will be greatly appreciated!**).
## 🚀 Quick Start
### 🛠️ Environment Setup
#### ✅ Recommended Setup
```bash
# 1. Clone the repo
git clone git@github.com:VectorSpaceLab/OmniGen2.git
cd OmniGen2
# 2. (Optional) Create a clean Python environment
conda create -n omnigen2 python=3.11
conda activate omnigen2
# 3. Install dependencies
# 3.1 Install PyTorch (choose correct CUDA version)
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
# 3.2 Install other required packages
pip install -r requirements.txt
# Note: Version 2.7.4.post1 is specified for compatibility with CUDA 12.4.
# Feel free to use a newer version if you are on CUDA 12.6 or once this compatibility issue is fixed.
# OmniGen2 runs even without flash-attn, though we recommend installing it for best performance.
pip install flash-attn==2.7.4.post1 --no-build-isolation
```
#### 🌏 For users in Mainland China
```bash
# Install PyTorch from a domestic mirror
pip install torch==2.6.0 torchvision --index-url https://mirror.sjtu.edu.cn/pytorch-wheels/cu124
# Install other dependencies from Tsinghua mirror
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
# Note: Version 2.7.4.post1 is specified for compatibility with CUDA 12.4.
# Feel free to use a newer version if you are on CUDA 12.6 or once this compatibility issue is fixed.
# OmniGen2 runs even without flash-attn, though we recommend installing it for best performance.
pip install flash-attn==2.7.4.post1 --no-build-isolation -i https://pypi.tuna.tsinghua.edu.cn/simple
```
---
### 🧪 Run Examples
```bash
# Visual Understanding
bash example_understanding.sh
# Text-to-image generation
bash example_t2i.sh
# Instruction-guided image editing
bash example_edit.sh
# In-context generation
bash example_in_context_generation.sh
```
---
### 🌐 Gradio Demo
* **Online Demo**: [HF Spaces](https://huggingface.co/spaces/OmniGen2/OmniGen2). Beyond Hugging Face Spaces, we are *temporarily* allocating additional GPU resources to ensure smooth access to the online demos. If you notice a long queue for a particular link, please try other links:
[Demo1](https://9c4426d27c3b9ecbed.gradio.live), [Demo2](https://06574c5e62d815f799.gradio.live), [Demo3](https://e0a82fd380d2ff17ac.gradio.live), [Demo4](https://d9c4410ee48ce35051.gradio.live)
[Chat-Demo1](https://0351497834a4d7226c.gradio.live), [Chat-Demo2](https://032160099388d1d10c.gradio.live), [Chat-Demo3](https://cf9f2797e92cfa2767.gradio.live), [Chat-Demo4](https://b87b82fd14215affc2.gradio.live)
<!-- [Available on Hugging Face Spaces 🚀](https://huggingface.co/spaces/Shitao/OmniGen2) -->
* **Run Locally**:
```bash
# for only generating image
pip install gradio
python app.py
# Optional: Share demo with public link (You need to be able to access huggingface)
python app.py --share
# for generating image or text
pip install gradio
python app_chat.py
```
## 💡 Usage Tips
To achieve optimal results with OmniGen2, you can adjust the following key hyperparameters based on your specific use case; a hedged example call is sketched after this list.
- `text_guidance_scale`: Controls how strictly the output adheres to the text prompt (Classifier-Free Guidance).
- `image_guidance_scale`: This controls how much the final image should resemble the input reference image.
- **The Trade-off**: A higher value makes the output more faithful to the reference image's structure and style, but it might ignore parts of your text prompt. A lower value (~1.5) gives the text prompt more influence.
- **Tip**: For image editing tasks, we recommend setting it between 1.2 and 2.0; for in-context generation tasks, a higher image_guidance_scale preserves more detail from the input images, and we recommend setting it between 2.5 and 3.0.
- `max_pixels`: Automatically resizes images when their total pixel count (width × height) exceeds this limit, while maintaining their aspect ratio. This helps manage performance and memory usage.
- **Tip**: Default value is 1024*1024. You can reduce this value if you encounter memory issues.
- `max_input_image_side_length`: Maximum side length for input images.
- `negative_prompt`: Tell the model what you don't want to see in the image.
- **Example**: blurry, low quality, text, watermark
- **Tip**: For the best results, try experimenting with different negative prompts. If you're not sure, just use the default negative prompt.
- `enable_model_cpu_offload`: **Reduces VRAM usage by nearly 50% with a negligible impact on speed**.
- This is achieved by offloading the model weights to CPU RAM when they are not in use.
- See: [Model Offloading](https://huggingface.co/docs/diffusers/optimization/memory#model-offloading)
- `enable_sequential_cpu_offload`: Minimizes VRAM usage to less than 3GB, but at the cost of significantly slower performance.
- This works by offloading the model in submodules and loading them onto the GPU sequentially as needed.
- See: [CPU Offloading](https://huggingface.co/docs/diffusers/optimization/memory#cpu-offloading)
- `cfg_range_start`, `cfg_range_end`: Define the timestep range where CFG is applied. Per this [paper](https://arxiv.org/abs/2404.07724), reducing `cfg_range_end` can significantly decrease inference time with a negligible impact on quality.
- `scheduler`: Choose between `[euler, dpmsolver++]`. Default is `euler`. For potentially better performance with fewer steps, try `dpmsolver++`.
- `num_inference_step`: Number of discretization steps for the ODE solver. Default is `50`.
- `enable_teacache`: Whether or not to enable [TeaCache](https://github.com/ali-vilab/TeaCache) for faster inference.
- `teacache_rel_l1_thresh`: The threshold for accumulated L1 distance for the timestep embedding-modulated noisy input. It serves as an indicator of whether to cache the model output. You can modify the `teacache_rel_l1_thresh` parameter to achieve your desired trade-off between latency and visual quality. The default value of 0.05 provides approximately a **30% speedup** compared to the baseline. Increasing this value can further reduce latency, but may result in some loss of detail.
- `enable_taylorseer`: Whether or not to enable [TaylorSeer](https://github.com/Shenyi-Z/TaylorSeer) for faster inference. When enabled, inference speed can improve by up to **2X**, with negligible quality loss compared to the baseline.
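As referenced at the top of this list, here is a hedged sketch of how these hyperparameters might be wired into a pipeline call. The keyword names simply mirror the list above and are not guaranteed to match the real signature, so treat `app.py` and the `example_*.sh` scripts as canonical.
```python
# Illustrative only: keyword names mirror the hyperparameter list above and may not
# match the pipeline's actual signature; app.py / example_*.sh are the canonical usage.
from PIL import Image

img1 = Image.open("example_images/02.jpg")  # example image shipped with the repo
img2 = Image.open("example_images/03.jpg")  # hypothetical second reference image

# `pipe` is an OmniGen2Pipeline instance, loaded as in the Quick Start section.
# pipe.enable_model_cpu_offload()           # optional: roughly halves VRAM use

result = pipe(
    prompt="Edit the first image: add the man from the second image. "
           "The man is talking with a woman in the kitchen.",
    input_images=[img1, img2],              # hypothetical name for the reference images
    negative_prompt="blurry, low quality, text, watermark",
    text_guidance_scale=5.0,                # adherence to the text prompt (CFG)
    image_guidance_scale=1.8,               # 1.2-2.0 for editing, 2.5-3.0 for in-context generation
    max_pixels=1024 * 1024,                 # lower this if you run into memory issues
    cfg_range_start=0.0,
    cfg_range_end=0.6,                      # shrinking the CFG range speeds up inference
    scheduler="euler",                      # or "dpmsolver++" for fewer steps
    num_inference_step=50,
)
```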
**Some suggestions for improving generation quality:**
1. Use High-Quality Images
- Provide clear images, preferably with a resolution **greater than 512×512 pixels**.
- Small or blurry inputs will result in low-quality outputs.
2. Be Specific with Instructions
- Clearly describe both **what to change** and **how you want it changed**.
3. Prioritize English
The model currently performs best with **English** prompts.
4. Change instructions to enhance subject consistency.
When the generated image does not align well with the input image, you can try the following methods to improve subject consistency:
- **Use images with larger size, as well as images in which people occupy a larger proportion of the frame.**
- **Increase the Image Guidance Scale**, for example to 3.0. The trade-off may be slight overexposure or a greasy look in the image.
- **When using a single input image**, you can try to use the following prompt template: "she/he ..., maintaining her/his facial features, hairstyle, and other attributes."
- **Increase the "Number of images per prompt" parameter** to generate more outputs, giving you a better chance of finding one with stronger subject consistency and a more satisfactory result.
- **Longer prompts generally yield better results than shorter ones.** More detailed descriptions of the scene and character interactions can provide additional benefits.
5. For in-context editing (editing based on multiple images), we recommend using the following prompt format: "Edit the first image: add/replace (the [object] with) the [object] from the second image. [description of your target image]."
For example: "Edit the first image: add the man from the second image. The man is talking with a woman in the kitchen." The description of your target image should be as detailed as possible.
## 🎨 Fine-tune
See [fine-tuning](docs/FINETUNE.md) for details.
## ❌ Limitations and Suggestions
The current model sometimes does not follow instructions. You can increase the "Number of images per prompt" to generate multiple images at once, so you can choose the result you are satisfied with, or try different prompts. In our own experience, being as detailed as possible tends to work better.
The current model cannot decide the output image size by itself; the default size is 1024×1024. You need to set a specific size if you require a different one. When you input an image, we will set the output size to match the input image (this works best for editing tasks). If you want to modify just one image out of several, you should also set the output size to match the image you want to edit; otherwise, it may lead to low-quality outputs.
The in-context generation capability sometimes produces objects that differ from the originals. Suggested mitigations: increase `image_guidance_scale` (a value of 3 is recommended); use high-resolution inputs, increase the size of the input image, and ensure that the object of interest occupies a larger proportion of the image; and adjust the prompt. However, there is still a gap compared to GPT-4o.
Compared to OmniGen 1.0, although OmniGen 2 has made some improvements, many issues still remain. It may take multiple attempts to achieve a satisfactory result.
## 💻 Resources Requirement
OmniGen2 natively requires an **NVIDIA RTX 3090** or an equivalent GPU with approximately **17GB of VRAM**. For devices with less VRAM, you can enable **CPU Offload** to run the model.
**Performance Tip**: To improve inference speed, consider decreasing the `cfg_range_end` parameter. Within a reasonable range, this has a negligible impact on output quality.
The following table details the inference performance of OmniGen2 on an **A800 GPU**:
<p align="center">
<img src="assets/efficiency.png" width="95%">
<br>
<em>Inference Efficiency of OmniGen2.</em>
</p>
## 🤝 Community Efforts
We're honored and grateful for the support from the open-source community. Here are some unofficial implementations contributed by the community (**we have not yet confirmed that they are bug-free; please use our official demo whenever possible**):
- ComfyUI:
- [ComfyUI Official](https://comfyanonymous.github.io/ComfyUI_examples/omnigen/)
- [https://github.com/Yuan-ManX/ComfyUI-OmniGen2](https://github.com/Yuan-ManX/ComfyUI-OmniGen2)
- [https://github.com/neverbiasu/ComfyUI-OmniGen2](https://github.com/neverbiasu/ComfyUI-OmniGen2)
- Quantization:
- [DFloat11, a lossless compression using 11 bits](https://github.com/LeanModels/OmniGen2-DFloat11)
## ❤️ Citing Us
If you find this repository or our work useful, please consider giving a star ⭐ and citation 🦖, which would be greatly appreciated:
```bibtex
@article{wu2025omnigen2,
title={OmniGen2: Exploration to Advanced Multimodal Generation},
author={Chenyuan Wu and Pengfei Zheng and Ruiran Yan and Shitao Xiao and Xin Luo and Yueze Wang and Wanli Li and Xiyan Jiang and Yexin Liu and Junjie Zhou and Ze Liu and Ziyi Xia and Chaofan Li and Haoge Deng and Jiahao Wang and Kun Luo and Bo Zhang and Defu Lian and Xinlong Wang and Zhongyuan Wang and Tiejun Huang and Zheng Liu},
journal={arXiv preprint arXiv:2506.18871},
year={2025}
}
```
#!/bin/bash
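# Launch the Gradio chat demo (app_chat.py) on the GPU selected by --rank=N;
# the demo serves on port 7860+N and is shared via a public Gradio link.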
SHELL_FOLDER=$(cd "$(dirname "$0")";pwd)
cd $SHELL_FOLDER
source "$(dirname $(which conda))/../etc/profile.d/conda.sh"
conda activate py3.11+pytorch2.6+cu124
RANK=0
# Parse named arguments
while [[ $# -gt 0 ]]; do
    case "$1" in
        --rank=*)
            RANK="${1#*=}"
            shift
            ;;
        *)
            echo "Unknown argument: $1"
            shift
            ;;
    esac
done
CUDA_VISIBLE_DEVICES=${RANK} python app_chat.py \
--port $((7860 + RANK)) \
--share
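# Export a training checkpoint for inference: load the state dict into
# OmniGen2Transformer2DModel, then save LoRA fine-tunes via
# OmniGen2Pipeline.save_lora_weights or full fine-tunes via transformer.save_pretrained.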
import dotenv
dotenv.load_dotenv(override=True)
import argparse
from omegaconf import OmegaConf
import torch
from accelerate import init_empty_weights
from peft import LoraConfig
from peft.utils import get_peft_model_state_dict
from omnigen2.models.transformers.transformer_omnigen2 import OmniGen2Transformer2DModel
from omnigen2.pipelines.omnigen2.pipeline_omnigen2 import OmniGen2Pipeline
def main(args):
    config_path = args.config_path
    model_path = args.model_path
    conf = OmegaConf.load(config_path)
    arch_opt = conf.model.arch_opt
    arch_opt = OmegaConf.to_object(arch_opt)
    # Convert lists to tuples in conf.model.arch_opt
    for key, value in arch_opt.items():
        if isinstance(value, list):
            arch_opt[key] = tuple(value)
    with init_empty_weights():
        transformer = OmniGen2Transformer2DModel(**arch_opt)
    if conf.train.get('lora_ft', False):
        target_modules = ["to_k", "to_q", "to_v", "to_out.0"]
        # now we will add new LoRA weights to the transformer layers
        lora_config = LoraConfig(
            r=conf.train.lora_rank,
            lora_alpha=conf.train.lora_rank,
            lora_dropout=conf.train.lora_dropout,
            init_lora_weights="gaussian",
            target_modules=target_modules,
        )
        transformer.add_adapter(lora_config)
    state_dict = torch.load(model_path, mmap=True, weights_only=True)
    missing, unexpected = transformer.load_state_dict(
        state_dict, assign=True, strict=False
    )
    print(f"missing parameters: {missing}")
    print(f"unexpected parameters: {unexpected}")
    save_path = args.save_path
    if conf.train.get('lora_ft', False):
        transformer_lora_layers = get_peft_model_state_dict(transformer)
        OmniGen2Pipeline.save_lora_weights(
            save_directory=save_path,
            transformer_lora_layers=transformer_lora_layers,
        )
    else:
        transformer.save_pretrained(save_path)


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", type=str, required=True)
    parser.add_argument("--model_path", type=str, required=True)
    parser.add_argument("--save_path", type=str, required=True)
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args)
data:
  -
    path: "data_configs/example/edit/jsonls/0.jsonl"
    type: "edit"
    ratio: !!float 1
  -
    path: "data_configs/example/edit/jsonls/1.jsonl"
    type: "edit"
    ratio: !!float 1
{"task_type": "edit", "instruction": "add a hat to the person", "input_images": ["/path/to/your/data/edit/0.png"], "output_image": "/path/to/your/data/edit/0.png"}
{"task_type": "edit", "instruction": "add a dog behind the person", "input_images": ["/path/to/your/data/edit/1.png"], "output_image": "/path/to/your/data/edit/1.png"}
\ No newline at end of file
data:
  -
    path: "data_configs/example/ic/jsonls/0.jsonl"
    type: "ic"
    ratio: !!float 1
  -
    path: "data_configs/example/ic/jsonls/1.jsonl"
    type: "ic"
    ratio: !!float 1
{"task_type": "ic", "instruction": "A big tree is in the forest", "input_images": ["/path/to/your/data/ic/0.png", "/path/to/your/data/ic/1.png"], "output_image": "/path/to/your/data/ic/0.png"}
{"task_type": "ic", "instruction": "a dog is running on grass", "input_images": ["/path/to/your/data/ic/2.png", "/path/to/your/data/ic/3.png"], "output_image": "/path/to/your/data/ic/1.png"}
\ No newline at end of file
data:
  -
    path: 'data_configs/example/t2i/t2i.yml'
    type: 't2i'
    ratio: !!float 0.33
  -
    path: 'data_configs/example/edit/edit.yml'
    type: 'edit'
    ratio: !!float 0.33
  -
    path: 'data_configs/example/ic/ic.yml'
    type: 'ic'
    ratio: !!float 0.33