name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve CogView3 / 提交一个 Bug 问题报告来帮助我们改进 CogView3 开源模型
body:
- type: textarea
id: system-info
attributes:
label: System Info / 系统信息
description: Your operating environment / 您的运行环境信息
placeholder: Includes CUDA version, Diffusers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括 CUDA 版本、Diffusers 版本、Python 版本、操作系统、硬件信息(如果您怀疑是硬件方面的问题)...
validations:
required: true
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information / 问题信息
description: 'The problem arises when using: / 问题出现在'
options:
- label: "The official example scripts / 官方的示例脚本"
- label: "My own modified scripts / 我自己修改的脚本和任务"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction / 复现过程
description: |
Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
If you have code snippets, error messages, stack traces, please provide them here as well.
Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
如果您有代码片段、错误信息、堆栈跟踪,也请在此提供。
请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
请勿使用截图,因为截图难以阅读,而且(更重要的是)不允许他人复制粘贴您的代码。
placeholder: |
Steps to reproduce the behavior/复现Bug的步骤:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior / 期待表现
description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
name: "\U0001F680 Feature request"
description: Submit a request for a new CogView3 feature / 提交一个新的 CogView3 开源模型的功能建议
labels: [ "feature" ]
body:
- type: textarea
id: feature-request
validations:
required: true
attributes:
label: Feature request / 功能建议
description: |
A brief description of the functional proposal. Links to corresponding papers and code are desirable.
对功能建议的简述。最好提供对应的论文和代码链接。
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation / 动机
description: |
Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
您提出建议的动机。如果该动机与另一个 GitHub 问题有关,请在此处提供对应的链接。
- type: textarea
id: contribution
validations:
required: true
attributes:
label: Your contribution / 您的贡献
description: |
Your PR link or any other link you can help with.
您的PR链接或者其他您能提供帮助的链接。
# Raise valuable PR / 提出有价值的PR
## Caution / 注意事项:
Users should keep the following points in mind when submitting a PR:
1. Ensure that your code meets the requirements in the [specification](../../resources/contribute.md).
2. The proposed PR should be focused; if you have multiple ideas or optimizations, split them into separate PRs.
用户在提交PR时候应该注意以下几点:
1. 确保您的代码符合 [规范](../../resources/contribute_zh.md) 中的要求。
2. 提出的PR应该具有针对性,如果具有多个不同的想法和优化方案,应该分配到不同的PR中。
## PRs that should not be proposed / 不应该提出的PR
If a developer proposes a PR about any of the following, it may be closed or rejected.
1. PRs that do not describe the proposed improvement.
2. PRs that combine multiple issues of different types.
3. PRs that largely duplicate already existing PRs.
如果开发者提出关于以下方面的PR,则可能会被直接关闭或拒绝通过。
1. 没有说明改进方案的。
2. 多个不同类型的问题合并在一个PR中的。
3. 提出的PR与已经存在的PR高度重复的。
# Check your PR / 检查您的PR
- [ ] Have you read the Contributor Guidelines, Pull Request section? / 您是否阅读了贡献者指南、Pull Request 部分?
- [ ] Has this been discussed/approved via a Github issue or forum? If so, add a link. / 是否通过 Github 问题或论坛讨论/批准过?如果是,请添加链接。
- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips. /您是否确保根据您的更改更新了文档?这里是文档指南,这里是文档格式化技巧。
- [ ] Did you write new required tests? / 您是否编写了新的必要测试?
- [ ] Does your PR address only one issue? / 您的PR是否仅针对一个问题?
*__pycache__/
samples*/
runs/
checkpoints/
master_ip
logs/
*.DS_Store
.idea
output*
test*
cogview3-plus-3b/
*.whl
xformers/
result/
CogView3/
t5-v1_1-xxl/
CogVideoX-2b/
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2024 CogView Team@ZhipuAI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# CogView3 & CogView-3Plus
[Read this in Chinese](./README_zh.md)
<div align="center">
<img src=resources/logo.svg width="50%"/>
</div>
<p align="center">
Experience the CogView3-Plus-3B model online on <a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space" target="_blank"> 🤗 Huggingface Space</a>
</p>
<p align="center">
📚 Check out the <a href="https://arxiv.org/abs/2403.05121" target="_blank">paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/main/gdetail/65a232c082ff90a2ad2f15e2?fr=osm_cogvideox&lang=zh">Qingyan</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> for larger-scale commercial video generation models.
</p>
## Project Updates
- 🔥🔥 ```2024/10/13```: We have adapted and open-sourced the **CogView-3Plus-3B** model in the [diffusers](https://github.com/huggingface/diffusers) version. You can [experience it online](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space).
- 🔥 ```2024/9/29```: We have open-sourced **CogView3** and **CogView-3Plus-3B**. **CogView3** is a text-to-image system based on cascaded diffusion, utilizing a relay diffusion framework. **CogView-3Plus** is a series of newly developed text-to-image models based on Diffusion Transformers.
## Model Introduction
CogView-3-Plus builds upon CogView3 (ECCV'24), introducing the latest DiT framework for further overall performance
improvements. CogView-3-Plus uses Zero-SNR diffusion noise scheduling and incorporates a joint text-image attention
mechanism. Compared to the commonly used MMDiT structure, it effectively reduces training and inference costs while
maintaining the model's core capabilities. CogView-3-Plus uses a VAE with a latent dimension of 16.
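If you want to verify the latent dimension yourself, the VAE can be loaded on its own. A minimal sketch, assuming the checkpoint follows the standard `diffusers` layout with a `vae` subfolder:
```python
from diffusers import AutoencoderKL

# Load only the VAE from the CogView3-Plus-3B checkpoint and inspect its config.
vae = AutoencoderKL.from_pretrained("THUDM/CogView3-Plus-3B", subfolder="vae")
print(vae.config.latent_channels)  # expected: 16
```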
The table below shows the list of text-to-image models we currently offer along with their basic information.
<table style="border-collapse: collapse; width: 100%;">
<tr>
<th style="text-align: center;">Model Name</th>
<th style="text-align: center;">CogView3-Base-3B</th>
<th style="text-align: center;">CogView3-Base-3B-distill</th>
<th style="text-align: center;">CogView3-Plus-3B</th>
</tr>
<tr>
<td style="text-align: center;">Model Description</td>
<td style="text-align: center;">The base and relay stage models of CogView3, supporting 512x512 text-to-image generation and 2x super-resolution generation.</td>
<td style="text-align: center;">The distilled version of CogView3, with 4 and 1 step sampling in two stages (or 8 and 2 steps).</td>
<td style="text-align: center;">The DiT version image generation model, supporting image generation ranging from 512 to 2048.</td>
</tr>
<tr>
<td style="text-align: center;">Resolution</td>
<td colspan="2" style="text-align: center;">512 * 512</td>
<td style="text-align: center;">
512 <= H, W <= 2048 <br>
H * W <= 2<sup>21</sup> <br>
H, W mod 32 = 0
</td>
</tr>
<tr>
<td style="text-align: center;">Inference Precision</td>
<td colspan="2" style="text-align: center;"><b>FP16 (recommended)</b>, BF16, FP32</td>
<td style="text-align: center;"><b>BF16* (recommended)</b>, FP16, FP32</td>
</tr>
<tr>
<td style="text-align: center;">Memory Usage (bs = 4)</td>
<td style="text-align: center;"> 17G </td>
<td style="text-align: center;"> 64G </td>
<td style="text-align: center;"> 30G (2048 * 2048) <br> 20G (1024 * 1024) </td>
</tr>
<tr>
<td style="text-align: center;">Prompt Language</td>
<td colspan="3" style="text-align: center;">English*</td>
</tr>
<tr>
<td style="text-align: center;">Maximum Prompt Length</td>
<td colspan="2" style="text-align: center;">225 Tokens</td>
<td style="text-align: center;">224 Tokens</td>
</tr>
<tr>
<td style="text-align: center;">Download Link (SAT)</td>
<td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
</tr>
<tr>
<td style="text-align: center;">Download Link (Diffusers)</td>
<td colspan="2" style="text-align: center;">Not Adapted</td>
<td style="text-align: center;">
<a href="https://huggingface.co/THUDM/CogView3-Plus-3B">🤗 HuggingFace</a><br>
<a href="https://modelscope.cn/models/ZhipuAI/CogView3-Plus-3B">🤖 ModelScope</a><br>
<a href="https://wisemodel.cn/models/ZhipuAI/CogView3-Plus-3B">🟣 WiseModel</a>
</td>
</tr>
</table>
**Data Explanation**
+ All inference tests were conducted on a single A100 GPU with a batch size of 4,
using `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to save memory.
+ The models only support English input; prompts in other languages can be translated into English during LLM-based prompt refinement.
+ These tests were run with the `SAT` framework, where many optimizations are still incomplete. We are working with
  the community on a `diffusers` version of the model and will re-run the tests once the `diffusers` repository is
  supported. The release is expected in November 2024.
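The resolution constraints in the table above can be checked with a small helper. A sketch (the helper `check_resolution` is ours, not part of this repo):
```python
def check_resolution(height: int, width: int) -> bool:
    """Return True if (height, width) is valid for CogView3-Plus-3B."""
    return (
        512 <= height <= 2048
        and 512 <= width <= 2048
        and height * width <= 2**21  # at most 2,097,152 pixels in total
        and height % 32 == 0
        and width % 32 == 0
    )


print(check_resolution(1024, 1024))  # True
print(check_resolution(2048, 2048))  # False: 2048 * 2048 exceeds 2^21
```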
## Quick Start
### Prompt Optimization
Although the CogView3 series models are trained on long image descriptions, we highly recommend rewriting prompts with
a large language model (LLM) before text-to-image generation, as this significantly improves generation quality.
We provide an [example script](prompt_optimize.py) and suggest running it to refine your prompt:
```shell
python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus"
```
### Inference Model (Diffusers)
First, ensure the `diffusers` library is installed **from source**.
```shell
pip install git+https://github.com/huggingface/diffusers.git
```
Then, run the following code:
```python
from diffusers import CogView3PlusPipeline
import torch
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.float16).to("cuda")
# Enable these optimizations to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
prompt=prompt,
guidance_scale=7.0,
num_images_per_prompt=1,
num_inference_steps=50,
width=1024,
height=1024,
).images[0]
image.save("cogview3.png")
```
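Note that the table above recommends BF16 for CogView3-Plus-3B. If your GPU supports bfloat16, you can load the pipeline with that precision instead; only the dtype argument changes:
```python
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.bfloat16).to("cuda")
```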
For more inference code, please refer to [inference](inference/cli_demo.py). The folder also contains a simple Gradio-based web UI.
### Inference Model (SAT)
Please check the [sat](sat/README.md) tutorial for step-by-step instructions on model inference.
### Open Source Plan
Since the project is in its early stages, we are working on the following:
+ [ ] Fine-tuning the SAT version of CogView3-Plus-3B, including SFT and LoRA fine-tuning
+ [X] Inference with the Diffusers library version of the CogView3-Plus-3B model
+ [ ] Fine-tuning the Diffusers library version of the CogView3-Plus-3B model
+ [ ] Related work for the CogView3-Plus-3B model, including ControlNet and other tasks.
## CogView3 (ECCV'24)
Official paper
repository: [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://arxiv.org/abs/2403.05121)
CogView3 is a novel text-to-image generation system using relay diffusion. It breaks down the process of generating
high-resolution images into multiple stages. Through the relay super-resolution process, Gaussian noise is added to
low-resolution generation results, and the diffusion process begins from these noisy images. Our results show that
CogView3 outperforms SDXL with a winning rate of 77.0%. Additionally, through progressive distillation of the diffusion
model, CogView3 can generate comparable results while reducing inference time to only 1/10th of SDXL's.
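Schematically, the relay super-resolution stage does not start from pure noise: the base-stage output is upsampled, perturbed with Gaussian noise at an intermediate step, and denoising proceeds from there. A simplified sketch of the idea (our notation, not the paper's exact formulation):
```latex
x_{T_r} = \mathrm{upsample}(x_{\mathrm{LR}}) + \sigma_{T_r}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
% Denoising then runs from the intermediate step T_r down to 0,
% instead of starting from pure noise at step T.
```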
![CogView3 Showcase](resources/CogView3_showcase.png)
![CogView3 Pipeline](resources/CogView3_pipeline.jpg)
Comparison results from human evaluations:
![CogView3 Evaluation](resources/CogView3_evaluation.png)
## Citation
🌟 If you find our work helpful, feel free to cite our paper and leave a star.
```bibtex
@article{zheng2024cogview3,
title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
journal={arXiv preprint arXiv:2403.05121},
year={2024}
}
```
We welcome your contributions! Click [here](resources/contribute.md) for more information.
## Model License
This codebase is released under the [Apache 2.0 License](LICENSE).
The CogView3-Base, CogView3-Relay, and CogView3-Plus models (including the UNet module, Transformers module, and VAE
module) are released under the [Apache 2.0 License](LICENSE).
# CogView3 & CogView-3Plus
[Read this in English](./README.md)
<div align="center">
<img src=resources/logo.svg width="50%"/>
</div>
<p align="center">
<a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space" target="_blank"> 🤗 Huggingface Space</a> 在线体验 CogView3-Plus-3B 模型
</p>
<p align="center">
📚 查看 <a href="https://arxiv.org/abs/2403.05121" target="_blank">论文</a>
</p>
<p align="center">
👋 加入我们的 <a href="resources/WECHAT.md" target="_blank">微信</a>
</p>
<p align="center">
📍 前往<a href="https://chatglm.cn/main/gdetail/65a232c082ff90a2ad2f15e2?fr=osm_cogvideox&lang=zh"> 清言 </a><a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9"> API平台</a> 体验更大规模的商业版视频生成模型。
</p>
## 项目更新
- 🔥🔥 ```2024/10/13```: 我们适配和开源了 [diffusers](https://github.com/huggingface/diffusers) 版本的 **CogView-3Plus-3B** 模型。你可以前往[在线体验](https://huggingface.co/spaces/THUDM-HF-SPACE/CogView3-Plus-3B-Space)。
- 🔥 ```2024/9/29```: 我们已经开源了 **CogView3** 以及 **CogView-3Plus-3B**。**CogView3** 是一个基于级联扩散的文本生成图像系统,采用了接力扩散框架。**CogView-3Plus** 是一系列新开发的基于 Diffusion Transformer 的文本生成图像模型。
## 模型介绍
CogView-3-Plus 在 CogView3(ECCV'24)的基础上引入了最新的 DiT 框架,以实现整体性能的进一步提升。CogView-3-Plus 采用了 Zero-SNR 扩散噪声调度,并引入了文本-图像联合注意力机制。与常用的 MMDiT 结构相比,它在保持模型基本能力的同时,有效降低了训练和推理成本。CogView-3-Plus 使用潜在维度为 16 的 VAE。
下表显示了我们目前提供的文本生成图像模型列表及其基础信息。
<table style="border-collapse: collapse; width: 100%;">
<tr>
<th style="text-align: center;">模型名称</th>
<th style="text-align: center;">CogView3-Base-3B</th>
<th style="text-align: center;">CogView3-Base-3B-distill</th>
<th style="text-align: center;">CogView3-Plus-3B</th>
</tr>
<tr>
<td style="text-align: center;">模型描述</td>
<td style="text-align: center;">CogView3 的基础阶段和接力阶段模型,支持 512x512 文本生成图像以及 2x 超分辨率生成。</td>
<td style="text-align: center;">CogView3 的蒸馏版本,分别在两个阶段采样 4 和 1 步(或 8 和 2 步)。</td>
<td style="text-align: center;">DIT 版本的图像生成模型 ,支持从 512 到 2048 范围内的图像生成。</td>
<tr>
<td style="text-align: center;">分辨率</td>
<td colspan="2" style="text-align: center;">512 * 512</td>
<td style="text-align: center;">
512 <= H, W <= 2048 <br>
H * W <= 2<sup>21</sup> <br>
H, W mod 32 = 0
</td>
</tr>
<tr>
<td style="text-align: center;">推理精度</td>
<td colspan="2" style="text-align: center;"><b>FP16(推荐)</b>, BF16, FP32</td>
<td style="text-align: center;"><b>BF16*(推荐)</b>, FP16, FP32</td>
</tr>
<tr>
<td style="text-align: center;"> 显存占用 (bs = 4)</td>
<td style="text-align: center;"> 17G </td>
<td style="text-align: center;"> 64G </td>
<td style="text-align: center;"> 30G(2048 * 2048) <br> 20G(1024 * 1024) </td>
</tr>
<tr>
<td style="text-align: center;">提示词语言</td>
<td colspan="3" style="text-align: center;">English*</td>
</tr>
<tr>
<td style="text-align: center;">提示词长度上限</td>
<td colspan="2" style="text-align: center;">225 Tokens</td>
<td style="text-align: center;">224 Tokens</td>
</tr>
<tr>
<td style="text-align: center;">下载链接 (SAT)</td>
<td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
</tr>
<tr>
<td style="text-align: center;">下载链接 (Diffusers)</td>
<td colspan="2" style="text-align: center;"> 未适配 </td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogView3-Plus-3B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogView3-Plus-3B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogView3-Plus-3B">🟣 WiseModel</a></td>
</tr>
</table>
**数据解释**
+ 所有推理测试均在单卡 A100 上运行,批量大小为 4,并使用`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`以节约显存。
+ 模型仅支持英语输入,其他语言可以通过大模型润色时翻译为英语。
+ 本次测试均使用`SAT`框架,众多优化点尚未完善。我们会联合社区一起制作`diffusers`库版本的模型;`diffusers`仓库支持后,将会使用`diffusers`重新测试。预计将于 2024 年 11 月发布。
## 快速开始
### 提示词优化
虽然 CogView3 系列模型都是通过长篇合成图像描述进行训练的,但我们强烈建议在文本生成图像之前,基于大语言模型(LLMs)进行提示词的重写操作,这将大大提高生成质量。
我们提供了一个 [示例脚本](prompt_optimize.py),建议您运行这个脚本,以实现对提示词的润色:
```shell
python prompt_optimize.py --api_key "智谱AI API Key" --prompt {你的提示词} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus"
```
### 推理模型(Diffusers)
首先,确保从源代码安装`diffusers`库。
```shell
pip install git+https://github.com/huggingface/diffusers.git
```
接着,运行以下代码:
```python
from diffusers import CogView3PlusPipeline
import torch
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.float16).to("cuda")
# Enable these optimizations to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
prompt=prompt,
guidance_scale=7.0,
num_images_per_prompt=1,
num_inference_steps=50,
width=1024,
height=1024,
).images[0]
image.save("cogview3.png")
```
更多推理代码,请关注 [inference](inference/cli_demo.py),该文件夹还包含一个用 Gradio 封装的简单 Web UI 代码。
### 推理模型 (SAT)
请查看 [sat](sat/README_zh.md) 手把手教程实现模型推理。
### 开源计划
由于项目处于初步阶段,我们正在制作以下内容:
+ [ ] CogView3-Plus-3B SAT 版本的模型微调,包括 SFT 和 LoRA 微调
+ [X] CogView3-Plus-3B Diffusers 库版本模型的推理
+ [ ] CogView3-Plus-3B Diffusers 库版本模型的微调
+ [ ] CogView3-Plus-3B 模型相关周边,包括 ControlNet 等工作。
## CogView3(ECCV'24)
官方论文仓库:[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://arxiv.org/abs/2403.05121)
CogView3 是一种新颖的文本生成图像系统,采用接力扩散的方式,将生成高分辨率图像的过程分解为多个阶段。通过接力的超分辨率过程,对低分辨率生成结果添加高斯噪声,并从这些带噪声的图像开始扩散。我们的结果显示,CogView3 的表现优于 SDXL,获胜率达到 77.0%。此外,通过对扩散模型的逐步蒸馏,CogView3 能够在推理时间仅为 SDXL 1/10 的情况下,生成可比的结果。
![CogView3 示例](resources/CogView3_showcase.png)
![CogView3 流程](resources/CogView3_pipeline.jpg)
人类评估的对比结果:
![CogView3 evaluation](resources/CogView3_evaluation.png)
## 引用
🌟 如果您发现我们的工作有所帮助,欢迎引用我们的文章,并留下宝贵的 star。
```bibtex
@article{zheng2024cogview3,
title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
journal={arXiv preprint arXiv:2403.05121},
year={2024}
}
```
我们欢迎您的贡献,您可以点击[这里](resources/contribute_zh.md)查看更多信息。
## 模型协议
该代码库基于 [Apache 2.0 License](LICENSE) 协议发布。
CogView3-Base、CogView3-Relay 和 CogView3-Plus 模型(包括 UNet 模块、Transformers 模块和 VAE
模块)基于 [Apache 2.0 License](LICENSE) 协议发布。
"""
This script demonstrates how to generate an image using the CogView3-Plus-3B model with the Hugging Face `diffusers` pipeline.
It showcases memory-efficient techniques like model offloading, VAE slicing, and tiling to reduce memory consumption during inference.
The prompt describes an image to be generated by the model, and the final image is saved to disk.
Running the Script:
To run the script, use the following command with appropriate arguments:
```bash
python cli_demo.py --prompt "A beautiful sunset over a mountain" --width 1024 --height 1024
```
Additional options are available to specify the model path, guidance scale, number of inference steps, image generation type, and output paths.
"""
from diffusers import CogView3PlusPipeline
import torch
import argparse
def generate_image(
prompt, model_path, guidance_scale, num_images_per_prompt, num_inference_steps, width, height, output_path, dtype
):
# Load the pre-trained model with the specified precision
pipe = CogView3PlusPipeline.from_pretrained(model_path, torch_dtype=dtype)
# Enable CPU offloading to free up GPU memory when layers are not actively being used
pipe.enable_model_cpu_offload()
# Enable VAE slicing and tiling for memory optimization
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
# Generate the image based on the prompt
image = pipe(
prompt=prompt,
guidance_scale=guidance_scale,
num_images_per_prompt=num_images_per_prompt,
num_inference_steps=num_inference_steps,
width=width,
height=height,
).images[0]
# Save the generated image to the local file system
image.save(output_path)
print(f"Image saved to {output_path}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate an image using the CogView3-Plus-3B model.")
# Define arguments for prompt, model path, etc.
parser.add_argument("--prompt", type=str, required=True, help="The text description for generating the image.")
parser.add_argument(
"--model_path", type=str, default="THUDM/CogView3-Plus-3B", help="Path to the pre-trained model."
)
parser.add_argument(
"--guidance_scale", type=float, default=7.0, help="The guidance scale for classifier-free guidance."
)
parser.add_argument(
"--num_images_per_prompt", type=int, default=1, help="Number of images to generate per prompt."
)
parser.add_argument("--num_inference_steps", type=int, default=50, help="Number of denoising steps for inference.")
parser.add_argument("--width", type=int, default=1024, help="Width of the generated image.")
parser.add_argument("--height", type=int, default=1024, help="Height of the generated image.")
parser.add_argument("--output_path", type=str, default="cogview3.png", help="Path to save the generated image.")
parser.add_argument("--dtype", type=str, default="bfloat16", help="Precision type (float16 or bfloat16).")
# Parse the arguments
args = parser.parse_args()
# Convert dtype argument to torch dtype
dtype = torch.bfloat16 if args.dtype == "bfloat16" else torch.float16
# Call the function to generate the image
generate_image(
prompt=args.prompt,
model_path=args.model_path,
guidance_scale=args.guidance_scale,
num_images_per_prompt=args.num_images_per_prompt,
num_inference_steps=args.num_inference_steps,
width=args.width,
height=args.height,
output_path=args.output_path,
dtype=dtype,
)
"""
This is the main file for the Gradio web demo. It uses the CogView3-Plus-3B model to generate images.
Set the OPENAI_API_KEY environment variable to use an OpenAI-compatible API to enhance prompts.
Usage:
OPENAI_API_KEY=your_openai_api_key OPENAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_web_demo.py
"""
import os
import re
import threading
import time
from datetime import datetime, timedelta
import gradio as gr
import random
from diffusers import CogView3PlusPipeline
import torch
from openai import OpenAI
import gc
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.bfloat16).to(device)
os.makedirs("./gradio_tmp", exist_ok=True)
def clean_string(s):
s = s.replace("\n", " ")
s = s.strip()
s = re.sub(r"\s{2,}", " ", s)
return s
def convert_prompt(
prompt: str,
retry_times: int = 5,
) -> str:
if not os.environ.get("OPENAI_API_KEY"):
return prompt
client = OpenAI()
system_instruction = """
You are part of a team of bots that creates images. You work with an assistant bot that will draw anything you say.
For example, outputting "a beautiful morning in the woods with the sun peeking through the trees" will trigger your partner bot to output an image of a forest morning, as described.
You will be prompted by people looking to create detailed, amazing images. The way to accomplish this is to take their short prompts and make them extremely detailed and descriptive.
There are a few rules to follow :
- Prompt should always be written in English, regardless of the input language. Please provide the prompts in English.
- You will only ever output a single image description per user request.
- Image descriptions must be detailed and specific, including keyword categories such as subject, medium, style, additional details, color, and lighting.
- When generating descriptions, focus on portraying the visual elements rather than delving into abstract psychological and emotional aspects. Provide clear and concise details that vividly depict the scene and its composition, capturing the tangible elements that make up the setting.
- Do not provide the process and explanation, just return the modified English description. Image descriptions must be between 100-200 words. Extra words will be ignored.
"""
text = prompt.strip()
for i in range(retry_times):
try:
response = client.chat.completions.create(
messages=[
{"role": "system", "content": f"{system_instruction}"},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "一个头发花白的老人"',
},
{
"role": "assistant",
"content": "A seasoned male with white hair and a neatly groomed beard stands confidently, donning a dark vest over a striped shirt. His hands are clasped together in front, one adorned with a ring, as he looks directly at the viewer with a composed expression. The soft lighting accentuates his features and the subtle textures of his attire, creating a portrait that exudes sophistication and a timeless elegance.",
},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "画一只老鹰"',
},
{
"role": "assistant",
"content": "A majestic eagle with expansive brown and white wings glides through the air, its sharp yellow eyes focused intently ahead. The eagle's talons are poised and ready for hunting, as it soars over a rugged mountainous terrain dusted with snow, under a soft blue sky.",
},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "画一辆摩托车"',
},
{
"role": "assistant",
"content": "Parked on a wet city street at night, a sleek motorcycle with a black and green design stands out. Its headlights cast a soft glow, reflecting off the puddles and highlighting its aerodynamic shape. The design is marked by sharp lines and angular features, with gold accents that shine against the dark backdrop. The motorcycle exudes an air of performance and luxury, ready to slice through the urban landscape.",
},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "穿着金色盔甲的人"',
},
{
"role": "assistant",
"content": "A figure clad in meticulously crafted, golden armor stands with an air of quiet confidence. The armor, reminiscent of medieval knight attire, features a scalloped design with leaf-like patterns and is complemented by a black, form-fitting undergarment. The helmet, with its angular visor, adds to the intimidating presence. This armor, with its rich gold tones and intricate details, suggests a character of nobility or mythical origin, poised for valorous endeavors.",
},
{
"role": "user",
"content": f'Create an imaginative image descriptive caption for the user input : "{text}"',
},
],
model="glm-4-plus",
temperature=0.01,
top_p=0.7,
stream=False,
max_tokens=300,
)
prompt = response.choices[0].message.content
if prompt:
prompt = clean_string(prompt)
break
        except Exception:
            # Ignore transient API errors and retry; the original prompt is
            # returned unchanged if all attempts fail.
            pass
return prompt
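# Example usage (a sketch; assumes OPENAI_API_KEY, and optionally OPENAI_BASE_URL,
# are set in the environment -- otherwise the prompt is returned unchanged):
#
#   enhanced = convert_prompt("一只戴帽子的猫")
#   print(enhanced)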
def delete_old_files():
while True:
now = datetime.now()
cutoff = now - timedelta(minutes=5)
directories = ["./gradio_tmp"]
for directory in directories:
for filename in os.listdir(directory):
file_path = os.path.join(directory, filename)
if os.path.isfile(file_path):
file_mtime = datetime.fromtimestamp(os.path.getmtime(file_path))
if file_mtime < cutoff:
os.remove(file_path)
time.sleep(600)
threading.Thread(target=delete_old_files, daemon=True).start()
def infer(
prompt,
seed,
randomize_seed,
width,
height,
guidance_scale,
num_inference_steps,
progress=gr.Progress(track_tqdm=True),
):
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
if randomize_seed:
seed = random.randint(0, 65536)
image = pipe(
prompt=prompt,
guidance_scale=guidance_scale,
num_images_per_prompt=1,
num_inference_steps=num_inference_steps,
width=width,
height=height,
generator=torch.Generator().manual_seed(seed),
).images[0]
return image, seed
examples = [
"A vintage pink convertible with glossy chrome finishes and whitewall tires sits parked on an open road, surrounded by a field of wildflowers under a clear blue sky. The car's body is a delicate pastel pink, complementing the vibrant greens and colors of the meadow. Its interior boasts cream leather seats and a polished wooden dashboard, evoking a sense of classic elegance. The sun casts a soft light on the vehicle, highlighting its curves and shiny surfaces, creating a picture of nostalgia mixed with dreamy escapism.",
"A noble black Labrador retriever sits serenely in a sunlit meadow, its glossy coat absorbing the golden rays of a late afternoon sun. The dog's intelligent eyes sparkle with a mixture of curiosity and loyalty, as it gazes off into the distance where the meadow meets a line of tall, slender birch trees. The dog's posture is regal, yet approachable, with its tongue playfully hanging out to the side slightly, suggesting a friendly disposition. The idyllic setting is filled with the vibrant greens of lush grass and the soft colors of wildflowers speckled throughout, creating a peaceful harmony between the dog and its natural surroundings.",
"A vibrant red-colored dog of medium build stands attentively in an autumn forest setting. Its fur is a deep, rich red, reminiscent of autumn leaves, contrasting with its bright, intelligent eyes, a clear sky blue. The dog's ears perk up, and its tail wags slightly as it looks off into the distance, its posture suggesting alertness and curiosity. Golden sunlight filters through the canopy of russet and gold leaves above, casting dappled light onto the forest floor and the glossy coat of the canine, creating a serene and heartwarming scene.",
]
css = """
#col-container {
margin: 0 auto;
max-width: 640px;
}
"""
with gr.Blocks(css=css) as demo:
with gr.Column(elem_id="col-container"):
gr.Markdown(f"""
<div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
CogView3-Plus Huggingface Space🤗
</div>
<div style="text-align: center;">
<a href="https://huggingface.co/THUDM/CogView3-Plus">🤗 Model Hub |
<a href="https://github.com/THUDM/CogView3">🌐 Github</a> |
<a href="https://arxiv.org/abs/2403.05121">📜 arxiv </a>
</div>
<div style="text-align: center;display: flex;justify-content: center;align-items: center;margin-top: 1em;margin-bottom: .5em;">
<span>If the Space is too busy, duplicate it to use privately</span>
<a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogView-3-Plus?duplicate=true"><img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/duplicate-this-space-lg.svg" width="160" style="
margin-left: .75em;
"></a>
</div>
<div style="text-align: center; font-size: 15px; font-weight: bold; color: red; margin-bottom: 20px;">
⚠️ This demo is for academic research and experiential use only.
</div>
""")
with gr.Row():
prompt = gr.Text(
label="Prompt",
show_label=False,
max_lines=3,
placeholder="Enter your prompt",
container=False,
)
with gr.Row():
            enhance = gr.Button("Enhance Prompt (Strongly Suggested)", scale=1)
enhance.click(convert_prompt, inputs=[prompt], outputs=[prompt])
run_button = gr.Button("Run", scale=1)
result = gr.Image(label="Result", show_label=False)
with gr.Accordion("Advanced Settings", open=False):
seed = gr.Slider(
label="Seed",
minimum=0,
maximum=65536,
step=1,
value=0,
)
randomize_seed = gr.Checkbox(label="Randomize seed", value=True)
with gr.Row():
width = gr.Slider(
label="Width",
minimum=512,
maximum=2048,
step=32,
value=1024,
)
height = gr.Slider(
label="Height",
minimum=512,
maximum=2048,
step=32,
value=1024,
)
with gr.Row():
guidance_scale = gr.Slider(
label="Guidance scale",
minimum=0.0,
maximum=10.0,
step=0.1,
value=7.0,
)
num_inference_steps = gr.Slider(
label="Number of inference steps",
minimum=10,
maximum=100,
step=1,
value=50,
)
gr.Examples(examples=examples, inputs=[prompt])
gr.on(
triggers=[run_button.click, prompt.submit],
fn=infer,
inputs=[prompt, seed, randomize_seed, width, height, guidance_scale, num_inference_steps],
outputs=[result, seed],
)
demo.queue().launch()
deepspeed>=0.15.1
transformers>=4.45.0
gradio>=5.0.2
accelerate>=1.0.0
diffusers
sentencepiece>=0.2.0
torch>=2.4.1
openai
import re
import argparse
from openai import OpenAI
import traceback
def clean_string(s):
s = s.replace("\n", " ")
s = s.strip()
s = re.sub(r"\s{2,}", " ", s)
return s
def upsample_prompt(
prompt: str,
api_key: str,
url: str,
model: str
) -> str:
client = OpenAI(api_key=api_key, base_url=url)
system_instruction = """
You are part of a team of bots that creates images. You work with an assistant bot that will draw anything you say.
For example, outputting "a beautiful morning in the woods with the sun peeking through the trees" will trigger your partner bot to output an image of a forest morning, as described.
You will be prompted by people looking to create detailed, amazing images. The way to accomplish this is to take their short prompts and make them extremely detailed and descriptive.
There are a few rules to follow :
- Prompt should always be written in English, regardless of the input language. Please provide the prompts in English.
- You will only ever output a single image description per user request.
- Image descriptions must be detailed and specific, including keyword categories such as subject, medium, style, additional details, color, and lighting.
- When generating descriptions, focus on portraying the visual elements rather than delving into abstract psychological and emotional aspects. Provide clear and concise details that vividly depict the scene and its composition, capturing the tangible elements that make up the setting.
- Do not provide the process and explanation, just return the modified English description. Image descriptions must be between 100-200 words. Extra words will be ignored.
"""
text = prompt.strip()
try:
response = client.chat.completions.create(
messages=[
{"role": "system", "content": f"{system_instruction}"},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "一个头发花白的老人"',
},
{
"role": "assistant",
"content": "A seasoned male with white hair and a neatly groomed beard stands confidently, donning a dark vest over a striped shirt. His hands are clasped together in front, one adorned with a ring, as he looks directly at the viewer with a composed expression. The soft lighting accentuates his features and the subtle textures of his attire, creating a portrait that exudes sophistication and a timeless elegance.",
},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "画一只老鹰"',
},
{
"role": "assistant",
"content": "A majestic eagle with expansive brown and white wings glides through the air, its sharp yellow eyes focused intently ahead. The eagle's talons are poised and ready for hunting, as it soars over a rugged mountainous terrain dusted with snow, under a soft blue sky.",
},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "画一辆摩托车"',
},
{
"role": "assistant",
"content": "Parked on a wet city street at night, a sleek motorcycle with a black and green design stands out. Its headlights cast a soft glow, reflecting off the puddles and highlighting its aerodynamic shape. The design is marked by sharp lines and angular features, with gold accents that shine against the dark backdrop. The motorcycle exudes an air of performance and luxury, ready to slice through the urban landscape.",
},
{
"role": "user",
"content": 'Create an imaginative image descriptive caption for the user input : "穿着金色盔甲的人"',
},
{
"role": "assistant",
"content": "A figure clad in meticulously crafted, golden armor stands with an air of quiet confidence. The armor, reminiscent of medieval knight attire, features a scalloped design with leaf-like patterns and is complemented by a black, form-fitting undergarment. The helmet, with its angular visor, adds to the intimidating presence. This armor, with its rich gold tones and intricate details, suggests a character of nobility or mythical origin, poised for valorous endeavors.",
},
{
"role": "user",
"content": f'Create an imaginative image descriptive caption for the user input : "{text}"',
},
],
model=model,
temperature=0.01,
top_p=0.7,
stream=False,
max_tokens=300,
)
prompt = response.choices[0].message.content
if prompt:
prompt = clean_string(prompt)
except Exception as e:
traceback.print_exc()
return prompt
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--api_key", type=str, help="api key")
parser.add_argument("--prompt", type=str, help="Prompt to upsample")
parser.add_argument(
"--base_url",
type=str,
default="https://open.bigmodel.cn/api/paas/v4",
help="base url"
)
parser.add_argument(
"--model",
type=str,
default="glm-4-plus",
help="LLM using for upsampling"
)
args = parser.parse_args()
api_key = args.api_key
prompt = args.prompt
prompt_enhanced = upsample_prompt(
prompt=prompt,
api_key=api_key,
url=args.base_url,
model=args.model
)
print(prompt_enhanced)
[tool.ruff]
line-length = 119
[tool.ruff.lint]
# Never enforce `E501` (line length violations).
ignore = ["C901", "E501", "E741", "F402", "F823"]
select = ["C", "E", "F", "I", "W"]
# Ignore import violations in all `__init__.py` files.
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402", "F401", "F403", "F811"]
[tool.ruff.lint.isort]
lines-after-imports = 2
[tool.ruff.format]
# Like Black, use double quotes for strings.
quote-style = "double"
# Like Black, indent with spaces, rather than tabs.
indent-style = "space"
# Like Black, respect magic trailing commas.
skip-magic-trailing-comma = false
# Like Black, automatically detect the appropriate line ending.
line-ending = "auto"
<div align="center">
<img src=wechat.jpg width="60%"/>
<p> 扫码关注公众号,加入「 CogView 交流群」 </p>
<p> Scan the QR code to follow the official account and join the "CogView Discussion Group" </p>
</div>
# Contribution Guide
There may still be many incomplete aspects in this project.
We look forward to your contributions to the repository in the following areas. If you complete work in any of them
and are willing to submit a PR and share it with the community, we will, upon review, acknowledge your contribution
on the project homepage.
## Model Algorithms
- Support for model quantization inference (Int4 quantization project)
- Optimization of model fine-tuning data loading (replacing the existing decord tool)
## Model Engineering
- `diffusers` version of the model implementation
- Model fine-tuning examples / Best prompt practices
- Inference adaptation on different devices (e.g., MLX framework)
- Any tools related to the model
## Code Standards
Good code style is an art. We have prepared a `pyproject.toml` configuration file for the project to standardize code
style. You can organize the code according to the following specifications:
1. Install the `ruff` tool
```shell
pip install ruff
```
2. Run `ruff` to check the code style:
```shell
ruff check tools sat inference
```
3. If style issues are reported, many can be fixed automatically with `ruff format`:
```shell
ruff format tools sat inference
```
Once your code meets the standard, there should be no errors.
## Naming Conventions
1. Please use English names; do not use Pinyin or names in other languages. All comments should be in English.
2. Please strictly follow PEP 8 and use underscores to separate words. Do not use names like a, b, c.
# 贡献指南
本项目可能还存在很多不完善的内容。我们期待您在以下方面与我们共建仓库,如果您完成了相关工作并愿意提交 PR 分享到社区,在通过审核后,我们将在项目首页感谢您的贡献。
## 模型工程
- `diffusers` 版本的模型实现
- 模型微调示例 / 最佳提示词实践
- 不同设备上的推理适配(MLX等框架)
- 任何模型周边工具
## 代码规范
良好的代码风格是一种艺术,我们已经为项目准备好了`pyproject.toml`配置文件,用于规范代码风格。您可以按照以下规范梳理代码:
1. 安装`ruff`工具
```shell
pip install ruff
```
接着,运行`ruff`工具
```shell
ruff check tools sat inference
```
检查代码风格,如果有问题,您可以通过`ruff format`命令自动修复:
```shell
ruff format tools sat inference
```
如果您的代码符合规范,应该不会出现任何的错误。
## 命名规范
- 请使用英文命名,不要使用拼音或者其他语言命名。所有的注释均使用英文。
- 请严格遵循 PEP8 规范,使用下划线分割单词。请勿使用 a,b,c 这样的命名。