Commit 2a934cec authored by raojy

first

parent 4b618aa3
# SenseNova-U1
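## Paper
[SenseNova-U1](https://arxiv.org/abs/2605.12500)
## Model Overview
A 16-billion-parameter MoE (Mixture-of-Experts) unified diffusion large language model released by Inclusion AI. It is built on the masked token prediction paradigm to cover both multimodal understanding and generation, uses the SigLIP-VQ visual tokenizer for efficient visual encoding, and pairs it with a distilled diffusion decoder that needs only 8 steps to generate high-resolution images. It supports text-to-image generation, image-text understanding, instruction-based image editing, and generation with chain-of-thought reasoning, and it ships with the SPRINT inference acceleration scheme to substantially improve speed. It is released under the Apache 2.0 license, and loading the complete model weights is enough to run the full range of multimodal tasks, making it a general-purpose multimodal model for both understanding and creation.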
<div align=center>
<img src="./doc/1.png"/>
</div>
## Environment Dependencies
| Software | Version |
| :------: |:-----------------------------------------:|
| DTK | 26.04 |
| Python | 3.11.9 |
| Transformers | 4.57.1 |
| Torch | 2.5.1+das.opt1.dtk2604 |
| Flash_attn | 2.8.3+das.opt1.dtk2604.torch251 |
Recommended image: harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm011-ubuntu22.04-dtk26.04-nova
```bash
docker run -it \
--shm-size 256g \
--network=host \
--name nova \
--privileged \
--device=/dev/kfd \
--device=/dev/dri \
--device=/dev/mkfd \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-u root \
-v /opt/hyhal/:/opt/hyhal/:ro \
-v /path/your_code_data/:/path/your_code_data/ \
harbor.sourcefind.cn:5443/dcu/admin/base/custom:vllm011-ubuntu22.04-dtk26.04-nova bash
```
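More images are available for download from [光源 (SourceFind)](https://sourcefind.cn/#/service-list).
The special deep learning libraries that this project requires for DCU cards can be downloaded and installed from the [光合 developer community](https://developer.sourcefind.cn/tool/).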
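Before running inference, it can help to confirm that the container matches the version table above. The following is a minimal sketch using only the standard library; the distribution names (`transformers`, `torch`, `flash_attn`) are assumptions about how the packages are registered in the image, and `check_versions` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Versions copied from the environment table above.
# The distribution names used as keys are assumptions.
EXPECTED = {
    "transformers": "4.57.1",
    "torch": "2.5.1+das.opt1.dtk2604",
    "flash_attn": "2.8.3+das.opt1.dtk2604.torch251",
}


def check_versions(expected, lookup=metadata.version):
    """Return {name: (wanted, installed_or_None, matches)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```

A mismatch is not necessarily fatal, but it is a reasonable first thing to rule out when results differ from those shown here.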
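## Pretrained Weights
**Choose the model to download according to the `Supported DCU Models` column. FP8 models are supported only on BW1100/BW1101; do not use them on other models!**
| Model Name | Weight Size | Data Type | Supported DCU Models | Minimum Cards Required | Download Link |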
|:------:|:----:|:----:|:----------:|:------:|:---------------------:|
| SenseNova-U1-8B-MoT | 8B | BF16 | BW1000 | 1 | [Modelscope](https://modelscope.cn/models/SenseNova/SenseNova-U1-8B-MoT) |
## Dataset
Not available yet.
## Training
Not available yet.
## Inference
### Transformers
#### Single-Node Inference
##### BF16
##### Visual Understanding
```bash
python examples/vqa/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --image examples/vqa/data/images/menu.jpg --question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." --output outputs/answer.txt --max_new_tokens 8192 --do_sample --temperature 0.6 --top_p 0.95 --top_k 20 --repetition_penalty 1.05 --profile
```
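The command above passes `--temperature 0.6 --top_p 0.95 --top_k 20`. As a rough illustration of what these sampling flags usually mean, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. It is a generic textbook version, not the code in `examples/vqa/inference.py`; the function name and the order of the filters are assumptions, and `--repetition_penalty` is left out.

```python
import math
import random


def filter_top_k_top_p(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the top_k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample(logits, rng=None, **kwargs):
    """Draw one token index from the filtered distribution."""
    rng = rng or random.Random()
    dist = filter_top_k_top_p(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]
```

Lower temperatures and smaller `top_p`/`top_k` values concentrate probability on fewer tokens, which makes the answers more deterministic.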
##### Text-to-Image
```bash
python examples/t2i/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型规格”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-8B-MoT”,下方是等宽字体说明“8B MoT 密集主干模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-A3B-MoT”,下方是等宽字体说明“A3B MoT 混合专家(MoE)主干模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,性价比出色”。" --width 2720 --height 1536 --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output.png --profile
```
##### Image Editing
```bash
python examples/editing/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "Change the animal's fur color to a darker shade." --image examples/editing/data/images/1.webp --cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output_edited.png --profile --compare
```
##### Interleaved Image-Text Generation
```bash
python examples/interleave/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." --resolution "16:9" --output_dir outputs/interleave/ --stem demo --profile
```
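The four commands above share the same shape: a script path, `--model_path`, and a set of flags. Long prompts with quotes are easy to get wrong in a shell, so one option is to build the argument list in Python instead. The sketch below is only an illustration; `SCRIPTS` and `build_command` are hypothetical names, while the script paths and flags are the ones used in the commands above.

```python
import shlex

# Script paths taken from the inference commands above.
SCRIPTS = {
    "vqa": "examples/vqa/inference.py",
    "t2i": "examples/t2i/inference.py",
    "editing": "examples/editing/inference.py",
    "interleave": "examples/interleave/inference.py",
}


def build_command(task, model_path, **flags):
    """Build an argv list; True values become bare switches such as --profile."""
    argv = ["python", SCRIPTS[task], "--model_path", model_path]
    for name, value in flags.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False:
            continue  # switch disabled, emit nothing
        else:
            argv += [f"--{name}", str(value)]
    return argv


if __name__ == "__main__":
    cmd = build_command(
        "editing",
        "sensenova/SenseNova-U1-8B-MoT",
        prompt="Change the animal's fur color to a darker shade.",
        image="examples/editing/data/images/1.webp",
        cfg_scale=4.0,
        num_steps=50,
        output="output_edited.png",
        profile=True,
        compare=True,
    )
    # shlex.join produces a shell-safe string for logging or copy-pasting;
    # subprocess.run(cmd, check=True) would execute it without a shell.
    print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run` avoids shell quoting altogether, which matters for prompts that contain quotes or newlines.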
## Results Showcase
<div align=center>
<img src="./doc/ou0.png"/>
</div>
<div align=center>
<img src="./doc/output1.png"/>
</div>
<div align=center>
<img src="./doc/33.png"/>
</div>
### Accuracy
Accuracy on DCU matches that on GPU. Inference framework: PyTorch.
## Source Repository and Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/sensenova-u1
## References
- https://github.com/OpenSenseNova/SenseNova-U1
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.14.4
    hooks:
      - id: ruff
        name: ruff check (import sorting)
        args: ["--select", "I", "--fix", "--exit-non-zero-on-fix"]
      - id: ruff-format
        name: ruff format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: pretty-format-json
        name: format ComfyUI example workflows
        args: ["--autofix", "--indent=2", "--no-sort-keys", "--no-ensure-ascii"]
        files: ^apps/comfyui/example_workflows/.*\.json$
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
<p align="center">
<strong>English</strong> | <a href="./README_CN.md">简体中文</a>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2605.12500"><img src="https://img.shields.io/badge/arXiv-2605.12500-b31b1b.svg" alt="arXiv"></a>
<a href="https://huggingface.co/collections/sensenova/sensenova-u1"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow" alt="HuggingFace Model"></a>
<a href="https://modelscope.cn/collections/SenseNova/SenseNova-U1"><img src="https://img.shields.io/badge/%F0%9F%A4%96%20ModelScope-模型-purple" alt="ModelScope-模型"></a>
<a href="https://unify.light-ai.top/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20SenseNova_U1-Demo-Green" alt="SenseNova-U1 Demo"></a>
<a href="./LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License"></a>
<a href="https://discord.com/invite/BuTXPHmQub"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
</p>
<p align="center">
<img src="docs/assets/teaser.webp" alt="SenseNova-U1" width="900">
</p>
<p align="center">
<img src="docs/assets/teaser_2.webp" alt="visualization" width="900">
</p>
## 📣 Updated News
- `[2026.05.15]` Release [SenseNova-U1-8B-MoT-Infographic 📊](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-Infographic) model for improved infographic generation. See [U1 Infographic Model](docs/u1_infographic_model.md) for details, and [✨ Infographic Showcases](docs/u1_infographic_showcases.md) for 100 generated examples.
- `[2026.05.10]` Release [🔥SenseNova-U1 Technical Report🔥](https://github.com/OpenSenseNova/SenseNova-U1/blob/main/docs/pdf/SenseNOVA_U1.pdf) and the weights for [SenseNova-U1-A3B-MoT-SFT](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT-SFT) & [SenseNova-U1-A3B-MoT](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT).
- `[2026.05.08]` Add **GGUF quantized checkpoints** and **layer-offload VRAM modes** for low-VRAM single-GPU inference. See [Memory-efficient inference](#-memory-efficient-inference-gguf--vram-modes). GGUF weights for `SenseNova-U1-8B-MoT-Merger` are available at [🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf](https://huggingface.co/smthem/SenseNova-U1-8B-MoT-Merger-gguf) — many thanks to [@smthem](https://github.com/smthem) for contributing the quantized weights.
- `[2026.05.06]` Release [SenseNova-U1-8B-MoT-LoRA-8step-V1.0](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-LoRAs/blob/main/SenseNova-U1-8B-MoT-LoRA-8step-V1.0.safetensors). Please see the [example script](docs/base_vs_distill.md#run-base-and-distilled-model).
- `[2026.04.30]` Release the preview version of the 8-step inference model [SenseNova-U1-8B-MoT-8step-preview](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-8step-preview). In most cases, the image generation quality of this model closely matches that of the base model (see [comparison and existing issues](docs/base_vs_distill.md)). To test this model, you can use the [inference scripts](examples/README.md), but with the following parameters: `--cfg_scale 1.0 --num_steps 8`.
- `[2026.04.27]` Initial release of the weights for [SenseNova-U1-8B-MoT-SFT](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-SFT) and [SenseNova-U1-8B-MoT](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT).
- `[2026.04.27]` Initial release of the [inference code](https://github.com/OpenSenseNova/SenseNova-U1/blob/main/examples/README.md) for SenseNova-U1.
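For the 8-step preview model mentioned in the `[2026.04.30]` entry, the only documented changes to the usual invocation are `--cfg_scale 1.0 --num_steps 8`. A minimal sketch of applying those overrides to an existing argument list follows; `with_8step_settings` is a hypothetical helper, and it assumes every flag in the list is followed by its value. Pointing `--model_path` at the 8-step checkpoint still has to be done separately.

```python
def with_8step_settings(argv):
    """Return a copy of argv with --cfg_scale 1.0 and --num_steps 8 applied."""
    overrides = {"--cfg_scale": "1.0", "--num_steps": "8"}
    result = list(argv)  # leave the caller's list untouched
    for flag, value in overrides.items():
        if flag in result:
            # Replace the value that follows the existing flag.
            result[result.index(flag) + 1] = value
        else:
            # Flag was not given; append it with the 8-step value.
            result += [flag, value]
    return result


if __name__ == "__main__":
    base = [
        "python", "examples/t2i/inference.py",
        "--model_path", "sensenova/SenseNova-U1-8B-MoT-8step-preview",
        "--cfg_scale", "4.0",
        "--num_steps", "50",
    ]
    print(with_8step_settings(base))
```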
## 🌟 Overview
🚀 **SenseNova U1** is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture.
It marks a fundamental paradigm shift in multimodal AI: **from modality integration to true unification**. Rather than relying on adapters to translate between modalities, SenseNova U1 models think-and-act across language and vision natively.
Unifying visual understanding and generation in an end-to-end architecture from pixel to word opens tremendous possibilities, enabling highly efficient and strong understanding, generation, and interleaved reasoning in a natively multimodal manner.
<p align="center">
<img src="docs/assets/teaser_1.webp" alt="radar plot" width="900">
</p>
#### 🏗️ *Key Pillars:*
At the core of SenseNova U1 is **[NEO-unify](https://huggingface.co/blog/sensenova/neo-unify)**, a novel architecture designed from first principles for multimodal AI: *it eliminates both the Visual Encoder (VE) and the Variational Auto-Encoder (VAE) and treats pixel and word information as inherently and deeply correlated.* Its key features are as follows:
- 🔗 Model language and visual information end-to-end as a unified compound.
- 🖼️ Preserve semantic richness while maintaining pixel-level visual fidelity.
- 🧠 Reason across modalities with high efficiency & minimal conflict via native MoTs.
#### ✨ *What This Unlocks:*
Powered by this new core architecture, SenseNova U1 delivers exceptional efficiency in multimodal learning:
<p align="center">
<img src="docs/assets/perform_vs_speed_5bench.webp" width="48%" />
<img src="docs/assets/perform_vs_speed_infobench.webp" width="48%" />
</p>
<p align="center">
<sub>
Left: generation latency vs. average performance on OneIG (EN, ZH), LongText (EN, ZH), BizGenEval (Easy, Hard), CVTG, and IGenBench. <br>
Right: generation latency vs. average performance on infographic benchmarks, i.e., BizGenEval (Easy, Hard) and IGenBench.
</sub>
</p>
- 🏆 **Open-source SoTA in both understanding and generation**: SenseNova U1 sets a new standard for unified multimodal understanding and generation, achieving state-of-the-art performance among open-source models across a wide range of understanding, reasoning, and generation benchmarks.
- 📖 **Native interleaved image-text generation**: SenseNova U1 can generate coherent interleaved text and images in a single flow with one model, enabling use cases such as practical guides and travel diaries that combine clear communication with vivid storytelling and transform complex information into intuitive visuals.
- 📰 **High-density information rendering**: SenseNova U1 demonstrates strong capabilities in dense visual communication, generating richly structured layouts for knowledge illustrations, posters, presentations, comics, resumes, and other information-rich formats.
#### 🌍 *Beyond Multimodality:*
- 🤖 Vision–Language–Action (VLA)
- 🌐 World Modeling (WM)
## 🦁 Models
In this release, we are open-sourcing the SenseNova U1 Lite series in two sizes:
- SenseNova-U1-8B-MoT: dense backbone
- SenseNova-U1-A3B-MoT: MoE backbone
| Model | Params | HF Weights |
| :---- | :------- | :--------- |
| SenseNova-U1-8B-MoT-Infographic | 8B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-Infographic) |
| SenseNova-U1-8B-MoT-SFT | 8B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-SFT) |
| SenseNova-U1-8B-MoT | 8B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT) |
| SenseNova-U1-8B-MoT-LoRA-8step-V1.0 | 0.4B | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-LoRAs/blob/main/SenseNova-U1-8B-MoT-LoRA-8step-V1.0.safetensors) |
| SenseNova-U1-A3B-MoT-SFT | A3B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT-SFT) |
| SenseNova-U1-A3B-MoT | A3B MoT | [🤗 link](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT) |
Here **SFT models** (*×32 downsampling ratio*) are trained via Understanding Warmup, Generation Pre-training, Unified Mid-training, and Unified SFT, with **final models** obtained after an initial round of T2I RL training.
Although relatively compact by today’s standards, these models already show strong performance across diverse tasks, comparable to commercial models with excellent cost efficiency. Notably, larger-scale versions are planned to further enhance capability and performance in the future.
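To make the ×32 downsampling ratio concrete, the sketch below works out the spatial grid for the 2720×1536 text-to-image example used elsewhere in this commit. It assumes that ×32 means each grid position covers a 32×32 pixel patch and that both sides should be multiples of 32; neither assumption is confirmed by the source, and `token_grid` is a hypothetical helper.

```python
def token_grid(width, height, ratio=32):
    """Return (columns, rows) of the grid obtained by dividing each side by ratio."""
    if width % ratio or height % ratio:
        raise ValueError(f"{width}x{height} is not a multiple of {ratio}")
    return width // ratio, height // ratio


if __name__ == "__main__":
    cols, rows = token_grid(2720, 1536)
    # 2720 / 32 = 85 columns, 1536 / 32 = 48 rows, 85 * 48 = 4080 positions
    print(cols, rows, cols * rows)
```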
> 💡 The `8B-MoT` in `SenseNova-U1-8B-MoT` refers to ~8B understanding parameters **and** ~8B generation parameters. See [parameter breakdown](docs/parameter_breakdown.md) for details.
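
As a back-of-the-envelope check on what the note above implies for memory, the sketch below estimates the weight footprint of roughly 8B understanding plus 8B generation parameters stored in BF16 (2 bytes per parameter). The parameter counts are the approximate figures from the note; the estimate covers weights only and ignores activations, caches, and framework overhead.

```python
BYTES_PER_BF16 = 2  # bfloat16 uses 16 bits per value


def weight_gib(num_params, bytes_per_param=BYTES_PER_BF16):
    """Approximate size of the weights in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    understanding = 8e9  # approximate, from the note above
    generation = 8e9     # approximate, from the note above
    print(f"{weight_gib(understanding + generation):.1f} GiB")  # about 29.8 GiB
```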
## 📋 ToDo List
- [ ] Training code of SenseNova-U1
- [x] Final weights and technical report of SenseNova-U1
## 🎨 Showcases
<details>
<summary>🖼️ Text-to-Image (General)</summary>
| | | |
| :---: | :---: | :---: |
| [<img width="300" alt="t2i general dense face hd 07" src="./docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
| [<img width="260" alt="t2i general face hd 13" src="./docs/assets/showcases/t2i_general/1_1_face_hd_13.webp">](./docs/assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./docs/assets/showcases/t2i_general/1_1_face_hd_17.webp">](./docs/assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
| [<img width="260" alt="t2i general landscape 06" src="./docs/assets/showcases/t2i_general/1_1_landscape_06.webp">](./docs/assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./docs/assets/showcases/t2i_general/1_1_landscape_07.webp">](./docs/assets/showcases/t2i_general/1_1_landscape_07.webp) |
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./docs/assets/showcases/t2i_general/9_16_human_pose_11.webp">](./docs/assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./docs/assets/showcases/t2i_general/9_16_artistic_07.webp">](./docs/assets/showcases/t2i_general/9_16_artistic_07.webp) |
| [<img width="200" alt="t2i general portrait text rendering 02" src="./docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
</details>
<details>
<summary>🖼️ Text-to-Image (Reasoning)</summary>
<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 50%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">A male peacock trying to attract a female</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a single male peacock. The action involves the male peacock trying to attract a female. 2. <b>Reasoning Process:</b> Male peacocks perform this ritual by fanning their tail feathers into a massive, colorful circular display. 3. <b>Establish the frame:</b> The frame captures the male peacock in a full-body shot to show the entire display. The angle is eye-level to emphasize the height of the tail feathers. 4. <b>Set the lighting and color:</b> Natural lighting illuminates the colorful circular display with soft shadows. The palette features vibrant blues and greens on the tail feathers. 5. <b>Lock the style:</b> The style is a realistic photograph with sharp focus on the male peacock. The finish is natural without artistic filters. 6. <b>Explicit Prompt:</b> A vibrant male peacock with his tail feathers fully fanned out in a wide, iridescent display of blues and greens, standing in a garden.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/4_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A small piece of dry wood and a dense iron block are in a transparent water tank</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subjects are a small piece of dry wood and a dense iron block positioned within a transparent water tank. The wood is floating on the surface while the iron block is submerged at the bottom. 2. <b>Reasoning Process:</b> The wood is less dense than water so it will float, while the iron is denser and will sink to the bottom. 3. <b>Establish the frame:</b> The composition is a vertical medium shot centering the rectangular tank within the frame. The camera angle is eye-level to clearly display the water line and the submerged base. Focus is sharp across the entire depth of the tank to ensure both materials are distinct. 4. <b>Build the environment:</b> The scene is contained entirely within the clear glass walls of the water tank. The water fills the majority of the volume, providing a medium for the floating wood and sunken iron block. The background remains out of focus to keep attention on the tank's interior. 5. <b>Set the lighting and color:</b> Soft natural light illuminates the scene from the left, creating gentle reflections on the water surface. The color palette features the brown grain of the wood contrasting against the dark grey metallic finish of the iron. Shadows are soft and diffused through the liquid. 6. <b>Explicit Prompt:</b> A realistic photo of a transparent water tank showing a piece of wood floating on the surface and an iron block resting at the bottom.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/5_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
</details>
<details>
<summary>🖼️ Text-to-Image (Infographics)</summary>
<table align="center">
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0004.webp"><img width="300" alt="t2i landscape 0001" src="./docs/assets/showcases/t2i_infographic/0004.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0012.webp"><img width="300" alt="t2i landscape 0002" src="./docs/assets/showcases/t2i_infographic/0012.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0005.webp"><img width="300" alt="t2i landscape 0003" src="./docs/assets/showcases/t2i_infographic/0005.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0018.webp"><img width="300" alt="t2i landscape 0004" src="./docs/assets/showcases/t2i_infographic/0018.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0024.webp"><img width="300" alt="t2i landscape 0005" src="./docs/assets/showcases/t2i_infographic/0024.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0019.webp"><img width="300" alt="t2i landscape 0006" src="./docs/assets/showcases/t2i_infographic/0019.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0006.webp"><img width="300" alt="t2i landscape 0007" src="./docs/assets/showcases/t2i_infographic/0006.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0015.webp"><img width="300" alt="t2i landscape 0008" src="./docs/assets/showcases/t2i_infographic/0015.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0025.webp"><img width="300" alt="t2i landscape 0009" src="./docs/assets/showcases/t2i_infographic/0025.webp"></a></td>
</tr>
</table>
<table align="center">
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0000.webp"><img width="220" alt="t2i landscape 0010" src="./docs/assets/showcases/t2i_infographic/0000.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0003.webp"><img width="220" alt="t2i landscape 0011" src="./docs/assets/showcases/t2i_infographic/0003.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0001.webp"><img width="220" alt="t2i landscape 0012" src="./docs/assets/showcases/t2i_infographic/0001.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0022.webp"><img width="220" alt="t2i landscape 0012" src="./docs/assets/showcases/t2i_infographic/0022.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0016.webp"><img width="220" alt="t2i image 0022" src="./docs/assets/showcases/t2i_infographic/0016.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0010.webp"><img width="220" alt="t2i image 0020" src="./docs/assets/showcases/t2i_infographic/0010.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0007.webp"><img width="220" alt="t2i image 0021" src="./docs/assets/showcases/t2i_infographic/0007.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0021.webp"><img width="220" alt="t2i image 0023" src="./docs/assets/showcases/t2i_infographic/0021.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0014.webp"><img width="220" alt="t2i image 0024" src="./docs/assets/showcases/t2i_infographic/0014.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0028.webp"><img width="220" alt="t2i image 0025" src="./docs/assets/showcases/t2i_infographic/0028.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0033.webp"><img width="220" alt="t2i image 0026" src="./docs/assets/showcases/t2i_infographic/0033.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0002.webp"><img width="220" alt="t2i image 0027" src="./docs/assets/showcases/t2i_infographic/0002.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0031.webp"><img width="230" alt="t2i image 0028" src="./docs/assets/showcases/t2i_infographic/0031.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0030.webp"><img width="230" alt="t2i image 0029" src="./docs/assets/showcases/t2i_infographic/0030.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0032.webp"><img width="230" alt="t2i image 0030" src="./docs/assets/showcases/t2i_infographic/0032.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0029.webp"><img width="230" alt="t2i image 0031" src="./docs/assets/showcases/t2i_infographic/0029.webp"></a></td>
</tr>
</table>
</details>
> 📸 **More generation samples:** see [Image Generation Gallery](./docs/showcases.md#text-to-image).
<details>
<summary>✏️ Image Editing (General)</summary>
| | |
| :---: | :---: |
| <div align="center"><a href="./examples/editing/data/images/1.webp"><img width="150" alt="editing input 1" src="./examples/editing/data/images/1.webp"></a> <a href="./docs/assets/showcases/editing/1_out.webp"><img width="150" alt="editing output 1" src="./docs/assets/showcases/editing/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <div align="center"><a href="./examples/editing/data/images/3.webp"><img width="150" alt="editing input 3" src="./examples/editing/data/images/3.webp"></a> <a href="./docs/assets/showcases/editing/3_out.webp"><img width="150" alt="editing output 3" src="./docs/assets/showcases/editing/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <div align="center"><a href="./examples/editing/data/images/2.webp"><img width="150" alt="editing input 2" src="./examples/editing/data/images/2.webp"></a> <a href="./docs/assets/showcases/editing/2_out.webp"><img width="150" alt="editing output 2" src="./docs/assets/showcases/editing/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <div align="center"><a href="./examples/editing/data/images/4.webp"><img width="150" alt="editing input 4" src="./examples/editing/data/images/4.webp"></a> <a href="./docs/assets/showcases/editing/4_out.webp"><img width="150" alt="editing output 4" src="./docs/assets/showcases/editing/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <div align="center"><a href="./examples/editing/data/images/8.webp"><img width="150" alt="editing input 8" src="./examples/editing/data/images/8.webp"></a> <a href="./docs/assets/showcases/editing/8_out.webp"><img width="150" alt="editing output 8" src="./docs/assets/showcases/editing/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> | <div align="center"><a href="./examples/editing/data/images/6.webp"><img width="150" alt="editing input 6" src="./examples/editing/data/images/6.webp"></a> <a href="./docs/assets/showcases/editing/6_out.webp"><img width="150" alt="editing output 6" src="./docs/assets/showcases/editing/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> |
</details>
<details>
<summary>✏️ Image Editing (Reasoning)</summary>
<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 30%">Original Image</th>
<th style="width: 20%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like one hour later.</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/034_temporal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like immediately after someone stands up from sitting on it for a long time.</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag’s shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/036_causal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Change the water to high-concentration saltwater</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/042_physics_change_the_water_to_high-con_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">What the fruit looks like when ripe in the picture</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/044_biology_what_the_fruit_looks_like_wh_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
</details>
> 📸 **More editing samples:** see [Image Editing Gallery](./docs/showcases.md#image-editing).
<details>
<summary>♻️ Interleaved Generation (General)</summary>
| |
| :---: |
| [<img alt="interleave case 05" src="./docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
| [<img alt="interleave case 06" src="./docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
</details>
<details>
<summary>♻️ Interleaved Generation (Reasoning)</summary>
| |
| :---: |
| [<img alt="interleave case 05" src="./docs/assets/showcases/interleave/reasoning.png">](./docs/assets/showcases/interleave/reasoning.png) |
</details>
> 📸 **More interleaved samples:** see [Interleaved Generation Gallery](./docs/showcases.md#interleaved-generation).
<details>
<summary>📝 Visual Understanding (General)</summary>
| |
| :---: |
| [<img alt="vqa general cases" src="./docs/assets/showcases/vqa/general_case.webp">](./docs/assets/showcases/vqa/general_case.webp) |
</details>
<details>
<summary>📝 Visual Understanding (Agentic)</summary>
| |
| :---: |
| [<img alt="vqa agentic case" src="./docs/assets/showcases/vqa/agentic_case.webp">](./docs/assets/showcases/vqa/agentic_case.webp) |
</details>
> 📸 **More understanding samples:** see [Visual Understanding Gallery](./docs/showcases.md#visual-understanding).
<details>
<summary>🦾 Visual-Language-Action</summary>
[![YouTube](./docs/assets/showcases/vla/1.png)](https://www.youtube.com/watch?v=3mvBPPgv8vo)
[![YouTube](./docs/assets/showcases/vla/2.png)](https://www.youtube.com/watch?v=2QZY8gf0Vsk)
[![YouTube](./docs/assets/showcases/vla/3.png)](https://www.youtube.com/watch?v=tznVbuYf0yw)
</details>
<details>
<summary>🦾 World Modeling</summary>
| |
| :---: |
| [<img alt="world modeling case" src="./docs/assets/showcases/wm/1.png">](./docs/assets/showcases/wm/1.png) |
</details>
## 📊 Key Benchmarks
<details>
<summary>📝 Visual Understanding</summary>
<p align="center">
<img src="docs/assets/benchmarks/understanding.webp" alt="Understanding Benchmarks">
</p>
</details>
<details>
<summary>🖼️ Visual Generation</summary>
<p align="center">
<img src="docs/assets/benchmarks/generation.webp" alt="Generation Benchmarks">
</p>
</details>
<details>
<summary>♻️ Visual Reasoning</summary>
<p align="center">
<img src="docs/assets/benchmarks/interleaved.webp" alt="Interleaved Benchmarks">
</p>
</details>
> Evaluation scripts and benchmark reproduction guides are available in [`evaluation`](./evaluation/README.md).
## ⚠️ Ongoing Improvements
Despite strong performance across tasks, several limitations remain:
* **Visual Understanding**:
The current model only supports a context length of up to **32K** tokens, which may constrain performance in scenarios requiring longer or more complex visual contexts.
* **Human-centric Generation**:
Fine-grained details of human bodies can be challenging, especially when people appear as small elements within a scene or are engaged in complex interactions with surrounding objects.
* **Text-based Generation**:
Text rendering may sometimes produce misspellings, distorted characters, or formatting inconsistencies. These errors are sensitive to how prompts are phrased, especially in text-heavy scenarios (see [`prompt enhancement`](./docs/prompt_enhancement.md) for best practices).
* **Interleaved Generation**:
  * As an experimental feature, interleaved generation is still evolving and may not yet match the performance of dedicated text-to-image (T2I) pipelines.
  * **Beta status:** RL has not been specifically optimized for visual editing, reasoning, and interleaved tasks, so current performance on these tasks is comparable to that of the SFT models.
We view these areas as active directions and expect continued improvements in future iterations.
## 🛠️ Quick Start
### 🌐 Use with SenseNova-Studio
The fastest way to experience SenseNova-U1 is through **[SenseNova-Studio](https://unify.light-ai.top/)** — a 🆓 free online playground where you can try the model directly in your browser, no installation or GPU required.
> **Note:** To serve more users, U1-Fast has undergone step and CFG distillation, and is dedicated to infographic generation.
### 🦞 Use with SenseNova-Skills (OpenClaw)
The easiest way to integrate SenseNova-U1 into your own agent or application is through our companion repository **[SenseNova-Skills (OpenClaw) 🦞](https://github.com/OpenSenseNova/SenseNova-Skills)**, which ships SenseNova-U1 as a ready-to-use skill with a unified tool-calling interface.
> Refer to the [SenseNova-Skills README](https://github.com/OpenSenseNova/SenseNova-Skills) for installation and usage details.
<details>
<summary>✨ Some interesting cases produced through our Skills and Studio</summary>
<p align="center">
<img src="docs/assets/showcases/t2i_infographic/u1-case2.webp" alt="Skill Cases">
</p>
</details>
### 🤗 Run with transformers (Default)
> **Setup:** Follow the [Installation Guide](./docs/installation.md) to clone the repo and install dependencies with [uv](https://github.com/astral-sh/uv).
<details open>
<summary>📝 Visual Understanding</summary>
```bash
python examples/vqa/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --image examples/vqa/data/images/menu.jpg --question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." --output outputs/answer.txt --max_new_tokens 8192 --do_sample --temperature 0.6 --top_p 0.95 --top_k 20 --repetition_penalty 1.05 --profile
```
</details>
> See [`examples/README.md`](./examples/README.md#visual-understanding-vqa) for batched inference, generation parameters, and JSONL format.
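
The sampling flags in the visual-understanding command above (`--temperature`, `--top_k`, `--top_p`) follow common decoding conventions. The sketch below is a generic illustration of how such filters are usually composed; it is not the code used by `examples/vqa/inference.py`, the function name `filtered_distribution` is hypothetical, and the repetition penalty is left out for brevity.

```python
import math


def filtered_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return token probabilities after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the ids of the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]

    # Top-p: keep the smallest prefix of the kept tokens whose
    # renormalized cumulative mass reaches top_p.
    kept_mass = sum(probs[i] for i in kept)
    nucleus, mass = set(), 0.0
    for i in kept:
        nucleus.add(i)
        mass += probs[i] / kept_mass
        if mass >= top_p:
            break

    # Renormalize over the survivors; every other token gets probability 0.
    norm = sum(probs[i] for i in nucleus)
    return [probs[i] / norm if i in nucleus else 0.0 for i in range(len(probs))]


# Toy vocabulary of four tokens, using the flag values from the command above.
print(filtered_distribution([2.0, 1.0, 0.0, -1.0]))
```

A lower temperature sharpens the distribution before the cut-offs are applied, so the same `top_p` keeps fewer candidates.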
<details open>
<summary>🖼️ Text-to-Image</summary>
```bash
python examples/t2i/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型规格”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-8B-MoT”,下方是等宽字体说明“8B MoT 密集主干模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-A3B-MoT”,下方是等宽字体说明“A3B MoT 混合专家(MoE)主干模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,性价比出色”。" --width 2720 --height 1536 --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output.png --profile
```
</details>
> Default resolution is 2048×2048 (1:1). See [supported resolution buckets](./examples/README.md#supported-resolution-buckets) for other aspect ratios.
> For high-quality infographic generation, it is recommended to apply [prompt enhancement](./docs/prompt_enhancement.md) before generating images.
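
To see how a non-square size such as the `--width 2720 --height 1536` used in the text-to-image command relates to the 2048×2048 default, the sketch below keeps the pixel count roughly constant and snaps each side to a multiple of 32. Both the snapping rule and the helper name `resolution_for_aspect` are assumptions made for illustration; the supported resolution buckets linked above remain the authoritative list.

```python
import math


def resolution_for_aspect(ratio_w, ratio_h, base=2048, multiple=32):
    """Pick (width, height) with roughly base*base pixels at the given aspect ratio."""
    area = base * base
    # Ideal (unrounded) side lengths that keep the pixel count constant.
    width = math.sqrt(area * ratio_w / ratio_h)
    height = math.sqrt(area * ratio_h / ratio_w)

    def snap(value):
        # Round to the nearest multiple (32 is an assumption, not a documented rule).
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


print(resolution_for_aspect(1, 1))   # (2048, 2048)
print(resolution_for_aspect(16, 9))  # (2720, 1536)
```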
<details open>
<summary>✏️ Image Editing</summary>
```bash
python examples/editing/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "Change the animal's fur color to a darker shade." --image examples/editing/data/images/1.webp --cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output_edited.png --profile --compare
```
</details>
> 💡 For best quality, pre-resize inputs to ~2048×2048 resolution while keeping the original aspect ratio before inference (see [`examples/editing/resize_inputs.py`](./examples/editing/resize_inputs.py)).
<details open>
<summary>♻️ Interleaved Generation</summary>
```bash
python examples/interleave/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." --resolution "16:9" --output_dir outputs/interleave/ --stem demo --profile
```
</details>
> See [`examples/README.md`](./examples/README.md) for batched inference, JSONL format, prompt enhancement, resolution buckets, and full flag reference.
> See [`docs/gpu_mem_profiler.md`](./docs/gpu_mem_profiler.md) for GPU memory profiler.
### 💾 Memory-efficient inference (GGUF + VRAM modes)
For users running on a single consumer GPU, two complementary features lower the VRAM footprint of the `transformers` path. They can be combined freely.
#### GGUF quantized checkpoints
Pass `--gguf_checkpoint` to any of the four inference scripts (`t2i`, `editing`, `interleave`, `vqa`) to load a quantized `.gguf` file via the `diffusers` GGUF Linear layer instead of the bf16 safetensors weights. The base `--model_path` is still required (for tokenizer / config / non-LM weights).
```bash
# install the optional extra once
uv pip install -e ".[gguf]" # or: pip install "gguf>=0.10.0" "diffusers>=0.30.0"
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--gguf_checkpoint /path/to/SenseNova-U1-8B-MoT-Merger-Q4_K_M.gguf \
--prompt "A male peacock trying to attract a female" \
--output output.png
```
GGUF weights for `SenseNova-U1-8B-MoT-Merger` (multiple quant levels: Q3 / Q4 / Q5 / Q6 / Q8) are available at:
| Quantized weights | HF link |
| :---------------- | :------ |
| SenseNova-U1-8B-MoT-Merger-gguf | [🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf](https://huggingface.co/smthem/SenseNova-U1-8B-MoT-Merger-gguf) |
> 🙏 Thanks to GitHub user [@smthem](https://github.com/smthem) for contributing the quantized GGUF weights to the community.
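
As a rough guide to what the quant levels mean for memory, the sketch below estimates weight size from the bits-per-weight of the underlying GGUF block formats. The parameter count in the example is a placeholder rather than the measured size of the Merger checkpoint, and mixed presets such as `Q4_K_M` store some tensors at higher precision, so real files come out somewhat larger than the pure-format estimate.

```python
# Bits per weight of common GGUF block formats: (bytes per block * 8) / weights per block.
BITS_PER_WEIGHT = {
    "Q3_K": 110 * 8 / 256,  # 3.4375
    "Q4_K": 144 * 8 / 256,  # 4.5
    "Q5_K": 176 * 8 / 256,  # 5.5
    "Q6_K": 210 * 8 / 256,  # 6.5625
    "Q8_0": 34 * 8 / 32,    # 8.5
    "BF16": 16.0,
}


def estimate_weight_gib(num_params, fmt):
    """Approximate storage for num_params weights held entirely in one format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


# Hypothetical parameter count, chosen only to make the comparison concrete.
EXAMPLE_PARAMS = 8e9
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>5}: {estimate_weight_gib(EXAMPLE_PARAMS, fmt):6.2f} GiB")
```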
#### `--vram_mode`: single-GPU layer offload
Pass `--vram_mode` to keep the language-model layers resident in pinned CPU memory and stream them onto the GPU on demand during the forward pass, freeing weight VRAM while keeping activations on-device.
| Mode | Behavior | When to use |
| :--- | :--- | :--- |
| `full` *(default)* | No offload; whole model on GPU | Plenty of VRAM, best speed |
| `low` | Synchronous per-layer CPU↔GPU swap | Lowest VRAM footprint |
| `balanced` | Async prefetch overlaps H2D copy with compute | Tight on VRAM but want to recover speed |
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--vram_mode balanced \
--prompt "..." --output output.png
```
`--gguf_checkpoint` and `--vram_mode` compose: a Q4 GGUF + `balanced` is the recommended setup for ~10–12 GB consumer cards.
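
The difference between `low` and `balanced` can be illustrated with a simplified timing model. This is a back-of-the-envelope sketch, not the repository's scheduler: it assumes every layer has the same host-to-device copy time and compute time, that only one layer is prefetched ahead, and the numbers in the example are made up.

```python
def sync_swap_time(num_layers, copy_ms, compute_ms):
    """`low`-style schedule: copy a layer, run it, then copy the next one."""
    return num_layers * (copy_ms + compute_ms)


def prefetch_time(num_layers, copy_ms, compute_ms):
    """`balanced`-style schedule: copy layer i+1 while layer i is computing."""
    if num_layers == 0:
        return 0
    # The first copy cannot be hidden, and the last compute has nothing to overlap with.
    return copy_ms + (num_layers - 1) * max(copy_ms, compute_ms) + compute_ms


# Hypothetical figures: 32 layers, 4 ms per copy, 5 ms per layer of compute.
print(sync_swap_time(32, 4, 5))  # 288
print(prefetch_time(32, 4, 5))   # 164
```

In this model, prefetching hides the transfer almost entirely when the copy is faster than the compute; when the copy is slower, the run becomes bound by transfer time instead.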
### ⚡ Run with LightLLM + LightX2V (Recommended)
For production serving, we co-design a dedicated inference stack on top of **[LightLLM](https://github.com/ModelTC/lightllm)** (understanding) and **[LightX2V](https://github.com/ModelTC/lightx2v)** (generation). The two engines are disaggregated so that each path can use its own parallelism and resource budget, with a low-overhead transfer channel in between.
On a single node with `TP2 + CFG2`, this stack delivers roughly **0.15 s/step** and **~9 s end-to-end** for a **2048×2048** image on H100 / H200, with a **~2.4–3.2×** prefill speedup from our FA3-based hybrid-mask attention over the Triton baseline. Full per-GPU performance numbers are reported in [`docs/inference_infra.md`](./docs/inference_infra.md).
An official docker image is provided for one-command deployment:
```bash
docker pull lightx2v/lightllm_lightx2v:20260407
```
> ⚙️ **Deployment guide (Docker, launch flags, modes, quantization, API test):** see [`docs/deployment.md`](./docs/deployment.md).
>
> 📖 **Full design and performance profiling:** see [`docs/inference_infra.md`](./docs/inference_infra.md).
## 🌐 Join the Community!
Join our growing community to share feedback, get support, and stay updated on the latest SenseNova-U1 developments — we'd love to hear from you!
<div align="center">
<table>
<tr>
<td align="center"><b><a href="https://discord.com/invite/BuTXPHmQub">Discord</a></b></td>
<td align="center"><b>WeChat Group</b></td>
</tr>
<tr>
<td align="center"><a href="https://discord.com/invite/BuTXPHmQub"><img src="docs/assets/discord_qr.webp" width="160"/></a></td>
<td align="center"><img src="docs/assets/wechat_qr.webp" width="160"/></td>
</tr>
</table>
</div>
## ✒️ Citation
If this project is helpful for your research, please consider giving it a **star** ⭐ and a **citation** 📝:
```bibtex
@misc{sensenova2026neounify,
title = {NEO-unify: Building Native Multimodal Unified Models End to End},
author = {SenseNova},
journal = {Hugging Face blog},
url = {https://huggingface.co/blog/sensenova/neo-unify},
year = {2026}
}
@article{sensenova2026sensenovau1,
title = {SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture},
author = {Diao, Haiwen and Wu, Penghao and Deng, Hanming and Wang, Jiahao and Bai, Shihao and Wu, Silei and Fan, Weichen and Ye, Wenjie and Tong, Wenwen and Fan, Xiangyu and others},
journal = {arXiv preprint arXiv:2605.12500},
year = {2026}
}
```
## ⚖️ License
This project is released under the [Apache 2.0 License](./LICENSE).
# SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
<p align="center">
<a href="./README.md">English</a> | <strong>简体中文</strong>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2605.12500"><img src="https://img.shields.io/badge/arXiv-2605.12500-b31b1b.svg" alt="arXiv"></a>
<a href="https://huggingface.co/collections/sensenova/sensenova-u1"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow" alt="HuggingFace Model"></a>
<a href="https://modelscope.cn/collections/SenseNova/SenseNova-U1"><img src="https://img.shields.io/badge/%F0%9F%A4%96%20ModelScope-模型-purple" alt="ModelScope-模型"></a>
<a href="https://unify.light-ai.top/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20SenseNova_U1-Demo-Green" alt="SenseNova-U1 Demo"></a>
<a href="./LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License"></a>
<a href="https://discord.gg/cxkwXWjp"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
</p>
<p align="center">
<img src="docs/assets/teaser.webp" alt="SenseNova-U1" width="900">
</p>
<p align="center">
<img src="docs/assets/teaser_2.webp" alt="visualization" width="900">
</p>
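## 📣 News
- `[2026.05.15]` Released the [SenseNova-U1-8B-MoT-Infographic 📊](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-Infographic) model, which improves infographic generation. See [U1 Infographic Model](docs/u1_infographic_model_CN.md) for model details and [✨ Infographic Showcases ](docs/u1_infographic_showcases.md) for 100 generated examples.
- `[2026.05.10]` Released the [🔥SenseNova-U1 technical report🔥](https://github.com/OpenSenseNova/SenseNova-U1/blob/main/docs/pdf/SenseNOVA_U1.pdf) and open-sourced the [SenseNova-U1-A3B-MoT-SFT](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT-SFT) and [SenseNova-U1-A3B-MoT](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT) model weights.
- `[2026.05.08]` Added **GGUF quantized weight support** and a **layer-wise loading VRAM mode** for inference on a single low-VRAM GPU; see [Memory-efficient inference (GGUF + VRAM modes)](#-低显存推理gguf--vram-模式). GGUF weights for `SenseNova-U1-8B-MoT-Merger` have been uploaded to [🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf](https://huggingface.co/smthem/SenseNova-U1-8B-MoT-Merger-gguf). Special thanks to [@smthem](https://github.com/smthem) for contributing the quantized weights to the community.
- `[2026.05.06]` Released [SenseNova-U1-8B-MoT-LoRA-8step-V1.0](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-LoRAs/blob/main/SenseNova-U1-8B-MoT-LoRA-8step-V1.0.safetensors). See the [inference example script](docs/base_vs_distill.md#run-base-and-distilled-model).
- `[2026.04.30]` Released a preview of the 8-step inference model, [SenseNova-U1-8B-MoT-8step-preview](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-8step-preview). In most cases its image generation quality is very close to that of the base model (see the [comparison and known issues](docs/base_vs_distill.md)). To test it, follow the [inference scripts](examples/README.md) but replace the following arguments: ```--cfg_scale 1.0 --num_steps 8```
- `[2026.04.27]` Initial release of the [SenseNova-U1-8B-MoT-SFT](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-SFT) and [SenseNova-U1-8B-MoT](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT) model weights.
- `[2026.04.27]` Initial release of the SenseNova-U1 [inference code](https://github.com/OpenSenseNova/SenseNova-U1/blob/main/examples/README_CN.md).
## 🌟 Overview
🚀 **SenseNova U1** is a new generation of native multimodal models that unifies multimodal understanding, reasoning, and generation within a single architecture.
It marks a fundamental paradigm shift in multimodal AI: **from modality integration to true unification**. Rather than relying on adapters to translate between modalities, SenseNova U1 thinks and acts natively across language and vision.
Unifying visual understanding and generation opens up enormous possibilities. SenseNova U1 builds on the **data-driven learning stage** (e.g., ChatGPT) and points toward the next stage, the **agentic learning stage** (e.g., OpenClaw), in which the model learns, thinks, and acts in a natively multimodal way.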
<p align="center">
<img src="docs/assets/teaser_1.webp" alt="radar plot" width="900">
</p>
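#### 🏗️ *Key Pillar:*
At the core of SenseNova U1 is **[NEO-unify](https://huggingface.co/blog/sensenova/neo-unify)**, a new architecture designed for multimodal AI from first principles: *it eliminates both the visual encoder (VE) and the variational autoencoder (VAE), because pixel and text information are inherently deeply correlated.* Its main features are:
- 🔗 Models language and visual information end to end as a unified whole.
- 🖼️ Maintains pixel-level visual fidelity while retaining semantic richness.
- 🧠 Performs cross-modal reasoning through native MoT, with high efficiency and few conflicts.
#### ✨ *Capability Breakthroughs:*
Built on this new core architecture, SenseNova U1 shows remarkable efficiency in multimodal learning: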
<p align="center">
<img src="docs/assets/perform_vs_speed_5bench.webp" width="48%" />
<img src="docs/assets/perform_vs_speed_infobench.webp" width="48%" />
</p>
<p align="center">
<sub>
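Left: generation latency versus average performance on OneIG (EN, ZH), LongText (EN, ZH), CVTG, BizGenEval (Easy, Hard), and IGenBench.<br>
Right: generation latency versus average performance on the infographic benchmarks (BizGenEval (Easy, Hard), IGenBench).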
</sub>
</p>
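- 🏆 **Open-source SoTA in both understanding and generation**: SenseNova U1 sets a new bar for unified multimodal understanding and generation, reaching state-of-the-art results among open-source models across a range of understanding, reasoning, and generation benchmarks, on par with commercial models.
- 📖 **Native interleaved image-text generation**: SenseNova U1 can produce coherent interleaved image-text content with a single model in a single generation pass. It supports scenarios such as life guides and travel diaries that call for both clear communication and narrative expressiveness, condensing complex information into intuitive illustrations.
- 📰 **High-density information rendering**: SenseNova U1 shows strong capability in expressing dense visual information and can generate richly structured content with complex layouts, suited to information-heavy scenarios such as knowledge illustrations, posters, slides, comics, and résumés.
#### 🌍 *Beyond Multimodality:*
- 🤖 Vision-Language-Action (VLA)
- 🌐 World Modeling (WM)
## 🦁 Models
In this release, we open-source the SenseNova U1 Lite series in two sizes:
- SenseNova U1-8B-MoT: dense backbone
- SenseNova U1-A3B-MoT: MoE backbone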
| 模型 | 参数量 | HF 权重 |
| :---- | :------- | :--------- |
| SenseNova-U1-8B-MoT-Infographic | 8B MoT | [🤗 链接](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-Infographic) |
| SenseNova-U1-8B-MoT-SFT | 8B MoT | [🤗 链接](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-SFT) |
| SenseNova-U1-8B-MoT | 8B MoT | [🤗 链接](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT) |
| SenseNova-U1-8B-MoT-LoRA-8step-V1.0 | 0.4B | [🤗 链接](https://huggingface.co/sensenova/SenseNova-U1-8B-MoT-LoRAs/blob/main/SenseNova-U1-8B-MoT-LoRA-8step-V1.0.safetensors) |
| SenseNova-U1-A3B-MoT-SFT | A3B MoT | [🤗 链接](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT-SFT) |
| SenseNova-U1-A3B-MoT | A3B MoT | [🤗 链接](https://huggingface.co/sensenova/SenseNova-U1-A3B-MoT) |
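The **SFT models** (*×32 downsampling ratio*) are trained in four stages: (1) *understanding warm-up*, (2) *generation pre-training*, (3) *unified mid-training*, and (4) *unified supervised fine-tuning*. The **final models** are obtained by running one round of T2I reinforcement learning (RL) on top of the base model.
These models are currently relatively compact, yet they already perform strongly across a variety of tasks, comparable to commercial models and with excellent cost-effectiveness. Larger versions are planned to further improve capability.
> 💡 The `8B-MoT` in `SenseNova-U1-8B-MoT` refers to ~8B understanding parameters **plus** ~8B generation parameters. See the [parameter breakdown](docs/parameter_breakdown_CN.md) for the detailed per-group figures.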
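
As a rough sanity check on what the `8B-MoT` naming implies for memory, the sketch below estimates the size of the weights alone. It assumes exactly 8e9 parameters per branch and 2 bytes per parameter (bf16). Both numbers are round approximations, the helper name `weight_gib` is made up for illustration, and the authoritative counts are in the parameter breakdown document.

```python
def weight_gib(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GiB (weights only, no activations)."""
    return num_params * bytes_per_param / 2**30


# Assumed round numbers: ~8B understanding + ~8B generation parameters.
understanding_params = 8e9
generation_params = 8e9

total = weight_gib(understanding_params + generation_params)
print(f"bf16 weights, both branches: ~{total:.1f} GiB")
```

Under these assumptions the weights come to roughly 30 GiB before any activations or caches, which is why the low-VRAM options matter on consumer GPUs.
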
## 📋 Roadmap
- [ ] SenseNova-U1 training code
- [x] SenseNova-U1 final weights and technical report
## 🎨 Showcases
<details>
<summary>🖼️ Text-to-Image (General)</summary>
| | | |
| :---: | :---: | :---: |
| [<img width="300" alt="t2i general dense face hd 07" src="./docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
| [<img width="260" alt="t2i general face hd 13" src="./docs/assets/showcases/t2i_general/1_1_face_hd_13.webp">](./docs/assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./docs/assets/showcases/t2i_general/1_1_face_hd_17.webp">](./docs/assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
| [<img width="260" alt="t2i general landscape 06" src="./docs/assets/showcases/t2i_general/1_1_landscape_06.webp">](./docs/assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./docs/assets/showcases/t2i_general/1_1_landscape_07.webp">](./docs/assets/showcases/t2i_general/1_1_landscape_07.webp) |
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./docs/assets/showcases/t2i_general/9_16_human_pose_11.webp">](./docs/assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./docs/assets/showcases/t2i_general/9_16_artistic_07.webp">](./docs/assets/showcases/t2i_general/9_16_artistic_07.webp) |
| [<img width="200" alt="t2i general portrait text rendering 02" src="./docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
</details>
<details>
<summary>🖼️ Text-to-Image (Reasoning)</summary>
<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 50%">Reasoning Process</th>
<th style="width: 30%">Generated Image</th>
</tr>
<tr>
<td style="vertical-align: top;">A male peacock trying to attract a female</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a single male peacock. The action involves the male peacock trying to attract a female. 2. <b>Reasoning Process:</b> Male peacocks perform this ritual by fanning their tail feathers into a massive, colorful circular display. 3. <b>Establish the frame:</b> The frame captures the male peacock in a full-body shot to show the entire display. The angle is eye-level to emphasize the height of the tail feathers. 4. <b>Set the lighting and color:</b> Natural lighting illuminates the colorful circular display with soft shadows. The palette features vibrant blues and greens on the tail feathers. 5. <b>Lock the style:</b> The style is a realistic photograph with sharp focus on the male peacock. The finish is natural without artistic filters. 6. <b>Explicit Prompt:</b> A vibrant male peacock with his tail feathers fully fanned out in a wide, iridescent display of blues and greens, standing in a garden.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/4_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A small piece of dry wood and a dense iron block are in a transparent water tank</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subjects are a small piece of dry wood and a dense iron block positioned within a transparent water tank. The wood is floating on the surface while the iron block is submerged at the bottom. 2. <b>Reasoning Process:</b> The wood is less dense than water so it will float, while the iron is denser and will sink to the bottom. 3. <b>Establish the frame:</b> The composition is a vertical medium shot centering the rectangular tank within the frame. The camera angle is eye-level to clearly display the water line and the submerged base. Focus is sharp across the entire depth of the tank to ensure both materials are distinct. 4. <b>Build the environment:</b> The scene is contained entirely within the clear glass walls of the water tank. The water fills the majority of the volume, providing a medium for the floating wood and sunken iron block. The background remains out of focus to keep attention on the tank's interior. 5. <b>Set the lighting and color:</b> Soft natural light illuminates the scene from the left, creating gentle reflections on the water surface. The color palette features the brown grain of the wood contrasting against the dark grey metallic finish of the iron. Shadows are soft and diffused through the liquid. 6. <b>Explicit Prompt:</b> A realistic photo of a transparent water tank showing a piece of wood floating on the surface and an iron block resting at the bottom.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/5_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
</details>
<details>
<summary>🖼️ Text-to-Image (Infographic)</summary>
<table align="center">
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0004.webp"><img width="300" alt="t2i landscape 0001" src="./docs/assets/showcases/t2i_infographic/0004.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0012.webp"><img width="300" alt="t2i landscape 0002" src="./docs/assets/showcases/t2i_infographic/0012.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0005.webp"><img width="300" alt="t2i landscape 0003" src="./docs/assets/showcases/t2i_infographic/0005.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0018.webp"><img width="300" alt="t2i landscape 0004" src="./docs/assets/showcases/t2i_infographic/0018.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0024.webp"><img width="300" alt="t2i landscape 0005" src="./docs/assets/showcases/t2i_infographic/0024.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0019.webp"><img width="300" alt="t2i landscape 0006" src="./docs/assets/showcases/t2i_infographic/0019.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0006.webp"><img width="300" alt="t2i landscape 0007" src="./docs/assets/showcases/t2i_infographic/0006.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0015.webp"><img width="300" alt="t2i landscape 0008" src="./docs/assets/showcases/t2i_infographic/0015.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0025.webp"><img width="300" alt="t2i landscape 0009" src="./docs/assets/showcases/t2i_infographic/0025.webp"></a></td>
</tr>
</table>
<table align="center">
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0000.webp"><img width="220" alt="t2i landscape 0010" src="./docs/assets/showcases/t2i_infographic/0000.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0003.webp"><img width="220" alt="t2i landscape 0011" src="./docs/assets/showcases/t2i_infographic/0003.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0001.webp"><img width="220" alt="t2i landscape 0012" src="./docs/assets/showcases/t2i_infographic/0001.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0022.webp"><img width="220" alt="t2i landscape 0012" src="./docs/assets/showcases/t2i_infographic/0022.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0016.webp"><img width="220" alt="t2i image 0022" src="./docs/assets/showcases/t2i_infographic/0016.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0010.webp"><img width="220" alt="t2i image 0020" src="./docs/assets/showcases/t2i_infographic/0010.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0007.webp"><img width="220" alt="t2i image 0021" src="./docs/assets/showcases/t2i_infographic/0007.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0021.webp"><img width="220" alt="t2i image 0023" src="./docs/assets/showcases/t2i_infographic/0021.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0014.webp"><img width="220" alt="t2i image 0024" src="./docs/assets/showcases/t2i_infographic/0014.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0028.webp"><img width="220" alt="t2i image 0025" src="./docs/assets/showcases/t2i_infographic/0028.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0033.webp"><img width="220" alt="t2i image 0026" src="./docs/assets/showcases/t2i_infographic/0033.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0002.webp"><img width="220" alt="t2i image 0027" src="./docs/assets/showcases/t2i_infographic/0002.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0031.webp"><img width="230" alt="t2i image 0028" src="./docs/assets/showcases/t2i_infographic/0031.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0030.webp"><img width="230" alt="t2i image 0029" src="./docs/assets/showcases/t2i_infographic/0030.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0032.webp"><img width="230" alt="t2i image 0030" src="./docs/assets/showcases/t2i_infographic/0032.webp"></a></td>
<td align="center"><a href="./docs/assets/showcases/t2i_infographic/0029.webp"><img width="230" alt="t2i image 0031" src="./docs/assets/showcases/t2i_infographic/0029.webp"></a></td>
</tr>
</table>
</details>
> 📸 **More generation samples:** see the [text-to-image gallery](./docs/showcases_CN.md#文生图).
<details>
<summary>✏️ Image Editing (General)</summary>
| | |
| :---: | :---: |
| <div align="center"><a href="./examples/editing/data/images/1.webp"><img width="150" alt="editing input 1" src="./examples/editing/data/images/1.webp"></a> <a href="./docs/assets/showcases/editing/1_out.webp"><img width="150" alt="editing output 1" src="./docs/assets/showcases/editing/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <div align="center"><a href="./examples/editing/data/images/3.webp"><img width="150" alt="editing input 3" src="./examples/editing/data/images/3.webp"></a> <a href="./docs/assets/showcases/editing/3_out.webp"><img width="150" alt="editing output 3" src="./docs/assets/showcases/editing/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <div align="center"><a href="./examples/editing/data/images/2.webp"><img width="150" alt="editing input 2" src="./examples/editing/data/images/2.webp"></a> <a href="./docs/assets/showcases/editing/2_out.webp"><img width="150" alt="editing output 2" src="./docs/assets/showcases/editing/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <div align="center"><a href="./examples/editing/data/images/4.webp"><img width="150" alt="editing input 4" src="./examples/editing/data/images/4.webp"></a> <a href="./docs/assets/showcases/editing/4_out.webp"><img width="150" alt="editing output 4" src="./docs/assets/showcases/editing/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <div align="center"><a href="./examples/editing/data/images/8.webp"><img width="150" alt="editing input 8" src="./examples/editing/data/images/8.webp"></a> <a href="./docs/assets/showcases/editing/8_out.webp"><img width="150" alt="editing output 8" src="./docs/assets/showcases/editing/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> | <div align="center"><a href="./examples/editing/data/images/6.webp"><img width="150" alt="editing input 6" src="./examples/editing/data/images/6.webp"></a> <a href="./docs/assets/showcases/editing/6_out.webp"><img width="150" alt="editing output 6" src="./docs/assets/showcases/editing/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> |
</details>
<details>
<summary>✏️ Image Editing (Reasoning)</summary>
<table>
<tr>
<th style="width: 20%">Edit Instruction</th>
<th style="width: 30%">Source Image</th>
<th style="width: 20%">Reasoning Process</th>
<th style="width: 30%">Edited Result</th>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like one hour later.</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/034_temporal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like immediately after someone stands up from sitting on it for a long time.</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag's shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/036_causal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Change the water to high-concentration saltwater</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/042_physics_change_the_water_to_high-con_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">What the fruit looks like when ripe in the picture</td>
<td style="vertical-align: top;"><img src="./examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
<td style="vertical-align: top;"><img src="./docs/assets/showcases/editing/044_biology_what_the_fruit_looks_like_wh_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
</details>
> 📸 **More editing samples:** see the [image editing gallery](./docs/showcases_CN.md#图像编辑).
<details>
<summary>♻️ Interleaved Image-Text Generation (General)</summary>
| |
| :---: |
| [<img alt="interleave case 05" src="./docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
| [<img alt="interleave case 06" src="./docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
</details>
<details>
<summary>♻️ Interleaved Image-Text Generation (Reasoning)</summary>
| |
| :---: |
| [<img alt="interleave reasoning case" src="./docs/assets/showcases/interleave/reasoning.png">](./docs/assets/showcases/interleave/reasoning.png) |
</details>
> 📸 **More interleaved samples:** see the [interleaved generation gallery](./docs/showcases_CN.md#图文交错生成).
<details>
<summary>📝 Visual Understanding (General)</summary>
| |
| :---: |
| [<img alt="vqa general cases" src="./docs/assets/showcases/vqa/general_case.webp">](./docs/assets/showcases/vqa/general_case.webp) |
</details>
<details>
<summary>📝 Visual Understanding (Agentic)</summary>
| |
| :---: |
| [<img alt="vqa agentic case" src="./docs/assets/showcases/vqa/agentic_case.webp">](./docs/assets/showcases/vqa/agentic_case.webp) |
</details>
> 📸 **More visual understanding samples:** see the [visual understanding gallery](./docs/showcases_CN.md#视觉理解).
<details>
<summary>🦾 Vision-Language-Action</summary>
[![YouTube](./docs/assets/showcases/vla/1.png)](https://www.youtube.com/watch?v=3mvBPPgv8vo)
[![YouTube](./docs/assets/showcases/vla/2.png)](https://www.youtube.com/watch?v=2QZY8gf0Vsk)
[![YouTube](./docs/assets/showcases/vla/3.png)](https://www.youtube.com/watch?v=tznVbuYf0yw)
</details>
<details>
<summary>🦾 World Modeling</summary>
| |
| :---: |
| [<img alt="world modeling case" src="./docs/assets/showcases/wm/1.png">](./docs/assets/showcases/wm/1.png) |
</details>
## 📊 Core Benchmarks
<details>
<summary>📝 Visual Understanding</summary>
<p align="center">
<img src="docs/assets/benchmarks/understanding.webp" alt="Understanding Benchmarks">
</p>
</details>
<details>
<summary>🖼️ Visual Generation</summary>
<p align="center">
<img src="docs/assets/benchmarks/generation.webp" alt="Generation Benchmarks">
</p>
</details>
<details>
<summary>♻️ Visual Reasoning</summary>
<p align="center">
<img src="docs/assets/benchmarks/interleaved.webp" alt="Interleaved Benchmarks">
</p>
</details>
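> Evaluation scripts and a guide to reproducing the leaderboard results are provided in [`evaluation`](./evaluation/README_CN.md).
## ⚠️ Ongoing Improvements
Despite strong results across tasks, the current release has several known limitations that we are working on:
* **Visual understanding**
The current models support a context length of up to **32K** tokens, which can be limiting in scenarios that need longer or more complex visual context.
* **Human body generation**
Fine-grained human details remain challenging, especially when people occupy a small part of the frame or interact with surrounding objects in complex ways.
* **Text generation**
Text rendering sometimes produces misspellings, distorted characters, or inconsistent formatting, and it is sensitive to prompt wording, particularly in text-dense scenes. (For best practices, see [`Prompt Enhancement`](./docs/prompt_enhancement_CN.md).)
* **Interleaved image-text generation**
* As an experimental feature, interleaved generation is still evolving, and its quality may not yet match that of a dedicated text-to-image (T2I) pipeline.
* **Beta status:** reinforcement learning has not yet been specifically optimized for image editing, reasoning, or interleaved tasks, so current performance on these tasks is on par with the SFT models.
We treat these areas as priorities for continued iteration and expect to improve them in future releases.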
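## 🛠️ Quick Start
### 🌐 Using SenseNova-Studio
The easiest way to try SenseNova-U1 is **[SenseNova-Studio](https://unify.light-ai.top/)**, a 🆓 free online playground that runs directly in the browser, with no installation and no GPU required.
> **Note:** To serve more users, U1-Fast is step-distilled and CFG-distilled, and is dedicated to infographic generation.
### 🦞 Using SenseNova-Skills (OpenClaw)
The simplest way to integrate SenseNova-U1 into your own agent or application is the companion repository **[SenseNova-Skills (OpenClaw) 🦞](https://github.com/OpenSenseNova/SenseNova-Skills)**, which wraps SenseNova-U1 as ready-to-use skills and exposes a unified tool-calling interface.
> For installation and usage details, see the [SenseNova-Skills README](https://github.com/OpenSenseNova/SenseNova-Skills).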
<details>
<summary>✨ Fun cases made with our Skills and Studio</summary>
<p align="center">
<img src="docs/assets/showcases/t2i_infographic/u1-case2.webp" alt="Skill Cases">
</p>
</details>
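### 🤗 Running with transformers
> **Environment setup:** follow the [installation guide](./docs/installation_CN.md) to clone the repository and install dependencies with [uv](https://github.com/astral-sh/uv).
<details open>
<summary>📝 Visual Understanding</summary>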
```bash
python examples/vqa/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --image examples/vqa/data/images/menu.jpg --question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." --output outputs/answer.txt --max_new_tokens 8192 --do_sample --temperature 0.6 --top_p 0.95 --top_k 20 --repetition_penalty 1.05 --profile
```
</details>
> For batch inference, generation parameters, and the JSONL format, see [`examples/README_CN.md`](./examples/README_CN.md#视觉理解vqa).
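
The visual understanding command above passes `--temperature`, `--top_k`, and `--top_p`. The sketch below shows how these three controls are conventionally combined when sampling a single token, using a toy logit vector. It is a generic illustration of the technique, not the repository's decoding code: the function name `sample_token` and the toy logits are made up, and `--repetition_penalty` is left out.

```python
import math
import random


def sample_token(logits, temperature=0.6, top_k=20, top_p=0.95, rng=random):
    """Sample one token index using temperature, top-k, then top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # top-k: keep only the k most likely tokens, then renormalise.
    probs = probs[:top_k]
    mass = sum(p for p, _ in probs)
    probs = [(p / mass, i) for p, i in probs]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Draw one token from the surviving candidates, weighted by probability.
    weights = [p for p, _ in kept]
    indices = [i for _, i in kept]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(sample_token(toy_logits, rng=random.Random(0)))
```

A lower temperature sharpens the distribution before filtering, so with the values used in the command (0.6, 20, 0.95) only a handful of high-probability tokens usually survive.
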
<details open>
<summary>🖼️ Text-to-Image</summary>
```bash
python examples/t2i/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型规格”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-8B-MoT”,下方是等宽字体说明“8B MoT 密集主干模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-A3B-MoT”,下方是等宽字体说明“A3B MoT 混合专家(MoE)主干模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,性价比出色”。" --width 2720 --height 1536 --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output.png --profile
```
</details>
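> The default resolution is 2048×2048 (1:1). For other aspect ratios, see the [supported resolution buckets](./examples/README_CN.md#推荐分辨率档位).
> For infographic generation, apply [prompt enhancement](./docs/prompt_enhancement_CN.md) first for best results.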
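
To see how a non-square size such as the 2720×1536 used in the command above relates to the 2048×2048 default, the sketch below picks a width and height that keep roughly the same pixel count for a given aspect ratio. Rounding each side to a multiple of 32 is an assumption suggested by the ×32 downsampling ratio mentioned in the model zoo, and `size_for_aspect` is a made-up helper. The officially supported buckets are the ones listed in `examples/README_CN.md`, so treat this only as an approximation.

```python
import math


def size_for_aspect(ratio_w: int, ratio_h: int, base: int = 2048,
                    multiple: int = 32) -> tuple[int, int]:
    """Pick (width, height) with about base*base pixels for the given aspect ratio.

    Both sides are rounded to the nearest multiple of `multiple` (assumed: 32).
    """
    area = base * base
    width = math.sqrt(area * ratio_w / ratio_h)
    height = area / width

    def snap(value: float) -> int:
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


print(size_for_aspect(1, 1))   # (2048, 2048)
print(size_for_aspect(16, 9))  # (2720, 1536)
```

For 16:9 this reproduces the `--width 2720 --height 1536` pair from the example command. Other ratios may not match the official buckets exactly.
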
<details open>
<summary>✏️ Image Editing</summary>
```bash
python examples/editing/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "Change the animal's fur color to a darker shade." --image examples/editing/data/images/1.webp --cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 --output output_edited.png --profile --compare
```
</details>
> 💡 For best results, pre-scale inputs to roughly 2048×2048 resolution, preserving the original aspect ratio, before running inference (see [`examples/editing/resize_inputs.py`](./examples/editing/resize_inputs.py)).
<details open>
<summary>♻️ Interleaved Image-Text Generation</summary>
```bash
python examples/interleave/inference.py --model_path sensenova/SenseNova-U1-8B-MoT --prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." --resolution "16:9" --output_dir outputs/interleave/ --stem demo --profile
```
</details>
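> For batch inference, the JSONL format, prompt enhancement, resolution buckets, and the full list of arguments, see [`examples/README_CN.md`](./examples/README_CN.md).
> For VRAM profiling, see [`Performance Profiling`](./docs/gpu_mem_profiler_CN.md).
### 💾 Low-VRAM Inference (GGUF + VRAM mode)
For deployment on a single consumer GPU, the `transformers` path offers two low-VRAM features that can be enabled independently or combined.
#### GGUF quantized weights
Pass `--gguf_checkpoint` to any of the four inference scripts (`t2i`, `editing`, `interleave`, `vqa`) to load quantized `.gguf` weights through the `diffusers` GGUF Linear layers in place of the original bf16 safetensors weights. `--model_path` is still required (it is used to load the tokenizer / config and the non-language-model weights).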
```bash
# Install the optional dependencies once
uv pip install -e ".[gguf]" # or: pip install "gguf>=0.10.0" "diffusers>=0.30.0"
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--gguf_checkpoint /path/to/SenseNova-U1-8B-MoT-Merger-Q4_K_M.gguf \
--prompt "A male peacock trying to attract a female" \
--output output.png
```
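GGUF weights for `SenseNova-U1-8B-MoT-Merger` (available in several quantization levels, including Q3 / Q4 / Q5 / Q6 / Q8):
| Quantized weights | HF link |
| :------- | :------ |
| SenseNova-U1-8B-MoT-Merger-gguf | [🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf](https://huggingface.co/smthem/SenseNova-U1-8B-MoT-Merger-gguf) |
> 🙏 Special thanks to GitHub user [@smthem](https://github.com/smthem) for contributing the GGUF quantized weights to the community.
#### `--vram_mode`: single-GPU layer-wise offloading
`--vram_mode` keeps the language model's layers resident in CPU pinned memory and streams each one to the GPU on demand during the forward pass. This substantially reduces the VRAM taken by weights, while activations stay on the GPU.
| Mode | Behavior | When to use |
| :--- | :--- | :--- |
| `full` (default) | No offloading; the whole model stays on the GPU | Ample VRAM, maximum speed |
| `low` | Synchronous layer-by-layer CPU↔GPU swapping | Tightest VRAM budget |
| `balanced` | Asynchronous prefetch that overlaps H2D copies with compute | Tight VRAM, but some speed should be recovered |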
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--vram_mode balanced \
--prompt "..." --output output.png
```
`--gguf_checkpoint` and `--vram_mode` can be combined: on a ~16 GB consumer GPU, the `Q4 GGUF + balanced` combination is recommended.
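
The difference between the `low` and `balanced` modes can be illustrated with a small CPU-only simulation. The sketch below is not the repository's implementation and touches no GPU: `load_layer` and `compute` are hypothetical stand-ins that just sleep, and a background thread plays the role of the asynchronous host-to-device copy. It only shows why prefetching the next layer while the current one is computing recovers time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_layer(i: int) -> str:
    """Stand-in for a host-to-device copy of layer i's weights."""
    time.sleep(0.01)
    return f"layer-{i}"


def compute(layer: str, x: int) -> int:
    """Stand-in for running one layer's forward pass."""
    time.sleep(0.01)
    return x + 1


def forward_low(num_layers: int, x: int) -> int:
    """Synchronous swapping: copy a layer, run it, then copy the next."""
    for i in range(num_layers):
        x = compute(load_layer(i), x)
    return x


def forward_balanced(num_layers: int, x: int) -> int:
    """Prefetch layer i+1 in the background while layer i is computing."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            layer = pending.result()  # wait for the current layer's copy
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)  # start the next copy
            x = compute(layer, x)  # overlaps with the copy started above
    return x


for fn in (forward_low, forward_balanced):
    start = time.perf_counter()
    out = fn(8, 0)
    print(fn.__name__, out, f"{time.perf_counter() - start:.3f}s")
```

Both functions produce the same result. The prefetching variant finishes sooner because each copy is hidden behind the previous layer's compute, at the cost of holding two layers at once.
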
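### ⚡ Running with LightLLM + LightX2V
For production deployment, we co-designed a dedicated inference stack on top of **[LightLLM](https://github.com/ModelTC/lightllm)** (understanding) and **[LightX2V](https://github.com/ModelTC/lightx2v)** (generation). The two engines run decoupled, so each can use its own parallelism strategy and resource quota, and they are connected by a low-overhead transfer channel.
With a single-node `TP2 + CFG2` configuration on H100 / H200, the stack delivers roughly **0.15 s/step** and **9 s end-to-end** for **2048×2048** images. Compared with a Triton baseline, our FA3-based hybrid-mask attention gives a ~**2.4–3.2×** prefill speedup. Full single-GPU performance figures are in [`docs/inference_infra_CN.md`](./docs/inference_infra_CN.md).
We provide an official Docker image, so deployment takes a single command: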
```bash
docker pull lightx2v/lightllm_lightx2v:20260407
```
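> ⚙️ **Deployment guide (Docker, launch arguments, modes, quantization, API testing):** see [`docs/deployment_CN.md`](./docs/deployment_CN.md).
>
> 📖 **Full architecture design and performance analysis:** see [`docs/inference_infra_CN.md`](./docs/inference_infra_CN.md).
## 🌐 Join the Community!
Join our community to share feedback, get support, and hear about the latest SenseNova-U1 progress first. We look forward to talking with you!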
<div align="center">
<table>
<tr>
<td align="center"><b><a href="https://discord.gg/cxkwXWjp">Discord</a></b></td>
<td align="center"><b>WeChat Group</b></td>
</tr>
<tr>
<td align="center"><a href="https://discord.gg/cxkwXWjp"><img src="docs/assets/discord_qr.webp" width="160"/></a></td>
<td align="center"><img src="docs/assets/wechat_qr.webp" width="160"/></td>
</tr>
</table>
</div>
## ✒️ Citation
If this project helps your research, please consider starring the repository ⭐ and citing the papers 📝:
```bibtex
@misc{sensenova2026neounify,
title = {NEO-unify: Building Native Multimodal Unified Models End to End},
author = {SenseNova},
journal = {Hugging Face blog},
url = {https://huggingface.co/blog/sensenova/neo-unify},
year = {2026}
}
@article{sensenova2026sensenovau1,
title = {SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture},
author = {Diao, Haiwen and Wu, Penghao and Deng, Hanming and Wang, Jiahao and Bai, Shihao and Wu, Silei and Fan, Weichen and Ye, Wenjie and Tong, Wenwen and Fan, Xiangyu and others},
journal = {arXiv preprint arXiv:2605.12500},
year = {2026}
}
```
## ⚖️ License
This project is released under the [Apache 2.0 License](./LICENSE).
# Runs in the published mirror repo (OpenSenseNova/ComfyUI-SenseNova-U1) after
# the monorepo's `publish-comfyui.yml` pushes a fresh subtree + tag here.
# It does not run in the SenseNova-U1 monorepo because the trigger tag pattern
# `v*.*.*` is rewritten from the monorepo's `comfyui-v*.*.*` tags during sync.
name: Publish to Comfy registry
on:
  workflow_dispatch:
  push:
    tags:
      - "v*.*.*"
permissions:
  contents: read
  issues: write
jobs:
  publish-node:
    name: Publish Custom Node to registry
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v4
      - name: Publish Custom Node
        uses: Comfy-Org/publish-node-action@v1
        with:
          personal_access_token: ${{ secrets.REGISTRY_ACCESS_TOKEN }}
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# SenseNova-U1 for ComfyUI
ComfyUI custom nodes for SenseNova-U1 API and local inference.
> Source of truth lives in [`OpenSenseNova/SenseNova-U1`](https://github.com/OpenSenseNova/SenseNova-U1)
> under `apps/comfyui/`. The standalone repo
> [`OpenSenseNova/ComfyUI-SenseNova-U1`](https://github.com/OpenSenseNova/ComfyUI-SenseNova-U1)
> is a read-only publish mirror used by Comfy Registry; please open PRs against
> the monorepo.
> Requires a ComfyUI build that ships the v3 node API (`comfy_api.latest`). The nodes are registered through `comfy_entrypoint()`; older ComfyUI installs that only support the v1 `NODE_CLASS_MAPPINGS` registration will not load them.
## Nodes
- `SenseNova Image Generate`: calls the U1-Fast image API.
- `SenseNova Chat`, `SenseNova Vision URL`, `SenseNova Vision Image`: utility API nodes.
- `SenseNova Prompt Builder`: rewrites raw ideas into image-generation prompts.
- `SenseNova U1 Local Loader`: loads a local or HuggingFace SenseNova-U1 checkpoint.
- `SenseNova U1 Local Text to Image`: runs local `t2i_generate`.
- `SenseNova U1 Local Image Edit`: runs local `it2i_generate`.
- `SenseNova U1 Local Interleave`: runs local `interleave_gen`.
- `SenseNova Interleave Preview`: renders ordered interleaved text / image results.
## Install
### Recommended (end users): ComfyUI Manager / Comfy Registry
Search for **SenseNova-U1** in ComfyUI Manager, or:
```bash
comfy node install ComfyUI-SenseNova-U1
```
This pulls the latest published release from https://registry.comfy.org and
installs the declared dependencies (including the `sensenova-u1` Python package
needed for local inference) into ComfyUI's Python environment automatically.
Restart ComfyUI afterwards.
### Developer install (from the SenseNova-U1 monorepo)
If you're hacking on the nodes alongside the model source:
```bash
python apps/comfyui/install.py --comfyui /path/to/ComfyUI
python -m pip install -r apps/comfyui/requirements.txt --no-deps # skip the git-URL line
python -m pip install -e . # install sensenova-u1 from src/
```
`install.py` symlinks (or copies, with `--copy`) `apps/comfyui/` into
`<ComfyUI>/custom_nodes/ComfyUI-SenseNova-U1`. Restart ComfyUI after installation.
**Source path auto-discovery is location-bound to the symlink.** In the
default symlink mode, `local_pipeline.default_source_path()` resolves
`__file__` through the symlink and uses `<repo>/src/` if it sees the file
sitting under `apps/comfyui/` — no `SENSENOVA_U1_SRC` needed. If you move,
rename, or delete the monorepo checkout, the link breaks; re-run
`install.py` to recreate it. With `--copy`, the files no longer point back
to the repo, so set `SENSENOVA_U1_SRC=/path/to/SenseNova-U1/src` (or fill
the loader node's `sensenova_u1_src` input) yourself.
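The lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `local_pipeline.default_source_path()`: the function name `discover_source_path`, the `env` parameter, and the choice to try auto-discovery before the `SENSENOVA_U1_SRC` override are all hypothetical.

```python
import os
from pathlib import Path


def discover_source_path(module_file, env=None):
    """Return the monorepo's src/ directory, an env override, or None."""
    env = os.environ if env is None else env
    # Path.resolve() follows symlinks, so a symlinked custom node resolves
    # back to <repo>/apps/comfyui/<file>.
    parent = Path(module_file).resolve().parent
    if parent.name == "comfyui" and parent.parent.name == "apps":
        candidate = parent.parent.parent / "src"
        if candidate.is_dir():
            return candidate
    # Copied installs no longer point back to the repo: fall back to the env var.
    override = env.get("SENSENOVA_U1_SRC", "").strip()
    return Path(override) if override else None
```

With a `--copy` install the first branch never matches, which is why the environment variable (or the loader node's `sensenova_u1_src` input) has to be supplied by hand.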
## Workflows
Example workflows live in `example_workflows/`. Each links to a screenshot of the loaded graph in `docs/`:
| Workflow | Description | Preview |
| --- | --- | --- |
| `api_u1_fast_t2i.json` | API U1-Fast text-to-image | ![api_u1_fast_t2i](docs/api_u1_fast_t2i.jpg) |
| `local_t2i.json` | Local SenseNova-U1 text-to-image | ![t2i](docs/t2i.jpg) |
| `local_editing.json` | Local SenseNova-U1 image editing | ![editing](docs/editing.jpg) |
| `local_interleave.json` | Local SenseNova-U1 interleaved generation | ![interleave](docs/interleave.jpg) |
Drag a workflow JSON into ComfyUI, then update `model_path`, `device`, `device_map`, and prompt
settings as needed. For a smoke test, set `num_steps` to `1` or `2` before returning to the
recommended `50`.
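Before editing a workflow, it can help to list which nodes it contains and what their widget values are. The sketch below assumes only the JSON fields visible in the example workflows (`nodes`, `id`, `type`, `order`, `widgets_values`); `summarize_workflow` and the inline `sample` dict are hypothetical and stand in for a file loaded with `json.load`.

```python
def summarize_workflow(workflow: dict) -> list:
    """Return (id, type, widgets_values) for each node, in execution order."""
    nodes = sorted(workflow.get("nodes", []), key=lambda node: node.get("order", 0))
    return [(node["id"], node["type"], node.get("widgets_values", [])) for node in nodes]


# A trimmed-down stand-in for a workflow file such as local_editing.json.
sample = {
    "nodes": [
        {"id": 4, "type": "PreviewImage", "order": 3, "widgets_values": []},
        {"id": 1, "type": "SenseNovaU1LocalLoader", "order": 0,
         "widgets_values": ["sensenova/SenseNova-U1-8B-MoT", "", "cuda"]},
    ]
}

for node_id, node_type, widgets in summarize_workflow(sample):
    print(node_id, node_type, widgets)
```

The position of a given setting inside `widgets_values` depends on the node definition, so check it in the ComfyUI editor rather than relying on a fixed index.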
## API Environment
API nodes read credentials from environment variables or `.env`:
```bash
export SN_API_KEY="your-api-token"
export SN_BASE_URL="https://token.sensenova.cn/v1"
```
Tokens are not exposed as node inputs, so they are not saved into ComfyUI workflows.
## GGUF Quantized Checkpoints
The `SenseNova U1 Local Loader` exposes an optional `gguf_checkpoint` dropdown
populated from `<comfyui>/models/gguf/` and the stock ComfyUI
`<comfyui>/models/diffusion_models/` folder (the default location used by
ComfyUI-GGUF style distributions). When a file is selected, weights are loaded
through `diffusers`' GGUF quantizer (dequantizing `nn.Linear` -> `GGUFLinear`)
instead of safetensors; config and tokenizer still come from `model_path`. The
default empty selection keeps the safetensors path.
Drop your `.gguf` file into either folder and restart ComfyUI to refresh the
dropdown.
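As a rough picture of how the dropdown entries could be collected, the sketch below scans both folders for `.gguf` files. It is an assumption-laden illustration: `list_gguf_checkpoints` is a hypothetical helper, and inside ComfyUI the nodes rely on `folder_paths` rather than walking directories directly.

```python
from pathlib import Path


def list_gguf_checkpoints(comfyui_root):
    """Collect .gguf file names from models/gguf and models/diffusion_models."""
    root = Path(comfyui_root)
    names = set()
    for sub in ("gguf", "diffusion_models"):
        folder = root / "models" / sub
        if folder.is_dir():
            names.update(
                path.name
                for path in folder.iterdir()
                if path.is_file() and path.suffix.lower() == ".gguf"
            )
    # The leading empty entry mirrors the default "no GGUF" selection.
    return [""] + sorted(names)
```

Because the list is built once when the node is registered in this sketch, a newly dropped file only shows up after a restart, matching the behavior described above.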
Requirements: install the `gguf` extra in the ComfyUI Python environment, e.g.
```bash
python -m pip install -e ".[gguf]" # from this repo, or
python -m pip install "gguf>=0.10.0" "diffusers>=0.30.0"
```
`gguf_checkpoint` cannot be combined with a non-`none` `device_map` — pick one.
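The constraint above amounts to a simple pre-flight check. The helper below is a hypothetical sketch (the name `check_loader_options` and the error text are not from the project) that treats an empty `gguf_checkpoint` as "not selected" and `"none"` as the only `device_map` value compatible with GGUF loading.

```python
def check_loader_options(gguf_checkpoint: str, device_map: str) -> None:
    """Raise ValueError when a GGUF file is combined with a non-none device_map."""
    if gguf_checkpoint and device_map != "none":
        raise ValueError(
            "gguf_checkpoint cannot be combined with device_map="
            f"{device_map!r}; set device_map to 'none' or clear gguf_checkpoint."
        )
```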
## Notes On Samplers
Local U1 generation uses the sampling loop implemented by `t2i_generate`, `it2i_generate`, and
`interleave_gen`. It does not directly plug into ComfyUI's `KSampler` / latent model interface.
You can still reuse ComfyUI image IO and post-processing nodes around these U1 nodes.
# Register `<comfyui>/models/gguf` as a model folder before nodes.py is imported,
# so SenseNovaU1LocalLoader's `gguf_checkpoint` dropdown can be populated via
# folder_paths.get_filename_list("gguf"). Tolerant when folder_paths isn't
# importable (e.g. running tests outside ComfyUI).
try:
    import os as _os

    import folder_paths as _folder_paths

    _gguf_dir = _os.path.join(_folder_paths.models_dir, "gguf")
    _existing = _folder_paths.folder_names_and_paths.get("gguf")
    if _existing is None:
        _folder_paths.folder_names_and_paths["gguf"] = ([_gguf_dir], {".gguf"})
    else:
        _paths, _exts = _existing
        if _gguf_dir not in _paths:
            _paths.append(_gguf_dir)
        _exts.add(".gguf")
except Exception:  # pragma: no cover - non-ComfyUI env or registration race
    pass
try:
    from .nodes import comfy_entrypoint
except ImportError:  # pragma: no cover - supports direct pytest collection
    from nodes import comfy_entrypoint
# ComfyUI auto-loads every JS file under this directory as a frontend extension.
# Used to render `ui.text` produced by SenseNovaInterleavePreview, which the
# stock frontend does not display on the node itself.
WEB_DIRECTORY = "./web"
__all__ = ["comfy_entrypoint", "WEB_DIRECTORY"]
from __future__ import annotations
import json
import time
from dataclasses import dataclass
from typing import Any
import httpx
try:
    from .config import SenseNovaConfig, load_config
    from .image_utils import MAX_IMAGE_BYTES, is_http_url, is_supported_vision_image_url
except ImportError:  # pragma: no cover - supports direct test imports
    from config import SenseNovaConfig, load_config
    from image_utils import MAX_IMAGE_BYTES, is_http_url, is_supported_vision_image_url

CHAT_MODELS = ("sensenova-6.7-flash-lite", "deepseek-v4")
VISION_MODELS = ("sensenova-6.7-flash-lite",)
IMAGE_MODELS = ("sensenova-u1-fast",)
IMAGE_SIZES = (
    "2752x1536",
    "1536x2752",
    "2048x2048",
    "2496x1664",
    "1664x2496",
    "2368x1760",
    "1760x2368",
    "2272x1824",
    "1824x2272",
    "3072x1376",
    "1344x3136",
)
IMAGE_SIZE_OPTIONS = (
    "2752x1536|16:9",
    "1536x2752|9:16",
    "2048x2048|1:1",
    "2496x1664|3:2",
    "1664x2496|2:3",
    "2368x1760|4:3",
    "1760x2368|3:4",
    "2272x1824|5:4",
    "1824x2272|4:5",
    "3072x1376|21:9",
    "1344x3136|9:21",
)


@dataclass(frozen=True)
class ChatResult:
    text: str
    usage: dict[str, Any]
    raw: dict[str, Any]


@dataclass(frozen=True)
class ImageGenerationResult:
    image_base64: str
    image_url: str
    image_bytes: bytes
    raw: dict[str, Any]
class SenseNovaClient:
    def __init__(self, config: SenseNovaConfig):
        self.config = config

    @classmethod
    def from_env(cls) -> SenseNovaClient:
        return cls(load_config())

    def chat(
        self,
        *,
        text: str,
        system_prompt: str,
        model: str,
        temperature: float,
        top_p: float,
        max_tokens: int,
        timeout: int,
    ) -> ChatResult:
        if model not in CHAT_MODELS:
            raise RuntimeError(f"Unsupported chat model: {model}")
        if not text.strip():
            raise RuntimeError("Chat text cannot be empty.")
        payload: dict[str, Any] = {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text},
            ],
            "stream": False,
            "temperature": temperature,
            "top_p": top_p,
            "max_tokens": max_tokens,
        }
        raw = self._post_json("/chat/completions", payload, timeout=timeout)
        return ChatResult(text=_extract_chat_text(raw), usage=raw.get("usage", {}), raw=raw)

    def vision_chat(
        self,
        *,
        image_url: str,
        prompt: str,
        system_prompt: str,
        model: str,
        temperature: float,
        top_p: float,
        max_tokens: int,
        timeout: int,
    ) -> ChatResult:
        if model not in VISION_MODELS:
            raise RuntimeError(f"Unsupported vision model: {model}")
        if not prompt.strip():
            raise RuntimeError("Vision prompt cannot be empty.")
        if not is_supported_vision_image_url(image_url):
            raise RuntimeError("Vision image URL must be http(s) or a base64 image data URL.")
        payload: dict[str, Any] = {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": image_url}},
                    ],
                },
            ],
            "stream": False,
            "temperature": temperature,
            "top_p": top_p,
            "max_tokens": max_tokens,
        }
        raw = self._post_json("/chat/completions", payload, timeout=timeout)
        return ChatResult(text=_extract_chat_text(raw), usage=raw.get("usage", {}), raw=raw)

    def generate_image(
        self,
        *,
        prompt: str,
        model: str,
        size: str,
        timeout: int,
    ) -> ImageGenerationResult:
        if model not in IMAGE_MODELS:
            raise RuntimeError(f"Unsupported image model: {model}")
        normalized_size = normalize_image_size(size)
        if normalized_size not in IMAGE_SIZES:
            raise RuntimeError(f"Unsupported image size: {size}")
        if not prompt.strip():
            raise RuntimeError("Image prompt cannot be empty.")
        payload: dict[str, Any] = {
            "model": model,
            "prompt": prompt,
            "size": normalized_size,
            "n": 1,
        }
        raw = self._post_json("/images/generations", payload, timeout=timeout)
        image_base64, image_url = _extract_image_payload(raw)
        image_bytes = b""
        if image_base64:
            import base64

            try:
                from .image_utils import strip_data_url
            except ImportError:  # pragma: no cover - supports direct test imports
                from image_utils import strip_data_url
            image_bytes = base64.b64decode(strip_data_url(image_base64), validate=True)
        elif image_url:
            image_bytes = self.download_image(image_url, timeout=timeout)
        else:
            raise RuntimeError("Image response did not contain b64_json, base64, or url.")
        return ImageGenerationResult(
            image_base64=image_base64,
            image_url=image_url,
            image_bytes=image_bytes,
            raw=raw,
        )

    def download_image(self, url: str, *, timeout: int) -> bytes:
        if not is_http_url(url):
            raise RuntimeError("Image URL must use http or https.")
        try:
            with (
                httpx.Client(timeout=timeout, follow_redirects=True) as client,
                client.stream("GET", url) as response,
            ):
                response.raise_for_status()
                chunks: list[bytes] = []
                total = 0
                for chunk in response.iter_bytes():
                    total += len(chunk)
                    if total > MAX_IMAGE_BYTES:
                        raise RuntimeError("Downloaded image is larger than 50MB.")
                    chunks.append(chunk)
                return b"".join(chunks)
        except httpx.HTTPStatusError as exc:
            status_code = exc.response.status_code
            raise RuntimeError(f"Image download failed with HTTP {status_code}.") from exc
        except httpx.HTTPError as exc:
            raise RuntimeError(f"Image download failed: {exc.__class__.__name__}.") from exc

    def _post_json(self, path: str, payload: dict[str, Any], *, timeout: int) -> dict[str, Any]:
        url = f"{self.config.base_url}{path}"
        headers = {
            "Authorization": f"Bearer {self.config.api_key}",
            "Content-Type": "application/json",
        }
        last_error: Exception | None = None
        for attempt in range(3):
            try:
                with httpx.Client(timeout=timeout) as client:
                    response = client.post(url, headers=headers, json=payload)
                if response.status_code in {429, 500, 502, 503, 504} and attempt < 2:
                    time.sleep(2**attempt)
                    continue
                response.raise_for_status()
                return response.json()
            except httpx.HTTPStatusError as exc:
                status_code = exc.response.status_code
                if status_code in {429, 500, 502, 503, 504} and attempt < 2:
                    time.sleep(2**attempt)
                    last_error = exc
                    continue
                raise RuntimeError(_format_api_error(exc.response, self.config.api_key)) from exc
            except httpx.HTTPError as exc:
                if attempt < 2:
                    time.sleep(2**attempt)
                    last_error = exc
                    continue
                raise RuntimeError(f"SenseNova request failed: {exc.__class__.__name__}.") from exc
            except json.JSONDecodeError as exc:
                raise RuntimeError("SenseNova response was not valid JSON.") from exc
        raise RuntimeError(f"SenseNova request failed: {last_error.__class__.__name__}.")


def _extract_chat_text(raw: dict[str, Any]) -> str:
    try:
        return raw["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise RuntimeError("Chat response did not contain choices[0].message.content.") from exc


def normalize_image_size(size: str) -> str:
    return size.split("|", 1)[0].strip()


def _extract_image_payload(raw: dict[str, Any]) -> tuple[str, str]:
    try:
        first = raw["data"][0]
    except (KeyError, IndexError, TypeError) as exc:
        raise RuntimeError("Image response did not contain data[0].") from exc
    if not isinstance(first, dict):
        raise RuntimeError("Image response data[0] was not an object.")
    image_base64 = first.get("b64_json") or first.get("base64") or first.get("image_base64") or ""
    image_url = first.get("url") or ""
    return str(image_base64), str(image_url)


def _format_api_error(response: httpx.Response, api_key: str = "") -> str:
    message = ""
    try:
        body = response.json()
        message = body.get("error", {}).get("message") or body.get("message") or ""
    except Exception:
        message = response.text[:500]
    if message:
        return f"SenseNova API error HTTP {response.status_code}: {_redact(message, api_key)}"
    return f"SenseNova API error HTTP {response.status_code}."


def _redact(value: str, api_key: str = "") -> str:
    import re

    # Mask the token that follows "Bearer", not just the literal prefix.
    redacted = re.sub(r"Bearer\s+\S+", "Bearer [REDACTED]", value)
    if api_key:
        redacted = redacted.replace(api_key, "[REDACTED]")
    return redacted
from __future__ import annotations
import os
from dataclasses import dataclass
from dotenv import load_dotenv
DEFAULT_BASE_URL = "https://token.sensenova.cn/v1"
API_KEY_ENV = "SN_API_KEY"
BASE_URL_ENV = "SN_BASE_URL"
@dataclass(frozen=True)
class SenseNovaConfig:
    api_key: str
    base_url: str = DEFAULT_BASE_URL


def load_config(*, load_env_file: bool = True) -> SenseNovaConfig:
    if load_env_file:
        load_dotenv()
    api_key = os.getenv(API_KEY_ENV, "").strip()
    if not api_key:
        raise RuntimeError(f"Missing {API_KEY_ENV}. Set it in your environment or in a local .env file.")
    base_url = os.getenv(BASE_URL_ENV, DEFAULT_BASE_URL).strip().rstrip("/")
    if not base_url:
        base_url = DEFAULT_BASE_URL
    return SenseNovaConfig(api_key=api_key, base_url=base_url)
{
"id": "03c13bce-fb68-4284-bd5a-88aed2db6cec",
"revision": 0,
"last_node_id": 8,
"last_link_id": 7,
"nodes": [
{
"id": 3,
"type": "SenseNovaImageGenerate",
"pos": [
224.37622016118027,
254.75802785642676
],
"size": [
400,
240
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "prompt",
"type": "STRING",
"widget": {
"name": "prompt"
},
"link": 6
}
],
"outputs": [
{
"name": "images",
"type": "IMAGE",
"links": [
7
]
},
{
"name": "image_base64",
"type": "STRING",
"links": null
},
{
"name": "image_url",
"type": "STRING",
"links": null
},
{
"name": "raw_json",
"type": "STRING",
"links": []
},
{
"name": "image_info",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaImageGenerate"
},
"widgets_values": [
"",
"sensenova-u1-fast",
"2752x1536|16:9",
300
]
},
{
"id": 8,
"type": "PreviewImage",
"pos": [
663.0133442095159,
255.6695527433532
],
"size": [
140,
246
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 7
}
],
"outputs": [],
"properties": {
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 7,
"type": "SenseNovaPromptBuilder",
"pos": [
-201.9358334262903,
356.0050478984727
],
"size": [
400,
302
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "prompt",
"type": "STRING",
"links": [
6
]
},
{
"name": "usage_json",
"type": "STRING",
"links": null
},
{
"name": "raw_json",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaPromptBuilder"
},
"widgets_values": [
"如何养猫咪",
"You are a world-renowned \"Senior Visual Information Architect\" and \"AI Image Prompt Engineering Expert.\" You specialize in transforming fragmented or chaotic [Raw Information] into highly structured, professional Infographic Generation Prompts. Your work is defined by rigorous visual logic, precise spatial organization, and a dense amount of useful information.\n\n# Task\nReconstruct the user's [Raw Information] into a comprehensive visual synthesis prompt. Your objective is to guide an image generation model to render an information-dense infographic with advanced typography, vivid visual style, and clear structure based only on the user's text.\n\n# Step-by-Step Methodology\n1. Content Expansion and Textualization: Analyze the [Raw Information] to extract its core intent.\n - Detailing: Extract every entity, number, color, date, and phrase from the [Raw Information]. Do not omit provided facts.\n - Categorization: Define sub-categories with distinct visual markers.\n - Density Enrichment: If the input is brief, add professional annotations, sub-headings, body text, key insights, or practical notes that fit the topic.\n2. Adaptive Structural Analysis:\n - User-Defined Priority: If the user provides layout instructions, strictly follow them.\n - Logic-Driven Inference: If no layout is specified, infer whether the content is chronological, hierarchical, process-oriented, comparative, or modular, then choose the best spatial architecture.\n3. Style Tonal Setting: If no style is provided, assign an aesthetic that complements the topic, such as modern editorial infographic, technical blueprint, clean SaaS dashboard, or hand-drawn knowledge poster.\n4. Data Preservation and Encoding: Preserve all numbers, dates, colors, and proper nouns exactly. Convert them into explicit visual labels, charts, or callouts within the prompt.\n5. Language Parity: Detect the language of the [Raw Information] and use that language for the entire output. If the input is Chinese, output Chinese. If the input is English, output English. Do not mix languages.\n\n# Strict Constraints\n1. Do not include introductory, summary, or meta-commentary text. Start directly with the final visual prompt.\n2. Every piece of text intended to appear in the image must be enclosed in quotation marks.\n3. Do not use quotation marks for style descriptions, layout descriptions, colors, or non-textual elements.\n4. Describe the layout, reading order, background texture, visual hierarchy, typography, color palette, icons, charts, and callouts explicitly.\n5. Describe every icon semantically. Do not write generic phrases like \"an icon\" without specifying its visual content.\n6. Minimize arrows unless the user asks for them. Prefer alignment, grouping, proximity, and numbered steps to show relationships.\n7. Do not use hexadecimal color codes. Use descriptive color names.\n8. Do not invent factual numbers, dates, names, or claims that conflict with the user's input.\n\n# Output\nReturn only the final image generation prompt. The prompt should be directly usable as input to an image generation node.",
"sensenova-6.7-flash-lite",
0.3,
1,
2048,
120
]
}
],
"links": [
[
6,
7,
0,
3,
0,
"STRING"
],
[
7,
3,
0,
8,
0,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"workflowRendererVersion": "LG",
"ds": {
"scale": 1.8954988944389861,
"offset": [
329.6590232185344,
-134.93740305896515
]
},
"frontendVersion": "1.39.19"
},
"version": 0.4
}
{
"id": "8a36c97b-03db-45f4-9154-0e59b7f61366",
"revision": 0,
"last_node_id": 6,
"last_link_id": 4,
"nodes": [
{
"id": 1,
"type": "SenseNovaU1LocalLoader",
"pos": [
-622.3850180121219,
-19.147734766664918
],
"size": [
504,
432
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "u1_model",
"type": "SENSENOVA_U1_LOCAL_MODEL",
"links": [
1
]
},
{
"name": "model_info_json",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaU1LocalLoader"
},
"widgets_values": [
"sensenova/SenseNova-U1-8B-MoT",
"",
"cuda",
"bfloat16",
"auto",
"none",
"",
"full",
""
]
},
{
"id": 2,
"type": "LoadImage",
"pos": [
-620.497673660328,
480.7228565428773
],
"size": [
501.1875,
646.90625
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
2
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"1.webp",
"image"
]
},
{
"id": 3,
"type": "SenseNovaU1LocalImageEdit",
"pos": [
1.2299845460901224,
146.87837744855761
],
"size": [
528,
639.65625
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "u1_model",
"type": "SENSENOVA_U1_LOCAL_MODEL",
"link": 1
},
{
"name": "image",
"type": "IMAGE",
"link": 2
}
],
"outputs": [
{
"name": "images",
"type": "IMAGE",
"links": [
3
]
},
{
"name": "text",
"type": "STRING",
"links": null
},
{
"name": "think_text",
"type": "STRING",
"links": [
4
]
},
{
"name": "metadata_json",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaU1LocalImageEdit"
},
"widgets_values": [
"Change the jacket of the person on the left to bright yellow.",
true,
2048,
2048,
4.194304,
4,
1,
"none",
3,
0,
1,
50,
1,
42,
"fixed",
false
]
},
{
"id": 4,
"type": "PreviewImage",
"pos": [
663.2786584907477,
-37.692689060801655
],
"size": [
464.25,
746.078125
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 3
}
],
"outputs": [],
"properties": {
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 6,
"type": "PreviewAny",
"pos": [
666.6133339501529,
809.4684559313681
],
"size": [
464.015625,
208.015625
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 4
}
],
"outputs": [],
"properties": {
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
}
],
"links": [
[
1,
1,
0,
3,
0,
"SENSENOVA_U1_LOCAL_MODEL"
],
[
2,
2,
0,
3,
1,
"IMAGE"
],
[
3,
3,
0,
4,
0,
"IMAGE"
],
[
4,
3,
2,
6,
0,
"STRING"
]
],
"groups": [],
"config": {},
"extra": {
"workflowRendererVersion": "Vue",
"ds": {
"scale": 0.7149221062580672,
"offset": [
1225.7407146168464,
100.87314475139422
]
},
"frontendVersion": "1.39.19",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
{
"id": "a92af27a-0106-4c6f-9d1c-f9783b652f44",
"revision": 0,
"last_node_id": 7,
"last_link_id": 5,
"nodes": [
{
"id": 1,
"type": "SenseNovaU1LocalLoader",
"pos": [
-586.4999290408795,
148.5000079099019
],
"size": [
500.234375,
390.625
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "u1_model",
"type": "SENSENOVA_U1_LOCAL_MODEL",
"links": [
1
]
},
{
"name": "model_info_json",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaU1LocalLoader"
},
"widgets_values": [
"sensenova/SenseNova-U1-8B-MoT-Infographic",
"",
"cuda",
"bfloat16",
"auto",
"none",
"",
"full",
""
]
},
{
"id": 2,
"type": "SenseNovaU1LocalTextToImage",
"pos": [
-19.2498760630192,
149.75002977542277
],
"size": [
565.953125,
887.203125
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "u1_model",
"type": "SENSENOVA_U1_LOCAL_MODEL",
"link": 1
},
{
"name": "prompt",
"type": "STRING",
"widget": {
"name": "prompt"
},
"link": 5
}
],
"outputs": [
{
"name": "images",
"type": "IMAGE",
"links": [
2
]
},
{
"name": "text",
"type": "STRING",
"links": null
},
{
"name": "think_text",
"type": "STRING",
"links": [
3
]
},
{
"name": "metadata_json",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaU1LocalTextToImage"
},
"widgets_values": [
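"The title of this infographic is “SenseNova-U1”, in a modern minimalist tech-matrix style. The overall layout is a horizontal three-column grid, on a background of matte pure-white premium paper texture with a very faint, fine silver-gray dot matrix; the aspect ratio is 16:9.\\n\\nThe typography follows a strict visual hierarchy: the main title uses a bold sans-serif typeface, and the body text uses a clear modern monospaced font. The color palette is highly restrained: pure white as the base, deep charcoal black for the primary text and borders, light slate gray for background blocks and secondary information, and icons drawn as refined silver-gray line art.\\n\\nCentered at the very top of the image is the large title “SenseNova-U1” in striking deep charcoal black bold type. Directly below the title is a subtitle in light slate gray monospaced type: “A new generation of end-to-end unified multimodal large model family”.\\n\\nThe main body is divided into three equal vertical information blocks (left, center, right), separated from one another by generous negative space.\\n\\nThe left block covers the overview. At the top is a silver-gray line-art icon of an interlocking magnifying glass and gear, next to the bold subheading “Overview”. Inside the block, three points are stacked vertically from top to bottom: beside the first point is a minimalist icon of an overlapping document and photo, followed by the text “A multimodal model family that unifies text/image understanding and generation”. Below it is an architecture icon made of two connected concentric circles, with the text “Based on the NEO-Unify architecture (end-to-end unified understanding and generation)”. At the bottom is an icon of an eye and a funnel shape crossed out by a diagonal line, clearly pointing to the text “No visual encoder (VE) or variational autoencoder (VAE) required”.\\n\\nThe center block presents the model lineup. At the top is a tree-network icon with two branch nodes, next to the bold subheading “Two model sizes”. The block is split into an upper and a lower card, each enclosed in an extremely thin light slate gray border. The upper card shows a solid geometric cube icon representing high density, labeled in large type “SenseNova-U1-8B-MoT”, with the monospaced caption “8B MoT dense backbone model” below it. The lower card shows a glowing mesh brain icon with a lightning-bolt symbol, labeled in large type “SenseNova-U1-A3B-MoT”, with the monospaced caption “A3B MoT mixture-of-experts (MoE) backbone model” below it. Directly beneath these two separate cards, on the left is a smiley-face outline icon paired with the text “To be released on HF and other platforms”, and on the right is an icon of a written report with a folded corner paired with the text “Technical report to be released”.\\n\\nThe right block presents the core strengths. At the top is an ascending staircase line-chart icon representing a peak, next to the bold subheading “Highlights”. Inside the block, four rectangular panels with a light slate gray fill are distributed vertically; each panel has a specific icon on the left and text on the right. The first panel contains a seamlessly connected Möbius strip icon, with the caption “Natively unified architecture, no VE or VAE”. The second panel contains a trophy icon topped with a star, with the caption “A single unified model achieves SOTA performance on both understanding and generation tasks”. The third panel contains an icon of text lines alternating with Polaroid photos, with the caption “Strong native interleaved reasoning (the model natively generates images to reason)”. The last panel contains an icon combining a coin with a small slice cut out and a detailed pie chart, with the caption “Generates complex infographics with excellent cost-effectiveness”.",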
"2720x1536|16:9",
4,
"none",
3,
0,
1,
50,
1,
42,
false,
false
]
},
{
"id": 3,
"type": "PreviewImage",
"pos": [
612.0374341866545,
153.07410621275454
],
"size": [
614.578125,
393.609375
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 2
}
],
"outputs": [],
"properties": {
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 4,
"type": "PreviewAny",
"pos": [
614.1411713052378,
610.8904862658945
],
"size": [
606.28125,
415.96875
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 3
}
],
"outputs": [],
"properties": {
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
},
{
"id": 6,
"type": "Note",
"pos": [
-874.1990665007161,
624.8566777241045
],
"size": [
238.546875,
133.046875
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [],
"properties": {},
"widgets_values": [
"This is a prompt enhancement module; you can turn it off if you don't need it."
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 7,
"type": "SenseNovaPromptBuilder",
"pos": [
-585.0759757790407,
605.0166969060281
],
"size": [
504.09375,
412.65625
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "prompt",
"type": "STRING",
"links": null
},
{
"name": "usage_json",
"type": "STRING",
"links": [
5
]
},
{
"name": "raw_json",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaPromptBuilder"
},
"widgets_values": [
"Generate an educational infographic on preventing telecom fraud",
"You are a world-renowned \"Senior Visual Information Architect\" and \"AI Image Prompt Engineering Expert.\" You specialize in transforming fragmented or chaotic [Raw Information] into highly structured, professional Infographic Generation Prompts. Your work is defined by rigorous visual logic, precise spatial organization, and a high density of useful information.\n\n# Task\nReconstruct the user’s [Raw Information] into a comprehensive visual synthesis prompt (approx. 400-600 words). Your objective is to guide large image models (e.g., Gemini, Midjourney, DALL-E 3) to render an information-dense infographic featuring advanced typography, a vivid visual style, and perfect structural clarity based solely on your textual description.\n\n# Step-by-Step Methodology\n1. **Content Expansion & Textualization**: Analyze the [Raw Information] to extract its core intent.\n - Detailing: Extract every entity, number, color, and phrase from the [Raw Information]. Do not summarize.\n - Categorization: Define sub-categories with distinct visual markers.\n - Density Enrichment: If the input is brief, supplement it with professional annotations, sub-headings, body text and \"Pro-tips\" or \"Key Insights\" related to the topic to maximize the \"information load\".\n2. **Adaptive Structural Analysis**:\n - User-Defined Priority: First, check if the user has provided specific layout instructions (e.g., \"three-column grid,\" \"horizontal timeline\"). If present, strictly follow these instructions.\n - Logic-Driven Inference: If no layout is specified, analyze the [Raw Information] for its underlying logic (chronological, hierarchical, process-oriented, or comparative) and design a spatial architecture that best serves that logic.\n3. **Style Tonal Setting**: If no specific style is provided, assign a unique aesthetic that complements the content (e.g., French hand-drawn collage, modern minimalist matrix, or industrial technical blueprint).\n4. **Data Preservation & Encoding**: Ensure all numbers, dates, and proper nouns are 100% preserved. Convert these into explicit visual labels, charts, or callouts within the prompt. Detect the language of the [Raw Information] and use it for 100% of the output. If input is Chinese, output Chinese. If input is English, output English. No mixing.\n\n\n# Strict Constraints\n1. **Strict Language Parity**: Maintain absolute language consistency. If the [Raw Information] is in Chinese, the entire output must be in Chinese; if in English, the output must be in English. No code-switching.\n2. **Fidelity to [Raw Information]**: You are prohibited from omitting any proper nouns, dates, colors, or specific values provided in the input.\n3. **The \"Zero Nonsense\" Rule**: STRICTLY FORBIDDEN to include introductory, summary, or meta-commentary text (e.g., \"Here is the refined prompt...\"). Do not explain design choices or justify element omissions (e.g., do not mention \"implied flow\"). Start the response immediately with the visual description.\n4. **Visual Precision**:\n - Textures: Mandatorily describe background textures (e.g., off-white aged paper, light gray grid, or black halftone shadows).\n - Typography: Explicitly specify font styles for different hierarchies (e.g., bold serif for titles, condensed mono-space for technical data).\n5. **Text Rendering Protocol**:\n - Quotes for Content: Every piece of text intended to appear in the image MUST be enclosed in quotes.\n - No Quotes for Style: NEVER use quotation marks for descriptions of [Style Description], [Layout Structure], colors or any non-textual elements.\n6. **Relational Arrow Logic**: Minimize the use of arrows. Rely on spatial proximity or alignment to imply connectivity. If arrows are requested, avoid generic orientations like \"horizontal.\" Instead, specify their precise starting point and target destination.\n7. **Semantic Icon Correspondence (CRITICAL)**: You must specifically describe the visual content of every icon to ensure it matches the quoted text. (e.g., \"Next to the text 'Apple' is a detailed illustration of a red delicious apple with a green leaf.\") Do not use generic terms like \"an icon\" or \"a graphic\" without specifying what it is.\n8. **No Hexadecimal Codes**: Never use codes like #xxxx. Use descriptive color names (e.g., sage green, deep navy blue, terracotta).\n\n# Output Format (If the [Raw Information] is in Chinese, please translate the following content into Chinese. If the [Raw Information] is in English, please keep the following content in English.)\nThe theme of the infographic is [Subject Name] (or 此信息图的主题是: [Subject Name]), [Style Description]. The overall layout is [Layout Structure], with a background of [Background Details].\nProvide a smooth and fluent description of the prompts for generating professional infographics. The title is: \"Subject Name\", [Description of elements or icons in the infographic], [Position], and embed the text information within it, enclosed in quotes.\n\n---\nPlease receive the user's [Raw Information] and directly output the restructured professional image generation prompt:",
"sensenova-6.7-flash-lite",
0.3,
1,
4096,
120
]
}
],
"links": [
[
1,
1,
0,
2,
0,
"SENSENOVA_U1_LOCAL_MODEL"
],
[
2,
2,
0,
3,
0,
"IMAGE"
],
[
3,
2,
2,
4,
0,
"STRING"
],
[
5,
7,
1,
2,
1,
"STRING"
]
],
"groups": [],
"config": {},
"extra": {
"workflowRendererVersion": "Vue",
"ds": {
"scale": 0.6412973759090729,
"offset": [
1055.6855096974812,
19.214707810300403
]
},
"frontendVersion": "1.39.19",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
{
"id": "be8d8cc8-cbee-4189-9901-f3562f2b9815",
"revision": 0,
"last_node_id": 6,
"last_link_id": 8,
"nodes": [
{
"id": 1,
"type": "SenseNovaU1LocalLoader",
"pos": [
-699.5435454242499,
144.33814729971365
],
"size": [
504,
432
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "u1_model",
"type": "SENSENOVA_U1_LOCAL_MODEL",
"links": [
5
]
},
{
"name": "model_info_json",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaU1LocalLoader"
},
"widgets_values": [
"sensenova/SenseNova-U1-8B-MoT",
"",
"cuda",
"bfloat16",
"auto",
"none",
"",
"full",
""
]
},
{
"id": 3,
"type": "PreviewImage",
"pos": [
1558.6051808986454,
127.84686237113888
],
"size": [
1080,
684
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 6
}
],
"outputs": [],
"properties": {
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 4,
"type": "SenseNovaInterleavePreview",
"pos": [
743.748594409936,
132.9936483478975
],
"size": [
702.609375,
2427.171875
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "interleave_result",
"type": "SENSENOVA_INTERLEAVE_RESULT",
"link": 7
},
{
"name": "images",
"shape": 7,
"type": "IMAGE",
"link": 8
}
],
"outputs": [
{
"name": "markdown",
"type": "STRING",
"links": null
}
],
"properties": {
"Node name for S&R": "SenseNovaInterleavePreview"
},
"widgets_values": [
false,
""
]
},
{
"id": 5,
"type": "SenseNovaU1LocalInterleave",
"pos": [
-73.00482369366728,
142.500843346499
],
"size": [
689.09375,
955.40625
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "u1_model",
"type": "SENSENOVA_U1_LOCAL_MODEL",
"link": 5
},
{
"name": "image",
"shape": 7,
"type": "IMAGE",
"link": null
}
],
"outputs": [
{
"name": "images",
"type": "IMAGE",
"links": [
6,
8
]
},
{
"name": "text",
"type": "STRING",
"links": null
},
{
"name": "think_text",
"type": "STRING",
"links": null
},
{
"name": "metadata_json",
"type": "STRING",
"links": null
},
{
"name": "interleave_result",
"type": "SENSENOVA_INTERLEAVE_RESULT",
"links": [
7
]
}
],
"properties": {
"Node name for S&R": "SenseNovaU1LocalInterleave"
},
"widgets_values": [
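"Tell the classic fairy tale “The Little Match Girl”, but this time as a heartwarming parallel-universe adaptation in the form of an illustrated picture book. When she strikes her last match, what appears is not an illusion but a magical reindeer, which carries the little girl off to a castle with candy and a fireplace.",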
"2048x1152|16:9",
"You are a multimodal assistant capable of reasoning with both text and images. You support two modes:\n\nThink Mode: When reasoning is needed, you MUST start with a <think></think> block and place all reasoning inside it. You MUST interleave text with generated images using tags like <image1>, <image2>. Images can ONLY be generated between <think> and </think>, and may be referenced in the final answer.\n\nNon-Think Mode: When no reasoning is needed, directly provide the answer without reasoning. Do not use tags like <image1>, <image2>; present any images naturally alongside the text.\n\nAfter the think block, always provide a concise, user-facing final answer. The answer may include text, images, or both. Match the user's language in both reasoning and the final answer.",
4,
1,
3,
0,
1,
50,
42,
"fixed",
false
]
}
],
"links": [
[
5,
1,
0,
5,
0,
"SENSENOVA_U1_LOCAL_MODEL"
],
[
6,
5,
0,
3,
0,
"IMAGE"
],
[
7,
5,
4,
4,
0,
"SENSENOVA_INTERLEAVE_RESULT"
],
[
8,
5,
0,
4,
1,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"workflowRendererVersion": "Vue",
"ds": {
"scale": 0.626887968379132,
"offset": [
1496.3740592420436,
92.9135509708664
]
},
"frontendVersion": "1.39.19",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}