Commit 2a934cec authored by raojy

first

parent 4b618aa3
# Examples
Reference inference scripts for SenseNova-U1. Every script here is intentionally
self-contained — on top of the `sensenova_u1` package itself it only pulls in
`torch`, `transformers`, `pillow`, `numpy` (and optionally `tqdm` /
`flash-attn`).
Each task lives in its own subfolder with a matching `data/` directory of
sample inputs:
```
examples/
├── README.md
├── t2i/                              # text-to-image
│   ├── inference.py
│   └── data/
│       ├── samples.jsonl
│       ├── samples_reasoning.jsonl
│       └── samples_infographic.jsonl
├── editing/                          # image editing (it2i)
│   ├── inference.py
│   ├── resize_inputs.py              # offline pre-resize helper (recommended)
│   └── data/
│       ├── samples.jsonl
│       ├── samples_reasoning.jsonl
│       ├── images/
│       └── images_reasoning/
├── interleave/                       # interleaved text+image gen (runnable)
│   ├── inference.py
│   ├── run.sh
│   └── data/
│       ├── samples.jsonl
│       ├── samples_reasoning.jsonl
│       ├── images/
│       └── images_reasoning/
└── vqa/                              # visual understanding / VQA
    ├── inference.py
    └── data/
        ├── samples.jsonl
        └── images/
```
## Multi-GPU dispatch
All reference inference scripts support Transformers / Accelerate device-map loading.
For multi-GPU machines, add `--device_map auto` so Accelerate can split the model across visible GPUs:
```bash
python examples/t2i/inference.py \
--model_path SenseNova/SenseNova-U1-8B-MoT \
--prompt "A cinematic mountain village at sunrise" \
--device_map auto \
--max_memory "0=22GiB,1=22GiB" \
--output output.png
```
When `--device_map` is set, the model is dispatched by Accelerate and the
script does not call `.to(device)` on the full model.
With `--device_map auto`, Accelerate estimates module sizes, checks the
available memory on each visible GPU, and assigns modules to GPUs in order.
Passing `--max_memory` overrides the automatically detected per-device budget
and is recommended for reproducible placement on heterogeneous setups.
`--max_memory` constrains how Transformers / Accelerate places **model weights** across GPUs.
It is not a hard end-to-end VRAM cap: forward-time activation tensors, KV cache,
PyTorch reserved memory, and image-generation intermediates still need extra room.
On small GPUs, set the per-GPU budget below physical VRAM (for example `26GiB`-`28GiB` on a 32GB card)
and lower resolution / batch size if generation still OOMs.
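As a rough illustration of the `--max_memory` string used above, the sketch below turns a `DEVICE=SIZE` list into the dictionary form that Accelerate accepts for `max_memory` (integer keys for GPU indices, string keys such as `"cpu"`). The helper name `parse_max_memory` is hypothetical; how the reference scripts actually parse the flag is not shown here.

```python
from typing import Dict, Union


def parse_max_memory(spec: str) -> Dict[Union[int, str], str]:
    """Hypothetical helper: "0=22GiB,1=22GiB" -> {0: "22GiB", 1: "22GiB"}."""
    budgets: Dict[Union[int, str], str] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        device, sep, amount = item.partition("=")
        device, amount = device.strip(), amount.strip()
        if not sep or not device or not amount:
            raise ValueError(f"expected DEVICE=SIZE, got {item!r}")
        # GPU indices become ints; names such as "cpu" stay strings.
        key: Union[int, str] = int(device) if device.isdigit() else device
        budgets[key] = amount
    return budgets


if __name__ == "__main__":
    print(parse_max_memory("0=22GiB,1=22GiB"))
```

Leaving a few GiB of headroom per card in these budgets follows the advice above, since the cap covers weights only.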
### Low-VRAM single-card: `--vram_mode`
For single-card low-VRAM setups, prefer the project's layer-offload path over Accelerate CPU/disk
dispatch — it is significantly faster and is supported by all four reference scripts (t2i,
interleave, editing, vqa):
| `--vram_mode` | Behavior |
|---------------|----------|
| `full` | Whole model on GPU, no offload. Fastest. (Default.) |
| `low` | Synchronous per-layer CPU<->GPU swap. Smallest weight footprint, slowest. |
| `balanced` | Async prefetch (overlaps host->device with compute). Faster than `low`. |
```bash
python examples/t2i/inference.py \
--model_path SenseNova/SenseNova-U1-8B-MoT \
--prompt "A cinematic mountain village at sunrise" \
--vram_mode balanced \
--output output.png
```
`--vram_mode` is mutually exclusive with `--device_map`: layer offload requires the model to stay
on CPU between forwards, which is incompatible with Accelerate's static dispatch.
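The exclusivity rule can be sketched as an argument check. This is only an illustration: the parser and `check_placement` are hypothetical, and it assumes that the default `full` mode (no offload) is the one value that may coexist with `--device_map`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the two placement flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--device_map", default=None)
    parser.add_argument(
        "--vram_mode", choices=("full", "low", "balanced"), default="full"
    )
    return parser


def check_placement(args: argparse.Namespace) -> None:
    # Assumption: "full" performs no offload, so only the offloading modes
    # conflict with an Accelerate device map.
    if args.device_map is not None and args.vram_mode != "full":
        raise SystemExit(
            "--vram_mode low/balanced cannot be combined with --device_map"
        )


if __name__ == "__main__":
    check_placement(build_parser().parse_args(["--vram_mode", "balanced"]))
    print("ok")
```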
## Text-to-Image
Single prompt:
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型规格”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-8B-MoT”,下方是等宽字体说明“8B MoT 密集主干模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-A3B-MoT”,下方是等宽字体说明“A3B MoT 混合专家(MoE)主干模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,性价比出色”。" \
--width 2720 --height 1536 \
--device_map auto \
--batch_size 1 \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--output output.png \
--profile
```
Batched prompts from a JSONL file (each line must contain a `prompt`;
`width` / `height` / `seed` are optional):
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/t2i/data/samples.jsonl \
--output_dir outputs/ \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--profile
```
See [`t2i/data/samples.jsonl`](./t2i/data/samples.jsonl) for a tiny starter file. Run `python examples/t2i/inference.py --help` for the full flag list.
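To make the JSONL contract concrete, here is a minimal sketch of a loader that enforces the required `prompt` field and leaves `width` / `height` / `seed` optional. The function name `load_t2i_samples` and its error messages are assumptions for illustration, not the script's actual code.

```python
import json
from typing import Iterable, List


def load_t2i_samples(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines, requiring `prompt` on every non-empty line."""
    samples = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        if "prompt" not in sample:
            raise ValueError(f"line {lineno}: missing required 'prompt'")
        samples.append(sample)
    return samples


if __name__ == "__main__":
    demo = [
        '{"prompt": "A red bicycle"}',
        '{"prompt": "A lighthouse", "width": 2720, "height": 1536, "seed": 7}',
    ]
    for sample in load_t2i_samples(demo):
        print(sample["prompt"], sample.get("width"), sample.get("seed"))
```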
Infographic-focused batched generation:
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT-Infographic \
--jsonl examples/t2i/data/samples_infographic_showcases.jsonl \
--output_dir outputs/ \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--profile
```
See [`t2i/data/samples_infographic_showcases.jsonl`](./t2i/data/samples_infographic_showcases.jsonl) to reproduce the infographic showcases; the generated results are shown in [Infographic Showcases](../docs/u1_infographic_showcases.md).
### T2I reasoning (think mode)
The model can run a **reasoning** phase before denoising: it autoregressively fills `<think>...</think>`, then generates the image.
Single prompt (image + reasoning text):
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "A male peacock trying to attract a female" \
--width 2048 --height 2048 \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--seed 42 \
--think \
--print_think \
--output outputs/peacock.png
```
This writes `outputs/peacock.think.txt` with the raw thinking tokens. Use `--think_output /path/to/reasoning.txt` to choose another path, or `--print_think` to echo it to stdout.
Batched reasoning prompts from a JSONL file:
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/t2i/data/samples_reasoning.jsonl \
--output_dir outputs/ \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--seed 42 \
--think \
--print_think \
--profile
```
In JSONL mode, set `"think": true` on a sample to enable reasoning for that line even when `--think` is not passed, or pass `--think` to enable it for all samples.
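The per-sample switch and the global flag combine as a logical OR. The helper below is a hypothetical sketch of that rule, not the script's own code:

```python
def resolve_think(sample: dict, cli_think: bool) -> bool:
    """Reasoning is on if either the CLI flag or the sample asks for it."""
    return bool(cli_think or sample.get("think", False))


if __name__ == "__main__":
    print(resolve_think({"prompt": "x", "think": True}, cli_think=False))
    print(resolve_think({"prompt": "x"}, cli_think=False))
```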
### Supported resolution buckets
SenseNova-U1 is trained on ~2K-pixel resolution buckets. Passing arbitrary `--width` / `--height` is allowed but quality may degrade for untrained shapes.
| Aspect ratio | Width × Height |
| :----------- | :------------- |
| 1:1 | 2048 × 2048 |
| 16:9 / 9:16 | 2720 × 1536 / 1536 × 2720 |
| 3:2 / 2:3 | 2496 × 1664 / 1664 × 2496 |
| 4:3 / 3:4 | 2368 × 1760 / 1760 × 2368 |
| 2:1 / 1:2 | 2880 × 1440 / 1440 × 2880 |
| 3:1 / 1:3 | 3456 × 1152 / 1152 × 3456 |
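If you start from an arbitrary size, one way to stay on a trained shape is to pick the bucket with the closest aspect ratio. The sketch below does that by comparing log aspect ratios, so that wide and tall mismatches are treated symmetrically. The `nearest_bucket` helper is hypothetical and is not part of the reference scripts.

```python
import math
from typing import List, Tuple

# (width, height) pairs copied from the table above.
BUCKETS: List[Tuple[int, int]] = [
    (2048, 2048),
    (2720, 1536), (1536, 2720),
    (2496, 1664), (1664, 2496),
    (2368, 1760), (1760, 2368),
    (2880, 1440), (1440, 2880),
    (3456, 1152), (1152, 3456),
]


def nearest_bucket(width: int, height: int) -> Tuple[int, int]:
    """Return the bucket whose aspect ratio is closest to width:height."""
    target = math.log(width / height)
    return min(BUCKETS, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    w, h = nearest_bucket(1920, 1080)
    print(f"--width {w} --height {h}")
```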
### Prompt Enhancement for Infographics
Short prompts — especially for **infographic** generation — can be enhanced by a strong LLM before inference, which noticeably lifts information density, typography fidelity, and layout adherence. Enable with `--enhance`:
```bash
# export U1_ENHANCE_API_KEY=sk-... # required
# defaults target Gemini 3.1 Pro via its OpenAI-compatible endpoint;
# override any of these to point at SenseNova / Claude / Kimi 2.5 etc.:
# export U1_ENHANCE_BACKEND=chat_completions # or 'anthropic'
# export U1_ENHANCE_ENDPOINT=https://...chat/completions
# export U1_ENHANCE_MODEL=gemini-3.1-pro
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "如何制作咖啡的教程" \
--enhance --print_enhance \
--output output.png
```
See [`docs/prompt_enhancement.md`](../docs/prompt_enhancement.md) for full details.
## Image Editing (it2i)
Single edit:
```bash
python examples/editing/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "Change the jacket of the person on the left to bright yellow." \
--image examples/editing/data/images/1.webp \
--cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none \
--timestep_shift 3.0 --num_steps 50 \
--output edited.png \
--profile --compare
```
Batched edits from a JSONL file (each line must contain a `prompt` and
`image` path; `seed` / `type` are optional; `image` can also be a list of
paths to pass multiple reference images; a per-sample `width` + `height` pair
overrides the CLI default for that line):
```bash
python examples/editing/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/editing/data/samples.jsonl \
--output_dir outputs/editing/ \
--cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none \
--timestep_shift 3.0 --num_steps 50 \
--profile --compare
```
Output resolution has two modes:
- **Auto (default)**: omit `--width / --height`; the output tracks the first input via `smart_resize` (aspect ratio preserved, total pixels normalized to `--target_pixels`, default `2048 * 2048`, with H / W snapped to multiples of 32).
- **Explicit**: pass `--width W --height H` (both multiples of 32). 2048 × 2048 is a good general-purpose choice.
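The auto mode can be approximated with a few lines of arithmetic: scale both sides by the same factor so the area matches the pixel target, then round each side to a multiple of 32. This is a simplified stand-in written for illustration; the project's `smart_resize` may round differently, and `resize_to_pixel_budget` is a hypothetical name.

```python
import math
from typing import Tuple


def resize_to_pixel_budget(
    width: int,
    height: int,
    target_pixels: int = 2048 * 2048,
    factor: int = 32,
) -> Tuple[int, int]:
    """Approximate auto sizing: keep aspect ratio, hit the pixel target,
    and snap both sides to multiples of `factor`."""
    scale = math.sqrt(target_pixels / (width * height))
    new_w = max(factor, round(width * scale / factor) * factor)
    new_h = max(factor, round(height * scale / factor) * factor)
    return new_w, new_h


if __name__ == "__main__":
    print(resize_to_pixel_budget(1920, 1080))
```

Because of the rounding, the resulting area is only close to the target, not exactly equal to it.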
For best editing quality, provide high-resolution input/reference images when possible.
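Use `--input_max_pixels auto` when GPU memory is limited: with one or two inputs, each image is capped at 2048 × 2048; with more than two, the total budget of two 2048 × 2048 images is divided evenly across all inputs (never below 512 × 512 per image).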
This is a conservative default based on a two-reference 2048 × 2048 memory budget;
tune the cap for your GPU memory and quality needs, or pass an integer such as `1048576` for a 1024 × 1024 cap.
Resize-to-budget is enabled by default; pass `--no-do-resize` to skip this outer resize step.
The model's native preprocessing still applies its own input limits.
The script prints the per-image input cap before generation.
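The `auto` budget is plain integer arithmetic. The sketch below mirrors the logic of `_auto_input_max_pixels` in `examples/editing/inference.py` under a shorter, illustrative name, so the per-image cap can be checked before launching a run.

```python
FULL_RES_PIXELS = 2048 * 2048   # per-image cap for one or two inputs
MIN_PIXELS = 512 * 512          # floor applied to the per-image cap
FULL_RES_IMAGE_BUDGET = 2       # number of inputs that may stay at full res


def auto_input_max_pixels(num_images: int) -> int:
    """Per-image pixel cap for `--input_max_pixels auto`."""
    if num_images <= FULL_RES_IMAGE_BUDGET:
        return FULL_RES_PIXELS
    total_budget = FULL_RES_IMAGE_BUDGET * FULL_RES_PIXELS
    return max(MIN_PIXELS, total_budget // num_images)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 8):
        cap = auto_input_max_pixels(n)
        side = int(cap ** 0.5)
        print(f"{n} image(s): cap={cap} (about {side}x{side})")
```

With four inputs, for example, each image gets 2097152 pixels, which is roughly 1448 × 1448.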
CFG defaults: `--cfg_scale 4.0` (text guidance), `--img_cfg_scale 1.0` (image CFG off by default). Run `python examples/editing/inference.py --help` for the full flag list.
See [`editing/data/samples.jsonl`](./editing/data/samples.jsonl) for a tiny starter file.
## Interleave
`examples/interleave/inference.py` drives `model.interleave_gen`, which produces
**interleaved text and images in a single response**. The model can emit a
`<think>...</think>` reasoning block that generates intermediate images, followed
by a concise final answer. See [`interleave/run.sh`](./interleave/run.sh) for a
three-mode launcher covering every usage pattern below.
**Output files:** every sample writes `<stem>.txt` (generated text) plus `<stem>_image_<i>.png` for each generated image; `--jsonl` mode also emits a `results.jsonl` manifest.
**Resolution:** when input images are provided via `--image` or the JSONL `image` field, the output resolution follows the first input image (snapped to 32-aligned buckets via `smart_resize`), overriding `--resolution` / `--width` / `--height`.
### 1) Single sample, text prompt only
```bash
python examples/interleave/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." \
--resolution "16:9" \
--output_dir outputs/interleave/text \
--stem demo_text
```
### 2) Single sample, text prompt + input image
```bash
python examples/interleave/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "<image>\n图文交错生成小猫游览故宫的场景" \
--image examples/interleave/data/images/image0.jpg \
--output_dir outputs/interleave/text_image \
--stem demo_text_image
```
### 3) Batched samples from JSONL
Each line is one sample:
```json
{"prompt": "..."}
{"prompt": "...", "image": ["a.jpg"]}
```
```bash
python examples/interleave/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/interleave/data/samples.jsonl \
--resolution "16:9" \
--output_dir outputs/interleave/jsonl
```
See [`interleave/data/samples.jsonl`](./interleave/data/samples.jsonl) for a
two-sample starter (one text-only, one image-conditioned).
## Visual Understanding (VQA)
Single image, with sampling enabled:
```bash
python examples/vqa/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--image examples/vqa/data/images/menu.jpg \
--question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." \
--output outputs/menu_answer.txt \
--max_new_tokens 8192 \
--do_sample \
--temperature 0.6 \
--top_p 0.95 \
--top_k 20 \
--repetition_penalty 1.05 \
--profile
```
Omit `--do_sample` (and the sampling flags) for deterministic greedy decoding.
Batched questions from a JSONL file (each line must contain `image` and `question`; `id` is optional):
```bash
python examples/vqa/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/vqa/data/samples.jsonl \
--output_dir outputs/vqa/ \
--max_new_tokens 8192 \
--do_sample \
--temperature 0.6 \
--top_p 0.95 \
--top_k 20 \
--repetition_penalty 1.05 \
--profile
```
Results are written to `outputs/vqa/answers.jsonl`, one JSON object per line with `id`, `image`, `question`, and `answer` fields.
See [`vqa/data/samples.jsonl`](./vqa/data/samples.jsonl) for a starter file.
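For post-processing, the results file can be read back line by line. The sketch below indexes answers by `id`, relying only on the four fields listed above; `index_answers` is a hypothetical helper.

```python
import json
from typing import Dict, Iterable


def index_answers(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record's `id` (as a string) to its `answer`."""
    answers: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        answers[str(record["id"])] = record["answer"]
    return answers


if __name__ == "__main__":
    demo = [
        '{"id": "menu", "image": "menu.jpg", "question": "Q?", "answer": "A."}',
    ]
    print(index_answers(demo))
```

To use it on real output, pass an open file handle, for example `index_answers(open("outputs/vqa/answers.jsonl", encoding="utf-8"))`.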
### Generation parameters
| Flag | Default | Description |
| :--- | :------ | :---------- |
| `--max_new_tokens` | 1024 | Maximum response length |
| `--do_sample` | off (greedy) | Enable sampling |
| `--temperature` | 0.7 | Sampling temperature (used with `--do_sample`) |
| `--top_p` | 0.9 | Nucleus sampling threshold (used with `--do_sample`) |
| `--top_k` | None | Top-k sampling (used with `--do_sample`) |
| `--repetition_penalty` | None | Repetition penalty |
Run `python examples/vqa/inference.py --help` for the full flag list.
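# Examples
This directory provides reference inference scripts for SenseNova-U1. Every script is intentionally self-contained: besides the `sensenova_u1` package itself, it depends only on `torch`, `transformers`, `pillow`, and `numpy` (plus the optional `tqdm` / `flash-attn`).
Each task lives in its own subdirectory with a matching `data/` directory of sample inputs: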
```
examples/
├── README.md
├── t2i/                              # text-to-image
│   ├── inference.py
│   └── data/
│       ├── samples.jsonl
│       ├── samples_reasoning.jsonl
│       └── samples_infographic.jsonl
├── editing/                          # image editing (it2i)
│   ├── inference.py
│   ├── resize_inputs.py              # offline pre-resize helper (recommended)
│   └── data/
│       ├── samples.jsonl
│       ├── samples_reasoning.jsonl
│       ├── images/
│       └── images_reasoning/
├── interleave/                       # interleaved text+image generation (runnable)
│   ├── inference.py
│   ├── run.sh
│   └── data/
│       ├── samples.jsonl
│       ├── samples_reasoning.jsonl
│       ├── images/
│       └── images_reasoning/
└── vqa/                              # visual understanding / VQA
    ├── inference.py
    └── data/
        ├── samples.jsonl
        └── images/
```
## Multi-GPU dispatch
All reference inference scripts support Transformers / Accelerate device-map loading.
On multi-GPU machines, add `--device_map auto` so that Accelerate splits the model across the visible GPUs:
```bash
python examples/t2i/inference.py \
--model_path SenseNova/SenseNova-U1-8B-MoT \
--prompt "A cinematic mountain village at sunrise" \
--device_map auto \
--max_memory "0=22GiB,1=22GiB" \
--output output.png
```
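When `--device_map` is set, the model is dispatched by Accelerate and the script no longer calls `.to(device)` on the full model.
With `--device_map auto`, Accelerate estimates the size of each module, reads the available memory on the visible GPUs, and assigns modules to the GPUs in order.
Passing `--max_memory` overrides the automatically detected per-device memory budget; set it explicitly when you need reproducible placement on heterogeneous machines.
`--max_memory` constrains how Transformers / Accelerate places **model weights** across GPUs.
It is not a strict end-to-end VRAM cap: forward-time activations, the KV cache, PyTorch reserved memory,
and image-generation intermediate state still need extra room.
On small GPUs, set the per-GPU budget below physical VRAM, for example `26GiB`-`28GiB` on a 32GB card.
If generation still OOMs, lower the resolution or batch size further.
### Low-VRAM single card: `--vram_mode`
When single-card VRAM is tight, prefer the project's built-in layer offload, which is much faster than Accelerate's CPU/disk dispatch;
all four reference scripts (t2i / interleave / editing / vqa) support it:
| `--vram_mode` | Behavior |
|---------------|----------|
| `full` | Whole model resident on GPU, no offload. Fastest. (Default.) |
| `low` | Synchronous per-layer CPU<->GPU swap. Smallest weight footprint, slowest. |
| `balanced` | Async prefetch (H2D overlaps with compute). Faster than `low`. |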
```bash
python examples/t2i/inference.py \
--model_path SenseNova/SenseNova-U1-8B-MoT \
--prompt "A cinematic mountain village at sunrise" \
--vram_mode balanced \
--output output.png
```
`--vram_mode` is mutually exclusive with `--device_map`: layer offload requires the model to stay on CPU between forwards,
which is incompatible with Accelerate's static dispatch.
## Text-to-Image
Single-prompt inference:
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型规格”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-8B-MoT”,下方是等宽字体说明“8B MoT 密集主干模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-A3B-MoT”,下方是等宽字体说明“A3B MoT 混合专家(MoE)主干模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,性价比出色”。" \
--width 2720 --height 1536 \
--device_map auto \
--batch_size 1 \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--output output.png \
--profile
```
Batched inference from a JSONL file (each line must contain a `prompt` field; `width` / `height` / `seed` are optional):
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/t2i/data/samples.jsonl \
--output_dir outputs/ \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--profile
```
See [`t2i/data/samples.jsonl`](./t2i/data/samples.jsonl) for a minimal starter file. Run `python examples/t2i/inference.py --help` for the full flag list.
Infographic-focused batched generation:
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT-Infographic \
--jsonl examples/t2i/data/samples_infographic_showcases.jsonl \
--output_dir outputs/ \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--profile
```
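See [`t2i/data/samples_infographic_showcases.jsonl`](./t2i/data/samples_infographic_showcases.jsonl) to reproduce the infographic showcases; the generated results are shown in [Infographic Showcases](../docs/u1_infographic_showcases.md).
### T2I reasoning (think mode)
The model can run a **reasoning** phase before diffusion denoising: it first autoregressively generates `<think>...</think>`, then generates the image.
Single prompt (outputs image + reasoning text):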
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "A male peacock trying to attract a female" \
--width 2048 --height 2048 \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--seed 42 \
--think \
--print_think \
--output outputs/peacock.png
```
This command writes `outputs/peacock.think.txt`. Use `--think_output /path/to/reasoning.txt` to choose the save path, or `--print_think` to print the reasoning to stdout.
```bash
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/t2i/data/samples_reasoning.jsonl \
--output_dir outputs/ \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--seed 42 \
--think \
--print_think \
--profile
```
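In JSONL mode, set `"think": true` on an individual sample (this enables reasoning for that sample even if `--think` is not passed globally), or pass the global `--think` to enable it for all samples.
### Recommended resolution buckets
SenseNova-U1 is trained on ~2K-pixel resolution buckets. Arbitrary `--width` / `--height` values are supported, but quality may degrade for shapes not seen in training.
| Aspect ratio | Width × Height |
| :----------- | :------------------------ |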
| 1:1 | 2048 × 2048 |
| 16:9 / 9:16 | 2720 × 1536 / 1536 × 2720 |
| 3:2 / 2:3 | 2496 × 1664 / 1664 × 2496 |
| 4:3 / 3:4 | 2368 × 1760 / 1760 × 2368 |
| 2:1 / 1:2 | 2880 × 1440 / 1440 × 2880 |
| 3:1 / 1:3 | 3456 × 1152 / 1152 × 3456 |
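### Prompt Enhancement for Infographics
Short prompts, especially for **infographic** generation, can be rewritten by a stronger LLM before inference, which noticeably improves information density, typography fidelity, and layout adherence. Add `--enhance` to enable it: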
```bash
# export U1_ENHANCE_API_KEY=sk-... # required
# defaults target Gemini 3.1 Pro via its OpenAI-compatible endpoint;
# override any of these to point at SenseNova / Claude / Kimi 2.5 etc.:
# export U1_ENHANCE_BACKEND=chat_completions # or 'anthropic'
# export U1_ENHANCE_ENDPOINT=https://...chat/completions
# export U1_ENHANCE_MODEL=gemini-3.1-pro
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "如何制作咖啡的教程" \
--enhance --print_enhance \
--output output.png
```
See [`docs/prompt_enhancement_CN.md`](../docs/prompt_enhancement_CN.md) for details.
## Image Editing (it2i)
Single edit:
```bash
python examples/editing/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "Change the jacket of the person on the left to bright yellow." \
--image examples/editing/data/images/1.webp \
--cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none \
--timestep_shift 3.0 --num_steps 50 \
--output edited.png \
--profile --compare
```
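Batched edits from a JSONL file (each line must contain a `prompt` and an `image` path; `seed` / `type` are optional; `image` can also be a list of paths to pass multiple reference images; if a line provides both `width` and `height`, they override the CLI default for that sample):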
```bash
python examples/editing/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/editing/data/samples.jsonl \
--output_dir outputs/editing/ \
--cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none \
--timestep_shift 3.0 --num_steps 50 \
--profile --compare
```
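Output resolution has two modes:
- **Auto (default)**: when `--width / --height` are omitted, the output resolution follows the first input image; `smart_resize` preserves the aspect ratio, normalizes the total pixel count to `--target_pixels` (default `2048 * 2048`), and aligns both H and W to multiples of 32.
- **Explicit**: pass `--width W --height H` (both must be multiples of 32). 2048 × 2048 is a safe general-purpose choice.
For better editing quality, provide high-resolution input/reference images whenever possible.
When VRAM is limited, use `--input_max_pixels auto`: with 1-2 images, each is capped at 2048 × 2048; with more than 2, the total budget of two 2048 × 2048 images is split evenly across all inputs.
This policy is a conservative default based on the memory budget of two 2048 × 2048 reference images;
tune it for your GPU memory and editing quality, or pass an integer, for example `1048576` for a cap of about 1024 × 1024 per input image.
The outer resize-to-budget step is enabled by default; pass `--no-do-resize` to skip it, although the model's native image preprocessing still applies its own input size limits.
The script prints the per-image input cap before generation.
CFG defaults: `--cfg_scale 4.0` (text guidance strength), `--img_cfg_scale 1.0` (image CFG off by default). Run `python examples/editing/inference.py --help` for the full flag list.
See [`editing/data/samples.jsonl`](./editing/data/samples.jsonl) for a minimal starter file.
## Interleave
`examples/interleave/inference.py` calls `model.interleave_gen`, which produces **interleaved text and images in a single response**. The model first emits a `<think>...</think>` reasoning block in which it generates intermediate images, then gives a concise final answer. See [`interleave/run.sh`](./interleave/run.sh) for a launcher covering the three usage patterns below.
**Output files:** every sample writes `<stem>.txt` (generated text) plus `<stem>_image_<i>.png` for each generated image; `--jsonl` mode also emits a `results.jsonl` manifest.
**Resolution:** when input images are provided via `--image` or the JSONL `image` field, the output resolution follows the first input image (aligned to multiples of 32 via `smart_resize`) and overrides `--resolution` / `--width` / `--height`.
### 1) Single sample, text prompt only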
```bash
python examples/interleave/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." \
--resolution "16:9" \
--output_dir outputs/interleave/text \
--stem demo_text
```
### 2) Single sample, text prompt + input image
```bash
python examples/interleave/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "<image>\n图文交错生成小猫游览故宫的场景" \
--image examples/interleave/data/images/image0.jpg \
--output_dir outputs/interleave/text_image \
--stem demo_text_image
```
### 3) Batched samples from JSONL
Each line is one sample:
```json
{"prompt": "..."}
{"prompt": "...", "image": ["a.jpg"]}
```
```bash
python examples/interleave/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/interleave/data/samples.jsonl \
--resolution "16:9" \
--output_dir outputs/interleave/jsonl
```
[`interleave/data/samples.jsonl`](./interleave/data/samples.jsonl) provides a two-sample starter file (one text-only, one image-conditioned).
## Visual Understanding (VQA)
Single image, with sampling enabled:
```bash
python examples/vqa/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--image examples/vqa/data/images/menu.jpg \
--question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." \
--output outputs/menu_answer.txt \
--max_new_tokens 8192 \
--do_sample \
--temperature 0.6 \
--top_p 0.95 \
--top_k 20 \
--repetition_penalty 1.05 \
--profile
```
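Omit `--do_sample` (and the related sampling flags) for deterministic greedy decoding.
Batched questions from a JSONL file (each line must contain the `image` and `question` fields; `id` is optional):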
```bash
python examples/vqa/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/vqa/data/samples.jsonl \
--output_dir outputs/vqa/ \
--max_new_tokens 8192 \
--do_sample \
--temperature 0.6 \
--top_p 0.95 \
--top_k 20 \
--repetition_penalty 1.05 \
--profile
```
Results are written to `outputs/vqa/answers.jsonl`, one JSON object per line with `id`, `image`, `question`, and `answer` fields.
See [`vqa/data/samples.jsonl`](./vqa/data/samples.jsonl) for a starter file.
### Generation parameters
| Flag | Default | Description |
| :--- | :------ | :---------- |
| `--max_new_tokens` | 1024 | Maximum response length |
| `--do_sample` | off (greedy) | Enable sampling |
| `--temperature` | 0.7 | Sampling temperature (used with `--do_sample`) |
| `--top_p` | 0.9 | Nucleus sampling threshold (used with `--do_sample`) |
| `--top_k` | None | Top-k sampling (used with `--do_sample`) |
| `--repetition_penalty` | None | Repetition penalty |
Run `python examples/vqa/inference.py --help` for the full flag list.
{"image": "./examples/editing/data/images/1.webp", "prompt": "Change the jacket of the person on the left to bright yellow."}
{"image": "./examples/editing/data/images/2.webp", "prompt": "Make the person in the image smile."}
{"image": "./examples/editing/data/images/3.webp", "prompt": "在小狗头上放一个花环,并且把图片变为吉卜力风格。"}
{"image": "./examples/editing/data/images/4.webp", "prompt": "Add a bouquet of flowers."}
{"image": "./examples/editing/data/images/5.webp", "prompt": "Turn the image into an American comic style."}
{"image": "./examples/editing/data/images/6.webp", "prompt": "Replace the text \"WARFIGHTER\" to \"BATTLEFIELD\" in the bold orange-red font."}
{"image": "./examples/editing/data/images/7.webp", "prompt": "Remove the person on the far right wearing a green skirt and a green top."}
{"image": "./examples/editing/data/images/8.webp", "prompt": "Replace the man with a woman."}
{"prompt": "Draw what it will look like one hour later.", "image": "./examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png"}
{"prompt": "Draw what it will look like immediately after someone stands up from sitting on it for a long time.", "image": "./examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png"}
{"prompt": "Draw an image showing the side view of the provided traffic cone.", "image": "./examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si.png"}
{"prompt": "Change the water to high-concentration saltwater", "image": "./examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg"}
{"prompt": "What the fruit looks like when ripe in the picture", "image": "./examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg"}
{"prompt": "Correct the unreasonable part in the image.", "image": "./examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par.jpg"}
{"prompt": "Modify the matrix in the image to an upper triangular matrix", "image": "./examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima.jpg"}
from __future__ import annotations

import argparse
import json
import math
from pathlib import Path
from typing import Sequence

import numpy as np
import torch
from PIL import Image

import sensenova_u1
from sensenova_u1.models.neo_unify.utils import smart_resize
from sensenova_u1.utils import (
    DEFAULT_IMAGE_PATCH_SIZE,
    DEFAULT_VRAM_MODE,
    InferenceProfiler,
    add_offload_args,
    best_available_device,
    load_and_merge_lora_weight_from_safetensors,
    load_model_and_tokenizer,
    make_offload_ctx,
    save_compare,
    vram_mode_to_prefetch_count,
)

NORM_MEAN = (0.5, 0.5, 0.5)
NORM_STD = (0.5, 0.5, 0.5)
DEFAULT_SEED = 42

# Output H / W must be divisible by this (= patch_size * merge_size).
_IMAGE_GRID_FACTOR = DEFAULT_IMAGE_PATCH_SIZE

# Aspect ratio is preserved, total output pixels are normalized to this target.
DEFAULT_TARGET_PIXELS = 2048 * 2048
DEFAULT_INPUT_MAX_PIXELS = 2048 * 2048
MIN_INPUT_MAX_PIXELS = 512 * 512


def _auto_input_max_pixels(num_images: int) -> int:
    """Per-image pixel cap for ``--input_max_pixels auto``.

    Up to two inputs keep the full cap; beyond that, the two-image budget is
    split evenly, never dropping below ``MIN_INPUT_MAX_PIXELS``.
    """
    full_res_image_budget = 2
    if num_images <= full_res_image_budget:
        return DEFAULT_INPUT_MAX_PIXELS
    total_budget = full_res_image_budget * DEFAULT_INPUT_MAX_PIXELS
    return max(MIN_INPUT_MAX_PIXELS, total_budget // max(1, num_images))


def _resolve_input_max_pixels(value: str | None, num_images: int) -> int | None:
    """Turn the raw CLI value (``None``, ``"auto"``, or an integer string) into a cap."""
    if value is None:
        return None
    if value == "auto":
        return _auto_input_max_pixels(num_images)
    try:
        input_max_pixels = int(value)
    except ValueError as exc:
        raise SystemExit("--input_max_pixels must be an integer or 'auto'.") from exc
    if input_max_pixels < MIN_INPUT_MAX_PIXELS:
        side = int(math.sqrt(MIN_INPUT_MAX_PIXELS))
        raise SystemExit(f"--input_max_pixels must be >= {MIN_INPUT_MAX_PIXELS} ({side}*{side}).")
    return input_max_pixels


def _resize_to_max_budget(img: Image.Image, input_max_pixels: int) -> Image.Image:
    """Resize via ``smart_resize`` with min and max pixels both set to the budget."""
    resized_h, resized_w = smart_resize(
        height=img.height,
        width=img.width,
        factor=_IMAGE_GRID_FACTOR,
        min_pixels=input_max_pixels,
        max_pixels=input_max_pixels,
    )
    if (resized_w, resized_h) == img.size:
        return img
    return img.resize((resized_w, resized_h), Image.LANCZOS)


def _print_input_resize_hint(num_images: int, input_max_pixels: int | None, source: str, do_resize: bool) -> None:
    """Report the per-image input cap (or that resizing is disabled) before generation."""
    if not do_resize:
        print("[editing] resize-to-budget disabled; model preprocessing will still enforce its input limits.")
        return
    if input_max_pixels is None:
        return
    side = int(math.sqrt(input_max_pixels))
    print(
        f"[editing] {num_images} input image(s); {source} input_max_pixels={input_max_pixels} "
        f"(about {side}x{side} per image, aspect ratio preserved)."
    )


def _denorm(x: torch.Tensor) -> torch.Tensor:
    mean = torch.tensor(NORM_MEAN, device=x.device, dtype=x.dtype).view(1, 3, 1, 1)
    std = torch.tensor(NORM_STD, device=x.device, dtype=x.dtype).view(1, 3, 1, 1)
    return (x * std + mean).clamp(0, 1)


def _to_pil(batch: torch.Tensor) -> list[Image.Image]:
    arr = _denorm(batch.float()).permute(0, 2, 3, 1).cpu().numpy()
    arr = (arr * 255.0).round().astype(np.uint8)
    return [Image.fromarray(a) for a in arr]


def _load_input_image(
    path: str | Path,
    *,
    do_resize: bool,
    input_max_pixels: int | None,
) -> Image.Image:
    """Load as RGB; flatten RGBA onto white so the generator sees a clean canvas."""
    img = Image.open(path)
    if img.mode == "RGBA":
        # Composite onto white, using the alpha channel as the paste mask,
        # and keep the flattened result instead of discarding it.
        bg = Image.new("RGB", img.size, (255, 255, 255))
        bg.paste(img, mask=img.split()[3])
        img = bg
    else:
        img = img.convert("RGB")
    if do_resize and input_max_pixels is not None:
        img = _resize_to_max_budget(img, input_max_pixels)
    return img


def _coerce_image_paths(value: object) -> list[str]:
    if isinstance(value, list):
        return [str(v) for v in value]
    return [str(value)]


def _coerce_bool(value: object) -> bool:
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in {"1", "true", "yes", "y", "on"}:
            return True
        if lowered in {"0", "false", "no", "n", "off"}:
            return False
    raise SystemExit(f"Expected a boolean value for do_resize, got {value!r}.")


def _check_grid_divisible(width: int, height: int) -> None:
    if width % _IMAGE_GRID_FACTOR or height % _IMAGE_GRID_FACTOR:
        raise SystemExit(
            f"[editing] output resolution ({width}x{height}) must be a multiple "
            f"of {_IMAGE_GRID_FACTOR} on both axes (image-token grid factor)."
        )


def _resolve_output_size(
    input_images: Sequence[Image.Image],
    *,
    explicit: tuple[int, int] | None,
    target_pixels: int,
) -> tuple[int, int]:
    """Explicit (W, H) wins; else match the first input's aspect ratio and
    normalize the total pixel count to ``target_pixels``."""
    if explicit is not None:
        width, height = explicit
        _check_grid_divisible(width, height)
        return width, height
    w, h = input_images[0].size
    resized_h, resized_w = smart_resize(
        height=h,
        width=w,
        factor=_IMAGE_GRID_FACTOR,
        min_pixels=target_pixels,
        max_pixels=target_pixels,
    )
    return resized_w, resized_h


def _explicit_size_from_sample(sample: dict) -> tuple[int, int] | None:
    if "width" in sample and "height" in sample:
        return int(sample["width"]), int(sample["height"])
    return None


class SenseNovaU1Editing:
    """Thin wrapper calling ``model.it2i_generate`` on top of ``AutoModel``."""

    def __init__(
        self,
        model_path: str,
        device: str = "cuda",
        dtype: torch.dtype = torch.bfloat16,
        gguf_checkpoint: str | None = None,
        device_map: str | None = None,
        max_memory: str | None = None,
        vram_mode: str = DEFAULT_VRAM_MODE,
    ) -> None:
        self.device = device
        self.vram_mode = vram_mode
        self.prefetch_count = vram_mode_to_prefetch_count(vram_mode)
        self.model, self.tokenizer = load_model_and_tokenizer(
            model_path,
            dtype=dtype,
            device=device,
            gguf_checkpoint=gguf_checkpoint,
            for_offload=self.prefetch_count > 0,
            device_map=device_map,
            max_memory=max_memory,
        )

    @torch.inference_mode()
    def edit(
        self,
        prompt: str,
        images: Sequence[Image.Image],
        image_size: tuple[int, int],
        cfg_scale: float = 4.0,
        img_cfg_scale: float = 1.0,
        cfg_norm: str = "none",
        timestep_shift: float = 3.0,
        cfg_interval: tuple[float, float] = (0.0, 1.0),
        num_steps: int = 50,
        batch_size: int = 1,
        think_mode: bool = False,
        seed: int = 0,
    ) -> tuple[list[Image.Image], str]:
        with make_offload_ctx(self.model, self.prefetch_count, self.device) as offloaded:
            output = offloaded.it2i_generate(
                self.tokenizer,
                prompt,
                list(images),
                image_size=image_size,
                cfg_scale=cfg_scale,
                img_cfg_scale=img_cfg_scale,
                cfg_norm=cfg_norm,
                timestep_shift=timestep_shift,
                cfg_interval=cfg_interval,
                num_steps=num_steps,
                batch_size=batch_size,
                think_mode=think_mode,
                seed=seed,
            )
        if think_mode:
            return _to_pil(output[0]), output[1]
        return _to_pil(output), ""


def _save_images(
    images: Sequence[Image.Image],
    out_path: Path,
) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if len(images) == 1:
        images[0].save(out_path)
        print(f"[saved] {out_path}")
        return
    for i, img in enumerate(images):
        p = out_path.with_name(f"{out_path.stem}_{i}{out_path.suffix}")
        img.save(p)
        print(f"[saved] {p}")


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description="Image editing (it2i) inference for SenseNova-U1.")
    p.add_argument(
        "--model_path",
        required=True,
        help="HuggingFace Hub id (e.g. sensenova/SenseNova-U1-8B-MoT) or a local path.",
    )
    p.add_argument(
        "--lora_path",
        required=False,
        default=None,
        help="HuggingFace Hub id or a local path to a lora model.",
    )
    src = p.add_mutually_exclusive_group(required=True)
    src.add_argument(
        "--prompt",
        help="Edit instruction. When the prompt does not include an ``<image>`` "
        "placeholder, the model prepends one per input image automatically. "
        "Requires --image.",
    )
    src.add_argument(
        "--jsonl",
        help='JSONL file, one sample per line. Required: {"prompt": str, '
        '"image": str | list[str]}. Optional: {"width": int, "height": int, '
        '"seed": int, "type": str}. When "width" and "height" are both '
        "present they override --width / --height for that sample.",
    )
    p.add_argument(
        "--image",
        nargs="+",
        metavar="PATH",
        help="One or more input image paths (only used with --prompt).",
    )
    p.add_argument("--output", default="output.png", help="Output path when using --prompt.")
    p.add_argument("--output_dir", default="outputs", help="Output directory when using --jsonl.")
    p.add_argument(
        "--width",
        type=int,
        default=None,
        help=(
            "Explicit output width in pixels. Must be given together with --height, "
            f"and must be a multiple of {_IMAGE_GRID_FACTOR}. "
            "When both --width and --height are omitted the output resolution is "
            "derived from the first input image: aspect ratio preserved, total "
            "pixels normalized to --target_pixels."
        ),
    )
    p.add_argument(
        "--height",
        type=int,
        default=None,
        help=f"Explicit output height in pixels. See --width. Must be a multiple of {_IMAGE_GRID_FACTOR}.",
    )
    p.add_argument(
        "--target_pixels",
        type=int,
        default=DEFAULT_TARGET_PIXELS,
        help=(
            "Target pixel count for the auto-derived output resolution "
            f"(default: {DEFAULT_TARGET_PIXELS} = 2048*2048). The first input "
            "image's aspect ratio is preserved and H*W is rescaled to approximately "
            f"this target, with each side snapped to a multiple of {_IMAGE_GRID_FACTOR}. "
            "Ignored when --width / --height are given."
        ),
    )
    p.add_argument(
        "--input_max_pixels",
        default="auto",
        help=(
            "Maximum pixels per input/reference image before vision encoding. "
            "Use an integer (for example 1048576 for 1024*1024) or 'auto' to "
            "keep up to two inputs at 2048*2048 and divide that total budget across more inputs. "
            "Default: auto."
        ),
    )
    p.add_argument(
        "--do_resize",
        "--do-resize",
        dest="do_resize",
        action=argparse.BooleanOptionalAction,
        default=True,
        help=(
            "Whether to resize input/reference images to the input pixel budget before model preprocessing. "
            "Enabled by default. If disabled, the model's native image preprocessing still applies its "
            "own size limits."
        ),
    )
    p.add_argument(
        "--cfg_scale",
        type=float,
        default=4.0,
        help="Text CFG weight. Higher values track the edit instruction more aggressively.",
    )
    p.add_argument(
        "--img_cfg_scale",
        type=float,
        default=1.0,
        help="Image CFG weight (default: 1.0 = image CFG disabled).",
    )
    p.add_argument(
        "--cfg_norm",
        default="none",
        choices=["none", "global", "channel"],
        help=(
            "Classifier-free guidance rescaling mode. 'none' (default) is classical CFG; "
            "'global'/'channel' rescale the CFG output back to the conditional norm "
            "(globally / per-channel). Unlike t2i, 'cfg_zero_star' is not supported here."
        ),
    )
    p.add_argument("--timestep_shift", type=float, default=3.0)
    p.add_argument(
        "--cfg_interval",
        type=float,
        nargs=2,
        default=[0.0, 1.0],
        metavar=("LO", "HI"),
    )
    p.add_argument("--num_steps", type=int, default=50)
    p.add_argument("--batch_size", type=int, default=1)
    p.add_argument(
        "--seed",
        type=int,
        default=DEFAULT_SEED,
        help=(
            f"Random seed for reproducible sampling (default: {DEFAULT_SEED}). "
            "In --jsonl mode, a per-sample `seed` field in the JSONL overrides this."
        ),
    )
    p.add_argument(
        "--device",
        default=str(best_available_device()),
        help="Compute device, e.g. 'cuda', 'cuda:0', 'xpu', 'xpu:0', 'cpu'. Defaults to the best available accelerator.",
    )
    p.add_argument(
        "--dtype",
        default="bfloat16",
        choices=["bfloat16", "float16", "float32"],
    )
    add_offload_args(p)
    p.add_argument(
        "--gguf_checkpoint",
        default=None,
        help=(
            "Optional path to a .gguf quantized checkpoint. When set, the dequantizing "
            "diffusers GGUF Linear layer is used instead of safetensors weights. "
            "Requires the [gguf] extra (gguf>=0.10.0, diffusers>=0.30.0)."
        ),
    )
    p.add_argument(
        "--attn_backend",
        default="auto",
        choices=["auto", "flash", "sdpa"],
        help=(
            "Attention kernel used by the Qwen3 layers. "
            "'auto' picks flash-attn when it's importable and falls back to SDPA "
            "otherwise. 'flash' hard-requires flash-attn; 'sdpa' forces torch SDPA "
            "even when flash-attn is installed (useful for A/B-ing outputs)."
        ),
    )
    p.add_argument(
        "--profile",
        action="store_true",
        help=(
            "Print timing and CUDA memory stats: model load time, average "
            "per-image generation time, peak GPU memory, and the same time "
            f"normalized per image token (patch size = {DEFAULT_IMAGE_PATCH_SIZE})."
        ),
    )
    p.add_argument(
        "--think",
        action="store_true",
        help=(
            "Enable think mode (chain-of-thought reasoning). The model will "
            "reason about the edit before generating the output image. "
            "The thinking content is printed to stdout and saved to a "
            "``<stem>_think.txt`` file next to the output image."
        ),
    )
    p.add_argument(
        "--compare",
        action="store_true",
        help=(
            "Also save a side-by-side ``[inputs... | output]`` montage with the "
            "prompt rendered below, written next to the plain output as "
            "``<stem>_compare.png``. Useful for eyeballing edits without an "
            "external image viewer."
        ),
    )
    args = p.parse_args()
    if args.prompt is not None and not args.image:
        p.error("--prompt requires at least one --image.")
    if args.jsonl is not None and args.image:
        p.error("--image is only valid with --prompt; in --jsonl mode, put 'image' in the JSONL.")
    if (args.width is None) != (args.height is None):
        p.error("--width and --height must be given together (or both omitted).")
    if args.width is not None:
        if args.width % _IMAGE_GRID_FACTOR or args.height % _IMAGE_GRID_FACTOR:
            p.error(
                f"--width / --height must each be a multiple of {_IMAGE_GRID_FACTOR} (got {args.width}x{args.height})."
            )
    return args


def main() -> None:
    args = parse_args()
    dtype = {"bfloat16": torch.bfloat16, "float16": torch.float16, "float32": torch.float32}[args.dtype]
    sensenova_u1.set_attn_backend(args.attn_backend)
    print(f"[attn] backend={args.attn_backend!r} (effective={sensenova_u1.effective_attn_backend()!r})")
    profiler = InferenceProfiler(
        enabled=args.profile,
        device=args.device,
        config={
            "vram_mode": args.vram_mode,
            "attn_backend": sensenova_u1.effective_attn_backend(),
            "dtype": args.dtype,
            "gguf": args.gguf_checkpoint,
        },
    )
    with profiler.time_load():
        engine = SenseNovaU1Editing(
            args.model_path,
            device=args.device,
            dtype=dtype,
            gguf_checkpoint=args.gguf_checkpoint,
            device_map=args.device_map,
            max_memory=args.max_memory,
            vram_mode=args.vram_mode,
        )
    if args.lora_path is not None:
        print(f"load lora {args.lora_path}")
        engine.model = load_and_merge_lora_weight_from_safetensors(engine.model, args.lora_path)
    cfg_interval = tuple(args.cfg_interval)
    cli_explicit_size: tuple[int, int] | None = (args.width, args.height) if args.width is not None else None

    # Single-sample mode: --prompt with one or more --image paths.
    if args.prompt is not None:
        input_max_pixels = _resolve_input_max_pixels(args.input_max_pixels, len(args.image))
        _print_input_resize_hint(
            len(args.image), input_max_pixels, args.input_max_pixels or "model-default", args.do_resize
        )
        images = [_load_input_image(p, do_resize=args.do_resize, input_max_pixels=input_max_pixels) for p in args.image]
        w, h = _resolve_output_size(
            images,
            explicit=cli_explicit_size,
            target_pixels=args.target_pixels,
        )
        # _set_seed(args.seed)
        with profiler.time_generate(w, h, args.batch_size):
            outputs, think_text = engine.edit(
                args.prompt,
                images,
                image_size=(w, h),
                cfg_scale=args.cfg_scale,
                img_cfg_scale=args.img_cfg_scale,
                cfg_norm=args.cfg_norm,
                timestep_shift=args.timestep_shift,
                cfg_interval=cfg_interval,
                num_steps=args.num_steps,
                batch_size=args.batch_size,
                think_mode=args.think,
                seed=args.seed,
            )
        out_path = Path(args.output)
        _save_images(outputs, out_path)
        if think_text:
            print(f"[think] {think_text}")
            think_path = out_path.with_name(f"{out_path.stem}_think.txt")
            think_path.write_text(think_text, encoding="utf-8")
            print(f"[saved] {think_path}")
        if args.compare:
            save_compare(out_path, images, outputs[0], args.prompt)
        profiler.report()
        return

    # Batch mode: one sample per JSONL line, one output image per sample.
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(args.jsonl, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f if line.strip()]
    try:
        from tqdm import tqdm
    except ImportError:

        def tqdm(x, **_kw):  # type: ignore[no-redef]
            return x

    for i, sample in enumerate(tqdm(samples, desc="Editing")):
        paths = _coerce_image_paths(sample["image"])
        sample_input_max_pixels = sample.get("input_max_pixels", args.input_max_pixels)
        sample_do_resize = _coerce_bool(sample.get("do_resize", args.do_resize))
        input_max_pixels = _resolve_input_max_pixels(str(sample_input_max_pixels), len(paths))
        _print_input_resize_hint(
            len(paths),
            input_max_pixels,
            str(sample_input_max_pixels or "model-default"),
            sample_do_resize,
        )
        images = [_load_input_image(p, do_resize=sample_do_resize, input_max_pixels=input_max_pixels) for p in paths]
        w, h = _resolve_output_size(
            images,
            explicit=_explicit_size_from_sample(sample) or cli_explicit_size,
            target_pixels=args.target_pixels,
        )
        # _set_seed(int(sample.get("seed", args.seed)))
        with profiler.time_generate(w, h, 1):
            outputs, think_text = engine.edit(
                sample["prompt"],
                images,
                image_size=(w, h),
                cfg_scale=args.cfg_scale,
                img_cfg_scale=args.img_cfg_scale,
                cfg_norm=args.cfg_norm,
                timestep_shift=args.timestep_shift,
                cfg_interval=cfg_interval,
                num_steps=args.num_steps,
                batch_size=1,
                think_mode=args.think,
                # A per-sample "seed" overrides --seed, as the --seed help states.
                seed=int(sample.get("seed", args.seed)),
            )
        tag = sample.get("type")
        filename = f"{i + 1:04d}" + (f"_{tag}" if tag else "") + f"_{w}x{h}.png"
        sample_out = out_dir / filename
        outputs[0].save(sample_out)
        if think_text:
            # Match the ``<stem>_think.txt`` naming documented for --think.
            think_path = sample_out.with_name(f"{sample_out.stem}_think.txt")
            think_path.write_text(think_text, encoding="utf-8")
        if args.compare:
            save_compare(sample_out, images, outputs[0], sample["prompt"])
    profiler.report()


if __name__ == "__main__":
    main()
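
The `--jsonl` input format described in the help text above can be illustrated with a small standard-library sketch. `parse_samples` and `output_name` are hypothetical helpers, not part of the script. They mirror its per-sample handling: a single image path or a list of paths, an explicit size only when both `width` and `height` are present, a per-sample `seed` as documented in the `--seed` help, and the `NNNN[_type]_WxH.png` output naming. The default seed of 42 and the 2048x2048 fallback size are placeholders for illustration only.

```python
from __future__ import annotations

import json


def parse_samples(text: str, default_seed: int = 42) -> list[dict]:
    """Parse JSONL text into normalized sample dicts (hypothetical helper)."""
    samples = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are skipped, as in the script
        raw = json.loads(line)
        image = raw["image"]
        # "image" may be a single path or a list of paths.
        paths = [str(p) for p in image] if isinstance(image, list) else [str(image)]
        # An explicit size applies only when both keys are present.
        size = None
        if "width" in raw and "height" in raw:
            size = (int(raw["width"]), int(raw["height"]))
        samples.append(
            {
                "prompt": raw["prompt"],
                "images": paths,
                "size": size,
                "seed": int(raw.get("seed", default_seed)),
                "type": raw.get("type"),
            }
        )
    return samples


def output_name(index: int, tag: str | None, width: int, height: int) -> str:
    """Build the per-sample file name: 1-based index, optional tag, size."""
    return f"{index + 1:04d}" + (f"_{tag}" if tag else "") + f"_{width}x{height}.png"


JSONL = """
{"prompt": "Make the sky purple", "image": "images/a.png", "type": "color"}
{"prompt": "Merge both", "image": ["images/a.png", "images/b.png"], "width": 1024, "height": 768, "seed": 7}
"""

samples = parse_samples(JSONL)
# (2048, 2048) stands in for the auto-derived resolution when no size is given.
names = [output_name(i, s["type"], *(s["size"] or (2048, 2048))) for i, s in enumerate(samples)]
print(names)
```

The first sample has no explicit size, so in the real script its resolution would come from the first input image and `--target_pixels`; the second sample pins both the size and the seed.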