Commit 2a934cec authored by raojy's avatar raojy
Browse files

first

parent 4b618aa3
# Base (100 NFE) vs Distilled (8 NFE)
[← Back to README](../README.md).
## Run Base and Distilled Model
```bash
# Taking T2I for example
# Run Base
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--jsonl examples/t2i/data/samples.jsonl \
--output_dir outputs/ \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--profile
# Run 8-step preview model (deprecated)
python examples/t2i/inference.py \
--model_path SenseNova-U1-8B-MoT-8step-preview \
--jsonl examples/t2i/data/samples.jsonl \
--output_dir outputs/ \
--cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 8 \
--profile
# Run 8-step LoRA
huggingface-cli download sensenova/SenseNova-U1-8B-MoT-LoRAs --include "SenseNova-U1-8B-MoT-LoRA-8step-V1.0.safetensors" --local-dir ./sensenova/SenseNova-U1-8B-MoT-LoRAs/ --local-dir-use-symlinks False
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--lora_path sensenova/SenseNova-U1-8B-MoT-LoRAs/SenseNova-U1-8B-MoT-LoRA-8step-V1.0.safetensors \
--jsonl examples/t2i/data/samples.jsonl \
--output_dir outputs/ \
--cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 8 \
--profile
```
---
## Text-to-Image
| SenseNova-U1-8B-MoT (100 NFE) | SenseNova-U1-8B-MoT-8step-preview (8 NFE) |
|---|---|
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/b957b661-f916-40b7-8457-bb76ed942ce1" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/5f921347-abc7-4034-b37e-166150c46116" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/2b06c91d-320b-4344-b2b7-3eecc3655afb" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/aacbf7cd-6641-4046-a148-e56227fc1ba6" /> |
| <img width="2720" height="1536" alt="Image" src="https://github.com/user-attachments/assets/eeec15de-6004-4acf-9947-b29cea0cd404" /> | <img width="2720" height="1536" alt="Image" src="https://github.com/user-attachments/assets/2c6a933f-93bb-4b28-a959-9998187626be" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/ebfa083c-e1a3-47b1-baec-18204c19ab8a" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/11326e6c-134a-4ceb-ac17-07ddcc430200" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/aabfeca6-f346-4787-9bbf-b90c66250262" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/ae14ae4f-ba42-4c49-8d29-d76031dff130" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/c60dc057-301f-4df4-ae73-270a67ea928f" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/93b76c40-3333-4738-b977-7981635b162b" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/1d90cfd5-54ae-4669-9cde-51b0b911405e" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/52db191b-dce3-4c85-bfb7-e02a45bef0fa" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/eaf8f018-de5d-4ae6-bfde-895c12833480" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/2a1f04c7-a990-403c-904f-28aaa31f1c33" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/fc7d04f5-f2f0-4aa0-8cda-f0f06defcdd9" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/e96c4e91-2d99-42dc-a521-8a8922fbf580" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/3a949ced-c7f1-4175-b478-60792fd634a4" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/cbcda1d3-ec83-4d3a-b1b5-1130fc67095e" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/dc93499e-f7f7-4d08-8d6f-69f27b3b943e" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/fd868c8e-ddaf-4fa0-b3dc-976cab223443" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/e62aa2ed-650a-44b5-b74d-b8aa150ec8e5" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/7b7bc125-5d56-43f2-9938-a9c73489a563" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/8704123c-7afe-4a77-a878-883c354d9e27" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/dfea87b5-0295-4713-8875-313d041a1623" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/c79cd8dc-53c7-4586-b7d0-9002db1b9829" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/e91b1502-a93f-475f-b408-141748e57af5" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/d158e591-fa2e-4e5d-b61c-1e5f4692b529" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/158ebf30-144d-4cb9-8708-5eca39def04e" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/4f92e25a-3d4f-4b0f-9e13-9027dc688df6" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/0e1104f3-a5df-4291-bb34-5ecd847f12bb" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/c11fedcb-8362-4315-8cf6-7a6652746260" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/644938ee-adbf-4f2e-beec-9ace2abb7a3e" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/79982f47-76d0-42fd-bac2-783cdc3c6cab" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/54562cd5-895c-4ee1-a4da-371865692eae" /> |
| <img width="2496" height="1664" alt="Image" src="https://github.com/user-attachments/assets/ca7519ee-bfbc-41d3-acb7-070cbf8fc588" /> | <img width="2496" height="1664" alt="Image" src="https://github.com/user-attachments/assets/60ff8170-5826-4ffc-97b4-9324d58b98cb" /> |
| <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/16b69220-2594-4517-beca-f70954895937" /> | <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/0af69452-5e1d-4982-a26b-a860c7051656" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/46c13011-9f86-4ed1-b167-0e0112606136" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/021349a3-d92a-4d3d-930d-50d3afa5adc1" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/920a325a-ba53-4dfd-8843-863cc20a2d44" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/8d82ee0f-2051-4112-a47e-31ed03fce4d5" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/08232762-81fb-4aa6-b001-389064e73d2e" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/ec84c507-8c79-4a74-b325-3ba1e5196f69" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/ecebe0ae-9f83-430b-ad0d-1c26689cadc8" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/ab892f47-6786-4c91-a736-0a034e3e8b3c" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/da932a5c-e13d-43f6-a619-6c057a7d4826" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/8eb5874e-a8fe-427a-a5f0-8633681a77ec" /> |
| <img width="1664" height="2496" alt="Image" src="https://github.com/user-attachments/assets/522c9f71-56c0-4924-b099-61bc19646a7e" /> | <img width="1664" height="2496" alt="Image" src="https://github.com/user-attachments/assets/7fdd668f-d6d6-4295-91c8-f3f2caa43bd6" /> |
| <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/996f59cc-eabb-45c1-b9b9-f2553fc42ec8" /> | <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/07cfc906-ac47-496e-97b5-19c5ff8f785a" /> |
| <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/bcf618f1-50fe-40b7-8d14-a14bc294abf3" /> | <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/b2e3c2a0-2970-40f1-b678-5af71d7640c9" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/018504fd-ffc4-44ac-b8d0-ff2109cd4b2e" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/bad8ac39-e41a-4e6b-a390-0b2fd99d9af7" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/28bc226b-122e-48f9-b2cd-46816274d74c" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/1390480b-f405-4615-b3b2-b0ebda9408b1" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/d0a1e419-294b-4c59-896a-a8ae85668e40" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/9d96b38a-f5e9-4c00-937f-88e6c50e50f3" /> |
| <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/45eb0bc6-bce2-42ac-b5c3-cab26663fd7d" /> | <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/ea842ff6-a621-41f5-8de1-889b0565e155" /> |
| <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/39bc2e0d-7eb7-4b94-82f9-d51766c4dc53" /> | <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/7c390a71-cce2-4c1d-bf72-c7fb73d52cec" /> |
| <img width="2720" height="1536" alt="Image" src="https://github.com/user-attachments/assets/6540e91d-012b-4d9a-b5b2-133127019a89" /> | <img width="2720" height="1536" alt="Image" src="https://github.com/user-attachments/assets/63a55bd7-95ce-4447-b184-d7bafa0bed12" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/c3f26e79-18ca-4e80-9f8a-284cff18bbbe" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/98afbfa3-4e2e-4087-8e65-cff2efd33f52" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/9675fe23-dcb0-4959-964e-bb57d71a8901" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/a1b473eb-205f-49e8-b069-8b16556c3d5a" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/a3c91778-740f-42be-9d85-cfaa2b3a34d6" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/e07f3884-d46c-4eef-af88-0b12638f7525" /> |
| <img width="1664" height="2496" alt="Image" src="https://github.com/user-attachments/assets/f76c6220-9f95-4a58-8832-ffd825f28ee9" /> | <img width="1664" height="2496" alt="Image" src="https://github.com/user-attachments/assets/d837428c-6d27-4d6a-90c4-2c29da250785" /> |
| <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/7ab2b3e8-5084-4584-9bf4-00c59ccddbf5" /> | <img width="2368" height="1760" alt="Image" src="https://github.com/user-attachments/assets/0de36a7f-28f4-40d6-9df8-a746092dbce6" /> |
| <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/a6818c15-ee27-4d0b-9e57-29ae0fe89bf9" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/c40f4df8-99a6-4ca7-b953-e0f849d6bcd7" /> |
| <img width="1760" height="2368" alt="Image" src="https://github.com/user-attachments/assets/42aa2e76-fbba-41ce-a84b-55b7e2e20f46" /> | <img width="1760" height="2368" alt="Image" src="https://github.com/user-attachments/assets/7b66e1a2-281d-4ccc-b885-872be4afa59a" /> |
| <img width="1664" height="2496" alt="Image" src="https://github.com/user-attachments/assets/f1f76b33-5eed-4e8b-9e81-27eceb61f600" /> | <img width="1664" height="2496" alt="Image" src="https://github.com/user-attachments/assets/394873c1-745f-4bb7-b1d3-e0dface04522" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/6ddcaf07-8e87-4460-b151-a596064af0e5" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/43085347-61ad-4640-9c94-903a48d0937c" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/dda19b41-5467-4375-9079-f28fec74a8d8" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/c2e3ef48-831d-446b-946b-b9c8cab81bf7" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/cb88be1c-6e4b-4e2b-924c-99d2f4ea9463" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/97900093-98ec-495c-8867-359f8d73dadd" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/6f7acfca-aeb7-4dc0-8efa-53a325f830aa" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/f2aef2a2-29b4-411d-8321-42f57662015c" /> |
| <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/f027e3d5-a08f-48bb-b2e3-9ab13b05dea2" /> | <img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/5ab3f0a0-b803-4251-b96f-0d7140ad2df1" /> |
## Image-Editing
| Reference Image | SenseNova-U1-8B-MoT (100 NFE) | SenseNova-U1-8B-MoT-8step-preview (8 NFE) |
|---|---|---|
| <img width="640" height="640" alt="3.webp" src="../examples/editing/data/images/3.webp" /> | <img width="1536" height="2752" alt="Image" src="https://github.com/user-attachments/assets/8fcbcf8f-ad6a-4ea8-af82-5c99f37ffa11" /> | <img width="1536" height="2752" alt="Image" src="https://github.com/user-attachments/assets/e6878663-e64f-407c-9f1e-88987591b977" /> |
| <img width="640" height="640" alt="6.webp" src="../examples/editing/data/images/6.webp" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/3f2341f0-b20d-44b3-8025-b92d87e0a0fc" /> | <img width="2048" height="2048" alt="Image" src="https://github.com/user-attachments/assets/cfc7e1c9-5790-4414-b1e0-f7a2bde3d3f2" /> |
| <img width="640" height="640" alt="8.webp" src="../examples/editing/data/images/8.webp" /> | <img width="1696" height="2528" alt="Image" src="https://github.com/user-attachments/assets/36ce560e-1dd1-4b10-8aee-2dda74003e42" /> | <img width="1696" height="2528" alt="Image" src="https://github.com/user-attachments/assets/3a1997b1-96cf-4bbd-a637-ae9793985811" /> |
## Existing Issues
A issue have been identified in the SenseNova-U1-8B-MoT-LoRA-8step-V1.0 (8 NFE), and we are actively working to resolve them.
- Grid artifacts may occur in certain instances.
<img width="1536" height="2720" alt="Image" src="https://github.com/user-attachments/assets/7b97a6a7-4e1a-4af8-884f-5df134cfdc3b" />
# LightLLM + LightX2V Deployment
This guide provides a practical deployment flow for serving SenseNova-U1 with
LightLLM + LightX2V using the Docker image
`lightx2v/lightllm_lightx2v:20260407`.
## 1) Pull and enter the Docker image
```bash
docker pull lightx2v/lightllm_lightx2v:20260407
docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```
## 2) Clone runtime dependencies inside the container
The image may not include the latest source trees. Clone both repositories and
pin LightLLM to the validated branch:
```bash
git clone https://github.com/ModelTC/LightX2V.git
git clone https://github.com/ModelTC/LightLLM.git
cd LightLLM
git checkout neo_plus_clean
```
## 3) X2I-related arguments
When enabling image generation in the same API server, use the following flags:
- `--enable_multimodal_x2i`
Enable image generation capability.
- `--x2i_server_used_gpus`
Number of GPUs reserved for the X2I generation server.
- `--x2i_server_deploy_mode {colocate,separate}`
- `colocate`: understanding and generation share the same visible GPU pool.
- `separate`: understanding and generation are deployed as separate services, and
can use different GPU sets.
- `--x2i_use_naive_impl`
Use the native/naive PyTorch backend for X2I (debugging/testing only, not for
production throughput).
## 4) Deployment modes
### Mode A: `colocate` (single service, shared GPU pool)
Use this mode for quick validation and simpler operations. The LLM understanding
path (`--tp`) and X2I generation path (`--x2i_server_used_gpus`) consume resources
from the same visible GPUs.
Example (2 GPUs total):
- understanding path: `tp=2`
- generation path: `cfg=2` (configured in `neopp_dense_parallel_cfg.json`)
```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
--model_dir $MODEL_DIR \
--enable_multimodal_x2i \
--x2i_server_deploy_mode colocate \
--x2i_server_used_gpus 2 \
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
--host 0.0.0.0 \
--port 8000 \
--max_req_total_len 65536 \
--mem_fraction 0.75 \
--tp 2
```
### Mode B: `separate` (understanding and generation decoupled)
`separate` is conceptually similar to PD-style decoupling in LLM serving: split
different stages onto different GPU groups so a long stage does not block the
short stage.
For multimodal serving, image generation is usually the long stage, while
understanding is short and lightweight. Separating them allows understanding
requests to keep flowing even when generation workers are busy.
Recommended deployment profiles:
1. **Default profile (continuity-first): Understanding `tp=1` + Generation 1 GPU**
- Understanding: `--tp 1`
- Generation: `--x2i_server_used_gpus 1`
- Use as the baseline profile for mixed workloads. It keeps the pipeline simple
while avoiding head-of-line blocking between understanding and generation.
2. **Understanding-expanded profile: Understanding `tp=2` + Generation 1 GPU**
- Understanding: `--tp 2`
- Generation: `--x2i_server_used_gpus 1`
- Use when complex prompts or high understanding QPS become the bottleneck.
3. **Generation-expanded profile: Understanding `tp=1/2` + Generation parallel**
- Understanding: `--tp 1` or `--tp 2`
- Generation option A (2 GPUs): `--x2i_server_used_gpus 2` +
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
- Generation option B (4 GPUs): `--x2i_server_used_gpus 4` +
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
- Use when generation latency/throughput dominates (most common scaling path).
Example launch (separate mode in API server):
```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
--model_dir $MODEL_DIR \
--enable_multimodal_x2i \
--x2i_server_deploy_mode separate \
--x2i_server_used_gpus 1 \
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
--host 0.0.0.0 \
--port 8000 \
--max_req_total_len 65536 \
--mem_fraction 0.75 \
--tp 2
```
## 5) Quantization
`separate` mode also enables independent quantization strategies for
understanding and generation.
Because understanding and generation are decoupled, you can tune quality/latency
for each path independently:
1. **Understanding FP16/BF16 + Generation FP8**
- Understanding: no quantization flag (keep default precision)
- Generation: use FP8 generation config, for example
`/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
- Recommended as the default quantized profile for production.
2. **Understanding FP8 + Generation FP8**
- Understanding: add `--quant_type fp8w8a8`
- Generation: use FP8 generation config
`/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
- Use when GPU memory/throughput is the primary constraint.
Notes:
- `--quant_type fp8w8a8` controls quantization on the understanding path.
- Generation-side precision is controlled by `--x2v_gen_model_config`.
## 6) OpenAI-compatible API
Once the API server is up, you can send requests through the OpenAI-compatible
endpoint exposed by LightLLM. A minimal text-to-image example:
```bash
python examples/serving/client.py \
--mode t2i \
--prompt "A cozy coffee shop storefront with infographic style."
```
See [`examples/serving/client.py`](../examples/serving/client.py) for more modes
(VQA, editing, interleaved) and request formats.
\ No newline at end of file
# LightLLM + LightX2V 部署
本文档介绍如何基于 Docker 镜像 `lightx2v/lightllm_lightx2v:20260407`,通过 LightLLM + LightX2V 部署 SenseNova-U1 推理服务。
## 1) 拉取并进入 Docker 镜像
```bash
docker pull lightx2v/lightllm_lightx2v:20260407
docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```
## 2) 在容器内克隆运行时依赖
镜像中自带的源码未必是最新版本,建议重新克隆这两个仓库,并将 LightLLM 切到已验证的分支:
```bash
git clone https://github.com/ModelTC/LightX2V.git
git clone https://github.com/ModelTC/LightLLM.git
cd LightLLM
git checkout neo_plus_clean
```
## 3) X2I 相关参数
在同一个 API 服务中开启图像生成时,用到以下参数:
- `--enable_multimodal_x2i`
开启图像生成能力。
- `--x2i_server_used_gpus`
分配给 X2I 生成服务的 GPU 数量。
- `--x2i_server_deploy_mode {colocate,separate}`
- `colocate`:理解与生成共用同一块可见 GPU 资源池。
- `separate`:理解与生成拆分为独立服务,可分别占用不同的 GPU。
- `--x2i_use_naive_impl`
X2I 使用原生 PyTorch 实现,仅用于调试与测试,不建议在生产环境追求吞吐量时使用。
## 4) 部署模式
### 模式 A:`colocate`(单服务共用 GPU)
适合快速验证与简化运维。LLM 理解路径(`--tp`)与 X2I 生成路径(`--x2i_server_used_gpus`)从同一组可见 GPU 中分配资源。
示例(共 2 张 GPU):
- 理解路径:`tp=2`
- 生成路径:`cfg=2`(在 `neopp_dense_parallel_cfg.json` 中配置)
```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
--model_dir $MODEL_DIR \
--enable_multimodal_x2i \
--x2i_server_deploy_mode colocate \
--x2i_server_used_gpus 2 \
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
--host 0.0.0.0 \
--port 8000 \
--max_req_total_len 65536 \
--mem_fraction 0.75 \
--tp 2
```
### 模式 B:`separate`(理解与生成分离部署)
`separate` 的思路与 LLM 服务中的 PD 分离类似:将不同阶段放到不同的 GPU 组上,避免长阶段拖慢短阶段。
在多模态场景下,图像生成通常是长阶段,而理解请求轻量且耗时较短。将两者分离后,即便生成 worker 被占满,理解请求依然能正常流转。
推荐的部署配置方案:
1. **默认方案(以稳定性为先):理解 `tp=1` + 生成 1 GPU**
- 理解:`--tp 1`
- 生成:`--x2i_server_used_gpus 1`
- 适合作为混合负载下的基线方案。pipeline 简单,又能避免理解与生成互相产生队头阻塞。
2. **理解加强方案:理解 `tp=2` + 生成 1 GPU**
- 理解:`--tp 2`
- 生成:`--x2i_server_used_gpus 1`
- 适用于复杂 prompt 或高理解 QPS 成为瓶颈的场景。
3. **生成加强方案:理解 `tp=1/2` + 生成并行**
- 理解:`--tp 1``--tp 2`
- 生成方案 A(2 GPU):`--x2i_server_used_gpus 2` +
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
- 生成方案 B(4 GPU):`--x2i_server_used_gpus 4` +
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
- 适用于生成延迟/吞吐量占主导的场景,也是最常见的扩容路径。
`separate` 模式启动 API 服务示例:
```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
--model_dir $MODEL_DIR \
--enable_multimodal_x2i \
--x2i_server_deploy_mode separate \
--x2i_server_used_gpus 1 \
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
--host 0.0.0.0 \
--port 8000 \
--max_req_total_len 65536 \
--mem_fraction 0.75 \
--tp 2
```
## 5) 量化
`separate` 模式的另一个好处是,理解与生成可以各自采用独立的量化策略。
两条路径解耦后,可分别针对各自的质量与延迟目标进行调优:
1. **理解 FP16/BF16 + 生成 FP8**
- 理解:不加量化参数,保持默认精度。
- 生成:使用 FP8 生成配置,例如
`/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
- 推荐作为生产环境的默认量化方案。
2. **理解 FP8 + 生成 FP8**
- 理解:添加 `--quant_type fp8w8a8`
- 生成:使用 FP8 生成配置
`/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
- 适用于 GPU 显存或吞吐量吃紧的场景。
说明:
- `--quant_type fp8w8a8` 控制理解路径的量化精度。
- 生成侧的精度由 `--x2v_gen_model_config` 决定。
## 6) OpenAI 兼容 API
API 服务启动之后,可直接通过 LightLLM 暴露的 OpenAI 兼容端点发送请求。下面是一个最简的文生图示例:
```bash
python examples/serving/client.py \
--mode t2i \
--prompt "A cozy coffee shop storefront with infographic style."
```
更多模式(VQA、图像编辑、图文交错生成)及请求格式,详见 [`examples/serving/client.py`](../examples/serving/client.py)
# GPU Memory Profiling
This document records VRAM usage and performance benchmarks for the SenseNova-U1-8B-MoT model across different inference tasks. All tests are run with the `--profile` flag on a single NVIDIA H100 80G GPU.
---
## Text-to-Image
Standard text-to-image inference without chain-of-thought.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--profile
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 88.857 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.59 GiB
generations : 1 call(s), 1 image(s) total, 22.108 s wall
avg per image : 22.108 s
image tokens : patch_size=32, avg 4096 tok/image (4096)
throughput : 185.27 tok/s
generation peak mem : allocated 34.83 GiB, reserved 35.82 GiB, cpu RSS 5.59 GiB
================================================================
```
---
## Text-to-Image with Chain-of-Thought
Enables chain-of-thought reasoning (`--think`), where the model outputs its reasoning process before generating the image. Generation time and VRAM usage increase slightly.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--profile \
--think \
--print_think
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 82.060 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.58 GiB
generations : 1 call(s), 1 image(s) total, 38.342 s wall
avg per image : 38.342 s
image tokens : patch_size=32, avg 4096 tok/image (4096)
throughput : 106.83 tok/s
generation peak mem : allocated 35.02 GiB, reserved 35.94 GiB, cpu RSS 5.58 GiB
================================================================
```
---
## Image Editing
Image editing requires both an input image and an editing instruction. Processing the additional input image results in higher peak VRAM usage compared to plain text-to-image.
```bash
python examples/editing/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "Change the man's coat to yellow." \
--image examples/editing/data/images/1.webp \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--output output_edited.png \
--profile \
--compare
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 80.541 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.61 GiB
generations : 1 call(s), 1 image(s) total, 25.871 s wall
avg per image : 25.871 s
image tokens : patch_size=32, avg 4029 tok/image (4029)
throughput : 155.74 tok/s
generation peak mem : allocated 39.50 GiB, reserved 41.32 GiB, cpu RSS 5.61 GiB
================================================================
```
---
## Interleaved Text-Image Generation
Interleaved generation produces multiple images and corresponding text in a single inference call. Per-image token count is lower, but overall VRAM usage and wall time are substantially higher.
```bash
python examples/interleave/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT/ \
--prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." \
--resolution "16:9" \
--output_dir outputs/interleave/ \
--stem demo \
--profile
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 74.821 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.63 GiB
generations : 1 call(s), 6 image(s) total, 296.118 s wall
avg per image : 49.353 s
image tokens : patch_size=32, avg 2304 tok/image (2304)
throughput : 46.68 tok/s
generation peak mem : allocated 49.22 GiB, reserved 69.18 GiB, cpu RSS 5.63 GiB
================================================================
```
---
## Task Comparison Summary
| Task | Load Peak VRAM (GiB) | Gen Peak VRAM (GiB) | CPU RSS (GiB) | Avg Time (s) | Throughput (tok/s) |
|-------------|:--------------------:|:-------------------:|:-------------:|:------------:|:------------------:|
| t2i | 32.77 / 33.10 | 34.83 / 35.82 | 5.59 | 22.108 | 185.27 |
| t2i-think | 32.77 / 33.10 | 35.02 / 35.94 | 5.58 | 38.342 | 106.83 |
| editing | 32.77 / 33.10 | 39.50 / 41.32 | 5.61 | 25.871 | 155.74 |
| interleave | 32.77 / 33.10 | 49.22 / 69.18 | 5.63 | 49.353 | 46.68 |
> VRAM columns are formatted as `allocated / reserved`. CPU RSS is the peak RSS during the generation phase.
## Low-VRAM Inference (Text-to-Image as Example)
### VRAM Budget Cap (`--max_memory`)
The `--max_memory` parameter caps the GPU VRAM budget to simulate consumer-grade GPUs with varying VRAM capacities, covering 32 GB (e.g. RTX 5090), 24 GB (e.g. RTX 4090), 16 GB (e.g. RTX 4080), 12 GB (e.g. RTX 4070), and 8 GB (e.g. RTX 4060). Model layers exceeding the VRAM budget are automatically offloaded to CPU memory, so CPU RSS rises significantly as the GPU budget decreases.
> It is recommended to set `max_memory` slightly below the GPU's physical VRAM (e.g. use `26GiB`–`28GiB` for a 32 GB card) to leave enough headroom and avoid OOM errors during inference.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "..." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--device_map auto \
--max_memory "0=<N>GiB,cpu=80GiB" \
--profile
```
| GPU Budget | Target GPU | Load Peak VRAM (GiB) | Gen Peak VRAM (GiB) | Load CPU RSS (GiB) | Gen CPU RSS (GiB) | Avg Time (s) | Throughput (tok/s) |
|:----------:|:------------------:|:--------------------:|:-------------------:|:-----------------:|:-----------------:|:------------:|:------------------:|
| 27 GiB | RTX 5090 (32 GB) | 25.71 / 25.71 | 27.76 / 28.31 | 5.62 | 10.27 | 87.692 | 46.71 |
| 20 GiB | RTX 4090 (24 GB) | 18.52 / 18.52 | 20.58 / 21.12 | 5.59 | 19.50 | 174.961 | 23.41 |
| 13 GiB | RTX 4080 (16 GB) | 11.33 / 11.34 | 13.39 / 13.93 | 5.62 | 24.12 | 250.757 | 16.33 |
| 9 GiB | RTX 4070 (12 GB) | 7.74 / 7.74 | 9.79 / 10.33 | 5.55 | 28.76 | 290.039 | 14.12 |
| 7 GiB | RTX 4060 (8 GB) | 5.58 / 5.59 | 7.64 / 8.18 | 5.56 | 28.76 | 316.323 | 12.95 |
> VRAM columns are formatted as `allocated / reserved`. As the GPU budget decreases, model layers are progressively offloaded to CPU, causing CPU RSS to rise and inference throughput to drop.
### VRAM Optimization Mode (`--vram_mode`)
The `--vram_mode` parameter selects the VRAM optimization strategy, trading off inference speed against VRAM footprint.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "..." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--vram_mode <full|balanced|low> \
--profile
```
| `--vram_mode` | Strategy | Load Peak VRAM (GiB) | Gen Peak VRAM (GiB) | Load CPU RSS (GiB) | Gen CPU RSS (GiB) | Avg Time (s) | Throughput (tok/s) |
|:-------------:|:----------------------------------------------------------------------|:--------------------:|:-------------------:|:-----------------:|:-----------------:|:------------:|:------------------:|
| `full` | Entire model resident on GPU, no offload (default, fastest) | 32.77 / 33.10 | 34.83 / 35.82 | 5.59 | 5.59 | 22.108 | 185.27 |
| `balanced` | Async prefetch (H2D overlapped with compute), greatly reduced VRAM | — | 6.78 / 12.50 | 0.95 | 47.18 | 112.394 | 36.44 |
| `low` | Synchronous CPU↔GPU swap per layer, minimum GPU VRAM, slowest | — | 5.34 / 5.85 | 0.98 | 47.22 | 130.191 | 31.46 |
> VRAM columns are formatted as `allocated / reserved`. `balanced` and `low` modes use lazy loading — no GPU VRAM is allocated during model load (shown as —); weights are swapped in on demand, causing CPU RSS to rise significantly during generation.
# 显存性能分析
本文档记录了 SenseNova-U1-8B-MoT 模型在不同推理任务下的显存占用与性能基准数据。所有测试均通过 `--profile` 参数启用,运行环境为单张 NVIDIA H100 80G GPU。
---
## 文生图
标准文生图推理,不启用思维链。
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--profile
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 88.857 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.59 GiB
generations : 1 call(s), 1 image(s) total, 22.108 s wall
avg per image : 22.108 s
image tokens : patch_size=32, avg 4096 tok/image (4096)
throughput : 185.27 tok/s
generation peak mem : allocated 34.83 GiB, reserved 35.82 GiB, cpu RSS 5.59 GiB
================================================================
```
---
## 文生图(思维链)
启用思维链推理(`--think`),模型在生成图像前先输出推理过程,生成耗时和显存略有增加。
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--profile \
--think \
--print_think
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 82.060 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.58 GiB
generations : 1 call(s), 1 image(s) total, 38.342 s wall
avg per image : 38.342 s
image tokens : patch_size=32, avg 4096 tok/image (4096)
throughput : 106.83 tok/s
generation peak mem : allocated 35.02 GiB, reserved 35.94 GiB, cpu RSS 5.58 GiB
================================================================
```
---
## 图像编辑
图像编辑任务需同时输入原图与编辑指令,因额外处理输入图像,生成峰值显存高于纯文生图。
```bash
python examples/editing/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "Change the man's coat to yellow." \
--image examples/editing/data/images/1.webp \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--output output_edited.png \
--profile \
--compare
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 80.541 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.61 GiB
generations : 1 call(s), 1 image(s) total, 25.871 s wall
avg per image : 25.871 s
image tokens : patch_size=32, avg 4029 tok/image (4029)
throughput : 155.74 tok/s
generation peak mem : allocated 39.50 GiB, reserved 41.32 GiB, cpu RSS 5.61 GiB
================================================================
```
---
## 图文交错生成
交错生成任务会在一次推理中产生多张图像与对应文字,单图 token 数较少但整体显存和耗时显著更高。
```bash
python examples/interleave/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT/ \
--prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." \
--resolution "16:9" \
--output_dir outputs/interleave/ \
--stem demo \
--profile
```
```
================================================================
Profile summary
================================================================
config : vram_mode=full, attn_backend=flash, dtype=bfloat16
model load : 74.821 s
load peak memory : allocated 32.77 GiB, reserved 33.10 GiB, cpu RSS 5.63 GiB
generations : 1 call(s), 6 image(s) total, 296.118 s wall
avg per image : 49.353 s
image tokens : patch_size=32, avg 2304 tok/image (2304)
throughput : 46.68 tok/s
generation peak mem : allocated 49.22 GiB, reserved 69.18 GiB, cpu RSS 5.63 GiB
================================================================
```
---
## 各任务显存对比
| 任务 | 加载峰值显存 (GiB) | 生成峰值显存 (GiB) | CPU 内存 (GiB) | 平均耗时 (s) | 吞吐量 (tok/s) |
|----------------|:-----------------:|:-----------------:|:-------------:|:-----------:|:-------------:|
| t2i | 32.77 / 33.10 | 34.83 / 35.82 | 5.59 | 22.108 | 185.27 |
| t2i-think | 32.77 / 33.10 | 35.02 / 35.94 | 5.58 | 38.342 | 106.83 |
| editing | 32.77 / 33.10 | 39.50 / 41.32 | 5.61 | 25.871 | 155.74 |
| interleave | 32.77 / 33.10 | 49.22 / 69.18 | 5.63 | 49.353 | 46.68 |
> 显存列格式为 `allocated / reserved`;CPU 内存为生成阶段 RSS 峰值。
## 低显存推理(以文生图为例)
### 显存上限约束(`--max_memory`)
通过 `--max_memory` 参数限制 GPU 可用显存上限,模拟不同显存规格的消费级 GPU,涵盖 32 GB(如 RTX 5090)、24 GB(如 RTX 4090)、16 GB(如 RTX 4080)、12 GB(如 RTX 4070)及 8 GB(如 RTX 4060)等典型配置。模型超出显存上限的部分将自动卸载至 CPU 内存,因此 CPU RSS 会随 GPU 预算降低而显著升高。
> 建议将 `max_memory` 设置为略低于 GPU 物理显存的值(例如 32 GB 显卡可设为 `26GiB`–`28GiB`),以预留足够的显存余量,避免推理过程中出现 OOM。
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "..." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--device_map auto \
--max_memory "0=<N>GiB,cpu=80GiB" \
--profile
```
| GPU 预算 | 目标显卡 | 加载峰值显存 (GiB) | 生成峰值显存 (GiB) | 加载 CPU RSS (GiB) | 生成 CPU RSS (GiB) | 平均耗时 (s) | 吞吐量 (tok/s) |
|:----------:|:------------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|
| 27 GiB | RTX 5090 (32 GB) | 25.71 / 25.71 | 27.76 / 28.31 | 5.62 | 10.27 | 87.692 | 46.71 |
| 20 GiB | RTX 4090 (24 GB) | 18.52 / 18.52 | 20.58 / 21.12 | 5.59 | 19.50 | 174.961 | 23.41 |
| 13 GiB | RTX 4080 (16 GB) | 11.33 / 11.34 | 13.39 / 13.93 | 5.62 | 24.12 | 250.757 | 16.33 |
| 9 GiB | RTX 4070 (12 GB) | 7.74 / 7.74 | 9.79 / 10.33 | 5.55 | 28.76 | 290.039 | 14.12 |
| 7 GiB | RTX 4060 (8 GB) | 5.58 / 5.59 | 7.64 / 8.18 | 5.56 | 28.76 | 316.323 | 12.95 |
> 显存列格式为 `allocated / reserved`;随 GPU 预算降低,模型层逐步卸载至 CPU,生成 CPU RSS 相应升高,推理吞吐量下降。
### 显存优化模式(`--vram_mode`)
通过 `--vram_mode` 参数切换显存优化策略,在推理速度与显存占用之间进行权衡。
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "..." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--vram_mode <full|balanced|low> \
--profile
```
| `--vram_mode` | 策略说明 | 加载峰值显存 (GiB) | 生成峰值显存 (GiB) | 加载 CPU RSS (GiB) | 生成 CPU RSS (GiB) | 平均耗时 (s) | 吞吐量 (tok/s) |
|:-------------:|:------------------------------------------------|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------:|:-------------:|
| `full` | 整模型常驻 GPU,不做 offload(默认,速度最快) | 32.77 / 33.10 | 34.83 / 35.82 | 5.59 | 5.59 | 22.108 | 185.27 |
| `balanced` | 异步预取(H2D 与计算重叠),显存占用大幅降低 | — | 6.78 / 12.50 | 0.95 | 47.18 | 112.394 | 36.44 |
| `low` | 每层同步 CPU↔GPU 交换,GPU 显存占用最小,速度最慢 | — | 5.34 / 5.85 | 0.98 | 47.22 | 130.191 | 31.46 |
> 显存列格式为 `allocated / reserved`;`balanced` 和 `low` 模式采用懒加载,模型权重加载阶段无 GPU 显存分配(以 — 表示),运行时按需换入,因此生成阶段 CPU RSS 显著升高。
\ No newline at end of file
# Inference Infrastructure
This document describes the inference infrastructure behind **SenseNova-U1**, built on top of **[LightLLM](https://github.com/ModelTC/lightllm)** and **[LightX2V](https://github.com/ModelTC/lightx2v)**.
## Overview
SenseNova-U1 is exposed as one unified multimodal model, but the understanding and generation paths exhibit different execution shapes in production. They tend to prefer different scheduling policies, parallelization strategies, and resource ratios, rather than a single shared serving configuration. When both are coupled inside one monolithic runtime, these choices become unnecessarily tied together, which can leave both paths operating away from their respective optimal points.
To avoid this coupling, SenseNova-U1 adopts a **disaggregated** architecture:
- **LightLLM** for understanding, text streaming, and control flow
- **LightX2V** for image generation
These two engines exchange generation state through pinned shared memory and high-performance transfer kernels. The handoff is lightweight, while each side can still run with its own optimal execution policy.
![LightLLM + LightX2V decoupled architecture](./assets/lightllm_x2v.png)
This design provides practical benefits in production:
- Independent parallelism (for example, understanding with `TP=2` (Tensor Parallel=2), generation
with `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2)).
- Independent resource allocation (different GPU counts and memory budgets).
- Independent scaling for text-heavy vs. image-heavy traffic.
- Better operational isolation and simpler performance tuning.
The same architecture can be deployed in two modes, depending on your hardware budget and traffic pattern:
- **Separate**: LightLLM and LightX2V run on different GPU groups.
- **Colocate**: LightLLM and LightX2V run as separate processes on the same GPU.
In most production setups, `Separate` is the default choice because it gives clearer bottleneck control and independent scaling. `Colocate` is useful for quick validation, generation-heave scenes, or smaller GPU setups.
### Attention for Multimodal Prefill of NEO-Unify
NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
To preserve the causal-triangle speedup whenever possible, the kernel makes the decision per M-block. It OR-reduces the image_token_tag values inside the current block: if the block contains no image token, it keeps the standard causal K-range; if the block contains image tokens, it extends the K-range to cover the required image span. As a result, pure-text blocks still follow the normal causal path, while only the relevant blocks pay the extra work needed by the hybrid mask.
![NEO-Unify multimodal attention behavior](./assets/attn.png)
The overhead therefore does not depend on a fixed ratio, but on how image tokens are distributed across the sequence and across M-block boundaries. When image rows are concentrated in only part of the sequence, the extra work is correspondingly localized. For text-only requests, image_token_tag is empty, and the kernel falls back to vanilla FA3 with no additional overhead.
The benchmark below compares two implementations for Neo-style multimodal prefill:
- **Triton implementation**: easier to migrate into existing codebases, with lower
integration cost and faster iteration.
- **FA3 implementation**: higher absolute performance on supported hardware.
<div align="center">
| batch | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | speedup (×) |
|:-------:|:-----------:|:---------------:|:-----------:|:--------:|:-----------:|
| 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
| 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
| 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
| 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
| 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
| 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
| 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
| 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
| 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
| 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
| 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
| 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
| 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
| 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
| 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
</div>
### Deployment
For a concise deployment runbook (Docker image, startup command, and API tests),
see [`deployment.md`](./deployment.md).
### Generation Performance
The table below reports **2048x2048** image generation latency for
**SenseNova-U1-8B-MoT(NEO-Unify)**. Fill in measured numbers for each machine and deployment profile.
Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
<div align="center">
| GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
|:----:|:-----------------:|:-------------------------:|:----------------------:|
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
</div>
In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
### Cross-Model Speed Comparison
The table below compares the latency of a single diffusion step for
**2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
all measurements are taken on **H100**; the `TP2+CFG2` result uses
`2x H100`.
Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
<div align="center">
| Model | Understanding | Generation | Per-step latency (s/step) |
|:-----------------:|:-------------:|:----------:|:-------------------------:|
| Qwen-Image-2512 | 7B | 20B | 1.478 |
| Z-Image | 4B | 6B | 1.110 |
| GLM-Image | 9B | 7B | 1.394 |
| ERNIE-Image | 8B | 8B | 1.565 |
| LongCat-Image | 8B | 6B | 0.796 |
| SenseNova-U1-8B-MoT (Neo-Unify) | 8B | 8B | 0.312 |
| SenseNova-U1-8B-MoT (Neo-Unify, TP2+CFG2) | 8B | 8B | 0.158 |
</div>
# 推理基础设施
本文档介绍 **SenseNova-U1** 的推理基础设施,其构建于 **[LightLLM](https://github.com/ModelTC/lightllm)****[LightX2V](https://github.com/ModelTC/lightx2v)** 之上。
## 总览
SenseNova-U1 对外呈现为一个统一的多模态模型,但在实际生产中,理解与生成两条路径的执行形态差异显著:二者在调度策略、并行方案与资源配比上各有偏好,难以用同一套服务配置覆盖。若将它们耦合在单一的整体运行时中,各自的选择会被无谓地绑死,结果往往是两条路径都偏离各自的最优工作点。
为避免这种耦合,SenseNova-U1 采用**解耦**架构:
- **LightLLM** 负责理解、文本流式输出与控制流;
- **LightX2V** 负责图像生成。
两个引擎通过 pinned 共享内存与高性能传输内核交换生成状态,交接过程轻量,同时各自仍可按最优执行策略独立运行。
![LightLLM + LightX2V 解耦架构](./assets/lightllm_x2v.png)
该设计在生产中具有以下实际收益:
- 并行策略相互独立(例如理解侧 `TP=2`(张量并行=2),生成侧 `CFG=2`(CFG 并行=2)或 `SP=2`(序列并行=2));
- 资源配额相互独立(可分配不同的 GPU 数量与显存预算);
- 针对文本密集型与图像密集型流量可分别弹性扩缩;
- 运维隔离更清晰,性能调优也更简单。
同一架构可根据硬件预算与流量特征,在两种模式下部署:
- **Separate(分离部署)**:LightLLM 与 LightX2V 运行在不同的 GPU 组上;
- **Colocate(共置部署)**:LightLLM 与 LightX2V 作为独立进程运行在同一张 GPU 上。
生产环境中,`Separate` 是默认选择,其瓶颈定位更清晰,也便于独立扩缩;`Colocate` 则更适合快速验证、生成密集型场景或 GPU 数量有限的部署。
### NEO-Unify 多模态 Prefill 的注意力
NEO-Unify 的 prefill 注意力并非标准因果注意力:文本 token 仍保持因果,而图像 token 则会同时关注整个文本前缀以及完整的图像 span。为支持这种混合掩码,我们对栈内两套注意力实现都进行了改造——Triton 内核与官方 FlashAttention3 (FA3) 代码库。我们的 FA3 分支见 [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention)
具体做法是新增一个可选的 image_token_tag 参数,用以逐行调整掩码:文本行沿用标准因果掩码;图像行不再采用朴素的因果截断,而是被允许关注其之前的全部文本 token,以及所在图像 span 内的全部图像 token。
为在尽可能多的情况下保留因果三角形带来的加速,内核按 M-block 粒度进行判断——对当前 block 内的 image_token_tag 做 OR-reduce:若该 block 不含图像 token,则维持标准因果 K-range;若含有图像 token,则将 K-range 扩展至覆盖所需的图像 span。因此纯文本 block 仍走常规因果路径,只有真正相关的 block 才承担混合掩码引入的额外开销。
![NEO-Unify 多模态注意力行为](./assets/attn.png)
额外开销并不取决于某个固定比例,而是取决于图像 token 在序列中的分布,以及它们跨 M-block 边界的方式。当图像行仅集中在序列的某一部分时,额外开销也被相应地局部化。对于纯文本请求,image_token_tag 为空,内核即回落至原生 FA3,没有任何额外开销。
下表对比了 Neo 风格多模态 prefill 的两种实现:
- **Triton 实现**:更容易迁移到现有代码库,集成成本低、迭代更快;
- **FA3 实现**:在受支持的硬件上绝对性能更高。
<div align="center">
| batch | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | 加速比 (×) |
|:-------:|:-----------:|:---------------:|:-----------:|:--------:|:----------:|
| 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
| 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
| 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
| 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
| 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
| 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
| 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
| 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
| 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
| 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
| 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
| 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
| 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
| 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
| 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
</div>
### 部署
Docker 镜像、启动命令与 API 测试的简明操作手册,请参见 [`deployment_CN.md`](./deployment_CN.md)
### 生成性能
下表给出 **SenseNova-U1-8B-MoT(NEO-Unify)**
**2048x2048** 图像生成任务上的基准模版。列出了不同机型与部署配置下的实测数据。
注:TP2+CFG2 表示张量并行=2 + CFG 并行=2。
<div align="center">
| GPU | 部署配置 | 单步延迟 (s/step) | 端到端延迟 (s) |
|:----:|:--------:|:-----------------:|:--------------:|
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
</div>
在 NEO-Unify 中,生成阶段所用的 KV cache 由理解模块提供,因此 T2I(文生图)与 I2I(图像编辑)在运行时特征上几乎一致。为简洁起见,此处仅给出 T2I 的延迟数据。
### 跨模型速度对比
下表对比了在启用**CFG**条件下,生成 **2048x2048** 图像时单个 diffusion step 的延迟。除特别说明外,所有数据均在 **H100** 上测得;其中 `SenseNova-U1-8B-MoT (NEO-Unify, TP2+CFG2)` 使用的是 `2x H100`
注:TP2+CFG2 表示张量并行=2 + CFG 并行=2。
<div align="center">
| 模型 | 理解模块 | 生成模块 | 单步延迟 (s/step) |
|:-------------------------:|:--------:|:--------:|:-----------------:|
| Qwen-Image-2512 | 7B | 20B | 1.478 |
| Z-Image | 4B | 6B | 1.110 |
| GLM-Image | 9B | 7B | 1.394 |
| ERNIE-Image | 8B | 8B | 1.565 |
| LongCat-Image | 8B | 6B | 0.796 |
| SenseNova-U1-8B-MoT (NEO-Unify) | 8B | 8B | 0.312 |
| SenseNova-U1-8B-MoT (NEO-Unify, TP2+CFG2) | 8B | 8B | 0.158 |
</div>
# Installation (Transformers Inference)
This guide covers setting up the Python environment for running SenseNova-U1 locally with the `transformers` backend.
> **Software versions:** Python 3.11, torch 2.8, CUDA 12.8 (cu128). Update `pyproject.toml` index URLs if your driver requires a different CUDA version.
We recommend [**uv**](https://docs.astral.sh/uv/) to manage the Python environment.
> uv installation guide: <https://docs.astral.sh/uv/getting-started/installation/>
## 1. Clone the repository
```bash
git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1
```
## 2. Install dependencies with uv
```bash
uv sync
source .venv/bin/activate
```
The `sensenova_u1` package is installed in editable mode, so the canonical [NEO-Unify model](../src/sensenova_u1/models/neo_unify/) is automatically registered with `transformers.Auto*` at import time.
> **Older NVIDIA drivers:** the default index is CUDA 12.8. If your driver
> does not support cu128, change `[tool.uv.sources]` / `[[tool.uv.index]]`
> in `pyproject.toml` to e.g. `https://download.pytorch.org/whl/cu126` (and
> adjust the pinned torch / torchvision versions accordingly) before
> running `uv sync`.
## Optional: flash-attn
`flash-attn` is declared as an optional extra;
without it the model transparently falls back to torch SDPA;
once flash-attn is importable the runtime picks it automatically (`--attn_backend auto`).
```bash
# (a) Build from source via PyPI
uv sync --extra flash
# (b) Install a prebuilt CUDA wheel matching your torch + Python
uv pip install /path/to/flash_attn-2.8.3+cu12torch28cxx11abitrue-cp311-cp311-*.whl
```
# 安装指南(Transformers 推理)
本指南介绍如何搭建 Python 环境,以使用 `transformers` 后端在本地运行 SenseNova-U1。
> **软件版本:** Python 3.11、torch 2.8、CUDA 12.8(cu128)。如果本机驱动需要其他 CUDA 版本,请相应修改 `pyproject.toml` 中的 index URL。
我们推荐使用 [**uv**](https://docs.astral.sh/uv/) 管理 Python 环境。
> uv 安装指南:<https://docs.astral.sh/uv/getting-started/installation/>
## 1. 克隆仓库
```bash
git clone https://github.com/OpenSenseNova/SenseNova-U1.git
cd SenseNova-U1
```
## 2. 使用 uv 安装依赖
```bash
uv sync
source .venv/bin/activate
```
`sensenova_u1` 会以可编辑模式安装,因此在 import 时,标准的 [NEO-Unify 模型](../src/sensenova_u1/models/neo_unify/) 会自动注册到 `transformers.Auto*` 接口。
> **较旧的 NVIDIA 驱动:** 默认 index 对应 CUDA 12.8。若驱动不支持 cu128,请先将
> `pyproject.toml` 中的 `[tool.uv.sources]` / `[[tool.uv.index]]` 改为例如
> `https://download.pytorch.org/whl/cu126`(并同步调整 torch / torchvision 的固定版本),
> 再执行 `uv sync`。
## 可选:flash-attn
`flash-attn` 以可选依赖(extra)的形式提供:未安装时模型会自动回退到 torch SDPA;一旦可以 import flash-attn,运行时就会自动启用(`--attn_backend auto`)。
```bash
# (a) 通过 PyPI 从源码编译
uv sync --extra flash
# (b) 安装与当前 torch + Python 匹配的预编译 CUDA wheel
uv pip install /path/to/flash_attn-2.8.3+cu12torch28cxx11abitrue-cp311-cp311-*.whl
```
# Parameter Breakdown
`SenseNova-U1-8B-MoT` contains roughly **8B understanding parameters** and
**8B generation parameters**. To avoid confusion caused by the naming and to
present the architecture more accurately, we provide a small inspection
script that parses parameter names of the loaded checkpoint and reports a
detailed parameter breakdown.
## Run the script
```bash
python scripts/inspect_model_params.py \
--model_path sensenova/SenseNova-U1-8B-MoT
```
Useful argments:
- `--dtype {float32,float16,bfloat16}` (default: `bfloat16`) — load dtype. It
does **not** affect parameter counts; it only affects the reported `memory`
column, since each `bf16/fp16` element occupies 2 bytes versus 4 bytes for
`fp32`.
- `--show_groups <name1,name2>` (default: `shared`) — list member parameters
of the specified groups. Use `all` for every group, or an empty string to
disable.
- `--custom_groups_json <path>` — override the default grouping rules with a
JSON file of the form `{"group_name": ["prefix1", "prefix2"]}`.
## Example output
```text
Model: sensenova/SenseNova-U1-8B-MoT
Load dtype: bfloat16
Total params: 17.552B
Total memory: 35.105GB (bfloat16)
---------------------------------------------------------------------
group params memory (bfloat16) ratio
---------------------------------------------------------------------
generation_transformer 8.186B 16.373GB 46.64%
understanding_transformer 8.121B 16.243GB 46.27%
shared 1.245B 2.489GB 7.09%
---------------------------------------------------------------------
Pathway breakdown (shared counted in both):
---------------------------------------------------------------------
pathway params memory (bfloat16) ratio
---------------------------------------------------------------------
understanding pathway 9.366B 18.732GB 53.36%
generation pathway 9.431B 18.862GB 53.73%
---------------------------------------------------------------------
Members of group 'shared' (2 params, 1.245B total, 2.489GB @ bfloat16)
---------------------------------------------------------------------
param name numel dtype
---------------------------------------------------------------------
language_model.model.embed_tokens.weight 622.330M bfloat16
language_model.lm_head.weight 622.330M bfloat16
```
## How to read it
### 1. Parameters (mutually exclusive, sums to 100%)
Each parameter is counted exactly once and assigned to one of three groups
based on its module path:
- `understanding_transformer`**8.12B (46%)** — vision und.
(`vision_model.*`) plus the LLM expert without `_mot_gen` suffix
(`language_model.*` minus the generation expert and the shared text I/O).
- `generation_transformer`**8.19B (47%)** — generation-side modules
(`fm_modules.*`: vision gen., flow-matching head, timestep / noise
embedders) plus the LLM expert with `_mot_gen` suffix
(`language_model.*` containing `_mot_gen`).
- `shared`**1.25B (7%)** — text-token I/O reused by both pathways:
`language_model.model.embed_tokens` and `language_model.lm_head`.
### 2. Pathway coverage (forward activations, ratios sum to >100%)
A *pathway* sums the parameters that are actually traversed during the forward pass of one task.
Because both tasks reuse the `shared` group, the ratios overlap and add up to more than 100%.
- **Understanding pathway**`understanding_transformer + shared`**9.37B (53%)**.
Image goes through `vision_model` → tokens go through `embed_tokens`
the LLM runs on the `non-_mot_gen` expert → `lm_head` produces text logits.
- **Generation pathway** (single-turn thinking interleave) ≈
`generation_transformer + shared`**9.43B (54%)**.
The condition image goes through `fm_modules.vision_model_mot_gen`, while
the text prompt goes through `embed_tokens` → the LLM runs on the
`_mot_gen` expert → text reasoning is produced via `lm_head` and
the image is decoded via `fm_modules.fm_head`.
### Why `embed_tokens` and `lm_head` are "shared", not "understanding-only"
`embed_tokens` is needed by every text token and is therefore obviously
shared. `lm_head` is also exercised by the generation pathway in some scenarios,
e.g., t2i-reasoning runs a thinking phase that emits text tokens **before** any image token is produced,
so `lm_head` is on the critical path of both pathways — hence the "shared" label.
# 模型参数分解
`SenseNova-U1-8B-MoT` 包括约 **8B 理解参数****8B 生成参数**。为了避免命名带来的对参数量的误解,同时更好呈现模型的结构,我们提供了一个查看模型参数的脚本,可以根据模型中参数名称的解析,列出详细的模型参数统计。
## 运行脚本
```bash
python scripts/inspect_model_params.py \
--model_path sensenova/SenseNova-U1-8B-MoT
```
常用参数:
- `--dtype {float32,float16,bfloat16}`(默认:`bfloat16`)—— 加载精度。它**不影响**参数数量,只影响 `memory` 一列:`bf16/fp16` 每个元素 2 字节,`fp32` 每个 4 字节。
- `--show_groups <name1,name2>`(默认:`shared`)—— 列出指定组里的具体参数。可填 `all` 列出所有组,填空字符串关闭。
- `--custom_groups_json <path>`—— 用形如 `{"group_name": ["prefix1", "prefix2"]}` 的 JSON 文件覆盖默认的分组规则。
## 输出示例
```text
Model: sensenova/SenseNova-U1-8B-MoT
Load dtype: bfloat16
Total params: 17.552B
Total memory: 35.105GB (bfloat16)
---------------------------------------------------------------------
group params memory (bfloat16) ratio
---------------------------------------------------------------------
generation_transformer 8.186B 16.373GB 46.64%
understanding_transformer 8.121B 16.243GB 46.27%
shared 1.245B 2.489GB 7.09%
---------------------------------------------------------------------
Pathway breakdown (shared counted in both):
---------------------------------------------------------------------
pathway params memory (bfloat16) ratio
---------------------------------------------------------------------
understanding pathway 9.366B 18.732GB 53.36%
generation pathway 9.431B 18.862GB 53.73%
---------------------------------------------------------------------
Members of group 'shared' (2 params, 1.245B total, 2.489GB @ bfloat16)
---------------------------------------------------------------------
param name numel dtype
---------------------------------------------------------------------
language_model.model.embed_tokens.weight 622.330M bfloat16
language_model.lm_head.weight 622.330M bfloat16
```
## 如何理解输出
### 1. 模型的参数量(互斥归属,加起来 100%)
第一段表示**模型的参数量**:每个参数在物理上只算一次,归到三组之一:
- `understanding_transformer`**8.12B (46%)**:理解侧的视觉 (`vision_model.*`) 加上 LLM 中**不带** `_mot_gen` 后缀的理解专家分支(`language_model.*` 去掉 `_mot_gen` 副本和 `shared` 文本 I/O)。
- `generation_transformer`**8.19B (47%)**:生成侧模块 (`fm_modules.*`:生成的视觉、flow-matching head、timestep / noise embedders) 加上 LLM 中**带** `_mot_gen` 后缀的生成专家分支(`language_model.*` 中包含 `_mot_gen` 的部分)。
- `shared`**1.25B (7%)**:两条 pathway 都会复用的文本 token 输入输出,即 `language_model.model.embed_tokens``language_model.lm_head`
### 2. 模型 forward 时经过的参数量(pathway 覆盖,加起来 >100%)
第一段反映"参数如何存储";**pathway** 反映"跑某个任务时前向实际经过的参数集合"。由于两条 pathway 都会经过 `shared` 部分,所以两者占比相加会超过 100%。
- **理解 pathway**`understanding_transformer + shared`**9.37B (53%)**
图像经过 `vision_model` → token 序列经过 `embed_tokens` → LLM 走非 `_mot_gen` 分支 → `lm_head` 输出 logits
- **生成 pathway**(以较复杂的单轮 thinking interleave 为例)≈ `generation_transformer + shared`**9.43B (54%)**
输入图像经过 `fm_modules.vision_model_mot_gen`,文本 prompt 经过 `embed_tokens` → LLM 走 `_mot_gen` 分支 → 经过 `lm_head` 输出 thinking 文本,再经过 `fm_modules.fm_head` 输出图像
### 为什么 `embed_tokens` 与 `lm_head` 算 `shared`,而不是仅属于"理解"
`embed_tokens` 是任何文本 token 的入口,自然是两条 pathway 共享的。`lm_head` 在一些情况下同样会被生成 pathway 调用,例如:t2i-reasoning 在出图前会先经过 thinking 阶段输出一段文本 token,这一阶段必须走 `lm_head`,因此它实际上也是两条 pathway 共用的关键模块——这正是把它归为 `shared` 的原因。
# Prompt Enhancement for SenseNova-U1
> Short user prompts — especially for **infographic** generation —
> often under-constrain the image model. Running the raw prompt through a
> strong LLM enhancer first consistently lifts structure, typography,
> information density, and "brief-readability" of the final image. This
> document describes how to turn it on, which upstream LLMs we recommend,
> and what the tradeoffs look like.
## 1. When to use
Use `--enhance` when:
- The user prompt is short or only names a topic (e.g. `"A chart about AI hardware in 2026"`).
- You are generating for demo / deck / poster use and can afford one extra
LLM round-trip before the T2I call.
Skip `--enhance` when:
- The user already supplies a long, structured, production-ready prompt.
- Latency or third-party API cost is the primary concern.
## 2. How it works
```
user prompt ──► LLM (system prompt = infographic expander) ──► expanded prompt ──► SenseNova-U1
```
## 3. Configuration
All configuration is environment-variable based so the same script can
switch backends without code changes.
| Env var | Default | Purpose |
| :------ | :------ | :------ |
| `U1_ENHANCE_BACKEND` | `chat_completions` | `chat_completions` (OpenAI-compatible) or `anthropic` |
| `U1_ENHANCE_ENDPOINT` | Gemini OpenAI-compat URL | Full `/chat/completions` or `/v1/messages` URL |
| `U1_ENHANCE_MODEL` | `gemini-3.1-pro` | Model name string sent in the request body |
| `U1_ENHANCE_API_KEY` | _unset_ | Bearer token (required) |
First, create a `.env` file and populate it with the four required parameters. Then just add `--enhance` to your `examples/t2i/inference.py` command line.
Add `--print_enhance` to echo the original + enhanced prompt for
debugging.
To use **SenseNova 6.7 Flash-Lite** as the enhancer, get an API key from
[SenseNova Console · token-plan](https://platform.sensenova.cn/token-plan),
then set:
```bash
U1_ENHANCE_BACKEND=chat_completions
U1_ENHANCE_ENDPOINT=https://token.sensenova.cn/v1/chat/completions
U1_ENHANCE_MODEL=sensenova-6.7-flash-lite
U1_ENHANCE_API_KEY=<your SenseNova API key>
```
### 3.1 Recommended backends
| Model | Backend | Endpoint template | Notes |
| :---- | :------ | :---------------- | :---- |
| **Gemini 3.1 Pro** (Default) | `chat_completions` | `https://generativelanguage.googleapis.com/v1beta/openai/chat/completions` | Best overall infographic quality in our internal bench. Excellent at structured / hierarchical content. |
| SenseNova 6.7 Flash-Lite | `chat_completions` | `https://token.sensenova.cn/v1/chat/completions` | Near Gemini 3.1 Pro quality on Chinese content at lower per-token cost, preferred for production. |
| Anthropic Claude (Sonnet/Opus) | `anthropic` | `https://api.anthropic.com/v1/messages` | Strong typography discipline, slightly less "information-dense" out of the box. |
| Kimi 2.5 | `chat_completions` | `https://api.moonshot.cn/v1/chat/completions` | Good Chinese enhancements, weaker for English-dense infographics in our runs. |
| Gemini 3.1 Flash-Lite (Third-party service) | `chat_completions` | `https://aigateway.edgecloudapp.com/v1/f194fd69361cd590f1fa136c9c90eca1/senseai` | The overall quality of the information chart is high and its generation speed is fast. |
| Kimi 2.5/Qwen3.6-Plus (Third-party service) | `chat_completions` | `https://coding.dashscope.aliyuncs.com/v1/chat/completions` | Good Chinese enhancements. Different models can be flexibly selected. |
## 4. Qualitative comparison
> The table below will be populated with side-by-side samples from the same
> handful of base prompts, rendered at `2048×2048` with identical sampler
> knobs. PRs with new backends welcome.
| Base prompt | No enhance | Gemini 3.1 Pro | SenseNova | Qwen3.6-Plus | Kimi 2.5 |
| :---------- | :------------- | :------------- | :------------- | :------------- | :------------- |
| 生成一副西红柿炒鸡蛋的中文教程图 | <img src="assets/showcases/prompt_enhancement/case1.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_kimi_enhanced.webp" width="150"> |
| 生成一张介绍乒乓球比赛规则的图片 | <img src="assets/showcases/prompt_enhancement/case2.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_kimi_enhanced.webp" width="150"> |
| Popularizing the importance of three meals a day | <img src="assets/showcases/prompt_enhancement/case3.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_kimi_enhanced.webp" width="150"> |
| <details><summary>点击查看详细 Prompt</summary>这张信息图的标题是“猫咪与狗狗的终极对决”,采用了日系极致可爱与强烈色彩对比的插画风格。整体布局为左右对称的双栏对比结构,背景是带有细腻水彩纸纹理的米白色。画面通过色彩进行强烈的视觉分区,左半部分背景叠加了浅薄荷绿色的半透明波点图案,右半部分背景叠加了暖珊瑚粉色的对角线斜纹图案。长宽比为16:9。\n\n画面的正上方居中位置,使用超大号的粗体圆润无衬线字体写着主标题“猫咪与狗狗的终极对决”。主标题下方,使用稍小字号的深灰色黑体字写着副标题“毛孩子性格与生活方式指南”。在副标题的两侧,分别画着一个带有粉色肉垫的猫爪印图案和一个带有灰色指甲的狗爪印图案。\n\n在画面的正中央垂直方向,有一条由明黄色虚线构成的中轴线,将画面完美切割为左右两部分。中轴线的正中央,放置着一个带有爆炸星芒边缘的亮橙色圆形徽章,徽章内部用夸张的粗体等宽英文字母写着“VS”。\n\n画面左侧是猫咪的专属区域。顶部有一幅精美的插画:一只拥有大眼睛、脸颊红润的胖乎乎英国短毛猫,头顶带着一个小皇冠。插画下方用深绿色的粗体字写着“傲娇猫星人”。向下延伸,有三个垂直排列的信息模块。第一个模块中,画着一只蜷缩在原木高书架顶层熟睡的橘猫,旁边紧挨着文字“独立自主:每天需要16小时睡眠”。第二个模块中,画着一个印有小鱼骨头图案的浅蓝色陶瓷碗,碗里装满新鲜的生鱼片和鸡肉块,碗的右侧写着“纯肉食动物:需要高蛋白”。第三个模块中,画着一个半开的棕色纸箱,纸箱缝隙里露出一双发光的猫眼,旁边写着“暗中观察:喜欢狭小隐蔽的空间”。在左侧的最底部,有一个带边框的提示框,里面用倾斜的黑体字写着“专家提示:给猫咪充足的私人空间”。\n\n画面右侧是狗狗的专属区域。顶部有一幅生动的插画:一只吐着舌头、耳朵飞扬的金色寻回犬,脖子上戴着红色的波点项圈。插画下方用深红色的粗体字写着“热情汪星人”。向下延伸,同样有三个垂直排列的信息模块,与左侧保持完美的水平对齐。第一个模块中,画着一只前爪腾空、嘴里叼着绿色飞盘的边境牧羊犬,旁边紧挨着文字“社交达人:需要户外互动与奔跑”。第二个模块中,画着一个不锈钢宠物碗,里面装着混合了骨头形状饼干、胡萝卜丁和肉粒的狗粮,碗的左侧写着“杂食动物:营养均衡最重要”。第三个模块中,画着一只站立在后腿上、用双爪抱着人类大腿的小型贵宾犬,旁边写着“随时求抱抱:极度依赖主人的陪伴”。在右侧的最底部,有一个与左侧对称的提示框,里面用倾斜的黑体字写着“专家提示:保证每日充足的户外运动”。\n\n在画面的正下方,跨越左右两个区域,有一个淡黄色的宽大横幅。横幅内部用醒目的深藏青色粗体字写着“结论:无论性格如何,都是我们的完美伴侣!”横幅两端分别画着一颗跳动的红色爱心图案。整个画面信息密度极高,文字排版层次分明,色彩对比强烈且极具亲和力,所有元素均清晰可见且无重叠。图像的整体宽高比设定为9:16。</details> | <img src="assets/showcases/prompt_enhancement/case4.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_kimi_enhanced.webp" width="150"> |
# SenseNova-U1 提示词增强
> 简短的用户提示词——尤其是用于**信息图**生成时——通常会对图像模型限制不足。先将原始提示词通过强力 LLM 增强器处理,能够持续提升最终图像的结构性、排版质量、信息密度和"简报可读性"。本文档介绍如何开启该功能、我们推荐的上游 LLM,以及使用过程中需要权衡的因素。
## 1. 使用时机
在以下情况使用 `--enhance`
- 用户提示词较短,或仅指明了一个主题(例如 `"2026年人工智能硬件图表"`)。
- 为演示文稿/幻灯片/海报生成素材,并且在调用文生图(T2I)模型之前,可以接受增加一轮额外的 LLM 交互开销。
在以下情况跳过 `--enhance`
- 用户已提供一段长篇、结构化且可直接用于生产的提示词。
- 延迟或第三方 API 成本是首要考量因素。
## 2. 工作原理
```
user prompt ──► LLM (system prompt = infographic expander) ──► expanded prompt ──► SenseNova-U1
```
## 3. 配置
所有配置均基于环境变量,因此同一脚本无需修改代码即可切换后端。
| 环境变量 | 默认值 | 用途 |
| :------ | :------ | :------ |
| `U1_ENHANCE_BACKEND` | `chat_completions` | `chat_completions`(兼容 OpenAI)或 `anthropic` |
| `U1_ENHANCE_ENDPOINT` | Gemini OpenAI 兼容 URL | 完整的 `/chat/completions``/v1/messages` URL |
| `U1_ENHANCE_MODEL` | `gemini-3.1-pro` | 请求体中发送的模型名称字符串 |
| `U1_ENHANCE_API_KEY` | xxx | Bearer token(必填) |
首先创建 `.env` 文件并填入四个必要参数,然后在 `examples/t2i/inference.py` 命令行中添加 `--enhance` 即可。
添加 `--print_enhance` 可显示原始提示词和增强后的提示词,便于调试。
如需使用 **SenseNova 6.7 Flash-Lite** 作为增强器,请从
[SenseNova 控制台 · Token Plan](https://platform.sensenova.cn/token-plan) 获取 API 密钥,然后设置:
```bash
U1_ENHANCE_BACKEND=chat_completions
U1_ENHANCE_ENDPOINT=https://token.sensenova.cn/v1/chat/completions
U1_ENHANCE_MODEL=sensenova-6.7-flash-lite
U1_ENHANCE_API_KEY=<your SenseNova API key>
```
### 3.1 推荐后端
| 模型 | 后端 | 端点模板 | 备注 |
| :---- | :------ | :---------------- | :---- |
| **Gemini 3.1 Pro**(默认) | `chat_completions` | `https://generativelanguage.googleapis.com/v1beta/openai/chat/completions` | 内部测评中信息图整体质量最佳,尤其擅长结构化、层级化内容。 |
| SenseNova 6.7 Flash-Lite | `chat_completions` | `https://token.sensenova.cn/v1/chat/completions` | 生成的中文质量接近 Gemini 3.1 Pro,单 token 成本更低,推荐用于生产环境。 |
| Anthropic Claude(Sonnet/Opus) | `anthropic` | `https://api.anthropic.com/v1/messages` | 排版规范性强,开箱即用时信息密度略低。 |
| Kimi 2.5 | `chat_completions` | `https://api.moonshot.cn/v1/chat/completions` | 中文增强效果好,测试显示生成以英文为主的信息图质量不高。 |
| Gemini 3.1 Flash-Lite(第三方服务) | `chat_completions` | `https://aigateway.edgecloudapp.com/v1/f194fd69361cd590f1fa136c9c90eca1/senseai` | 信息图整体质量高,生成速度快。 |
| Kimi 2.5/Qwen3.6-Plus(第三方服务) | `chat_completions` | `https://coding.dashscope.aliyuncs.com/v1/chat/completions` | 中文增强效果好,可灵活选择不同模型。 |
## 4. 效果对比
> 下表数据为使用相同基础提示词、分辨率及采样参数生成的图片。
| 基础提示词 | 无增强 | Gemini 3.1 Pro | SenseNova | Qwen3.6-Plus | Kimi 2.5 |
| :---------- | :------------- | :------------- | :------------- | :------------- | :------------- |
| 生成一副西红柿炒鸡蛋的中文教程图 | <img src="assets/showcases/prompt_enhancement/case1.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case1_kimi_enhanced.webp" width="150"> |
| 生成一张介绍乒乓球比赛规则的图片 | <img src="assets/showcases/prompt_enhancement/case2.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case2_kimi_enhanced.webp" width="150"> |
| Popularizing the importance of three meals a day | <img src="assets/showcases/prompt_enhancement/case3.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case3_kimi_enhanced.webp" width="150"> |
| <details><summary>点击查看详细 Prompt</summary>这张信息图的标题是“猫咪与狗狗的终极对决”,采用了日系极致可爱与强烈色彩对比的插画风格。整体布局为左右对称的双栏对比结构,背景是带有细腻水彩纸纹理的米白色。画面通过色彩进行强烈的视觉分区,左半部分背景叠加了浅薄荷绿色的半透明波点图案,右半部分背景叠加了暖珊瑚粉色的对角线斜纹图案。长宽比为16:9。\n\n画面的正上方居中位置,使用超大号的粗体圆润无衬线字体写着主标题“猫咪与狗狗的终极对决”。主标题下方,使用稍小字号的深灰色黑体字写着副标题“毛孩子性格与生活方式指南”。在副标题的两侧,分别画着一个带有粉色肉垫的猫爪印图案和一个带有灰色指甲的狗爪印图案。\n\n在画面的正中央垂直方向,有一条由明黄色虚线构成的中轴线,将画面完美切割为左右两部分。中轴线的正中央,放置着一个带有爆炸星芒边缘的亮橙色圆形徽章,徽章内部用夸张的粗体等宽英文字母写着“VS”。\n\n画面左侧是猫咪的专属区域。顶部有一幅精美的插画:一只拥有大眼睛、脸颊红润的胖乎乎英国短毛猫,头顶带着一个小皇冠。插画下方用深绿色的粗体字写着“傲娇猫星人”。向下延伸,有三个垂直排列的信息模块。第一个模块中,画着一只蜷缩在原木高书架顶层熟睡的橘猫,旁边紧挨着文字“独立自主:每天需要16小时睡眠”。第二个模块中,画着一个印有小鱼骨头图案的浅蓝色陶瓷碗,碗里装满新鲜的生鱼片和鸡肉块,碗的右侧写着“纯肉食动物:需要高蛋白”。第三个模块中,画着一个半开的棕色纸箱,纸箱缝隙里露出一双发光的猫眼,旁边写着“暗中观察:喜欢狭小隐蔽的空间”。在左侧的最底部,有一个带边框的提示框,里面用倾斜的黑体字写着“专家提示:给猫咪充足的私人空间”。\n\n画面右侧是狗狗的专属区域。顶部有一幅生动的插画:一只吐着舌头、耳朵飞扬的金色寻回犬,脖子上戴着红色的波点项圈。插画下方用深红色的粗体字写着“热情汪星人”。向下延伸,同样有三个垂直排列的信息模块,与左侧保持完美的水平对齐。第一个模块中,画着一只前爪腾空、嘴里叼着绿色飞盘的边境牧羊犬,旁边紧挨着文字“社交达人:需要户外互动与奔跑”。第二个模块中,画着一个不锈钢宠物碗,里面装着混合了骨头形状饼干、胡萝卜丁和肉粒的狗粮,碗的左侧写着“杂食动物:营养均衡最重要”。第三个模块中,画着一只站立在后腿上、用双爪抱着人类大腿的小型贵宾犬,旁边写着“随时求抱抱:极度依赖主人的陪伴”。在右侧的最底部,有一个与左侧对称的提示框,里面用倾斜的黑体字写着“专家提示:保证每日充足的户外运动”。\n\n在画面的正下方,跨越左右两个区域,有一个淡黄色的宽大横幅。横幅内部用醒目的深藏青色粗体字写着“结论:无论性格如何,都是我们的完美伴侣!”横幅两端分别画着一颗跳动的红色爱心图案。整个画面信息密度极高,文字排版层次分明,色彩对比强烈且极具亲和力,所有元素均清晰可见且无重叠。图像的整体宽高比设定为9:16。</details> | <img src="assets/showcases/prompt_enhancement/case4.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_gemini_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_sensenova_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_qwen_enhanced.webp" width="150"> | <img src="assets/showcases/prompt_enhancement/case4_kimi_enhanced.webp" width="150"> |
# SenseNova-U1 Showcases
[← Back to README](../README.md)
All samples below were generated by **SenseNova-U1** (see the main README
for the runnable commands). Images are stored as lossy WebP under
[`docs/assets/showcases/`](./assets/showcases/); click any thumbnail to
open the full-resolution render.
---
## Text-to-Image
The main table presents the complete n × 3 grid layouts, covering landscape, square, and portrait formats at different resolutions.
#### 🖼️ *Text-to-Image (General)*
Reproducible prompts are in
[`examples/t2i/data/samples.jsonl`](../examples/t2i/data/samples.jsonl).
| | | |
| :---: | :---: | :---: |
| [<img width="300" alt="t2i general dense face hd 07" src="./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
| [<img width="260" alt="t2i general face hd 13" src="./assets/showcases/t2i_general/1_1_face_hd_13.webp">](./assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./assets/showcases/t2i_general/1_1_face_hd_17.webp">](./assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
| [<img width="260" alt="t2i general landscape 06" src="./assets/showcases/t2i_general/1_1_landscape_06.webp">](./assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./assets/showcases/t2i_general/1_1_landscape_07.webp">](./assets/showcases/t2i_general/1_1_landscape_07.webp) |
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./assets/showcases/t2i_general/9_16_human_pose_11.webp">](./assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./assets/showcases/t2i_general/9_16_artistic_07.webp">](./assets/showcases/t2i_general/9_16_artistic_07.webp) |
| [<img width="200" alt="t2i general portrait text rendering 02" src="./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
#### 🖼️ *Text-to-Image (Reasoning)*
Reproducible prompts are in
[`examples/t2i/data/samples_reasoning.jsonl`](../examples/t2i/data/samples_reasoning.jsonl).
<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 50%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">The playful craft that embodies Russian cultural charm</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is the matryoshka, identified as a Russian craft. Essential modifiers include playful and Russian cultural charm. The relation is that the craft embodies the charm. 2. <b>Reasoning Process:</b> The prompt identifies the matryoshka, the iconic Russian wooden doll set. 3. <b>Establish the frame:</b> The frame captures the matryoshka set in the foreground. The composition focuses on the Russian craft to show the playful nature of the doll set. 4. <b>Set the lighting and color:</b> Lighting illuminates the matryoshka to reveal the Russian cultural charm. The color palette supports the playful craft aesthetic. 5. <b>Lock the style:</b> The style emphasizes the wooden nature of the Russian craft. The finish reflects the charm of the matryoshka. 6. <b>Explicit Prompt:</b> A set of three colorful, hand-painted wooden matryoshka dolls arranged by size on a rustic wooden table, bright floral patterns, soft natural light.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/1_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A gigantic bubble in the immediate foreground with a small town barely visible inside</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The prompt requires a gigantic bubble positioned in the immediate foreground. A small town must be visible inside the bubble. 2. <b>Reasoning Process:</b> The bubble acts as a transparent lens, potentially distorting the town's appearance due to its curvature. 3. <b>Establish the frame:</b> The shot scale is an extreme close-up, placing the bubble so close it fills the majority of the frame. The focus plane is set on the bubble's surface, rendering the background outside the town soft and out of focus. 4. <b>Set the lighting and color:</b> Natural daylight creates bright specular highlights on the upper surface of the bubble. The color palette is dominated by the cool transparency of the bubble and the muted tones of the town within. 5. <b>Lock the style:</b> The overall image style is photorealistic with a high level of detail on the bubble's texture. The rendering captures the glossy finish and physical properties of the bubble accurately. 6. <b>Explicit Prompt:</b> An extreme close-up of a massive, transparent soap bubble floating in the immediate foreground, with a tiny, detailed miniature town visible inside the curvature, refracted and magnified by the surface tension, soft natural daylight, dreamlike, high detail, photorealistic texture.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/3_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A chocolate bar left in direct sunlight, highlighting the state of the chocolate</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a chocolate bar positioned in direct sunlight. The focus is on the state of the chocolate, specifically how the heat affects it. 2. <b>Reasoning Process:</b> Heat causes chocolate to melt, losing its structured shape and becoming a viscous, glossy liquid. 3. <b>Establish the frame:</b> The composition is a close-up shot that fills the frame with the chocolate bar to emphasize detail. The angle is slightly elevated to show the top surface and the pooling liquid clearly. 4. <b>Build the environment:</b> The chocolate bar rests on a generic surface that supports the object without distracting from the main subject. The background is blurred to keep attention on the foreground elements and the chocolate. 5. <b>Set the lighting and color:</b> Direct sunlight creates bright highlights on the melting chocolate, emphasizing its glossy texture. The lighting is warm and intense, casting distinct shadows and illuminating the rich brown colors of the liquid. 6. <b>Explicit Prompt:</b> A close-up of a melting chocolate bar on a surface, with the edges losing their defined shape and pooling into a glossy, viscous puddle under the heat of the sun.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/6_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
#### 🖼️ *Text-to-Image (Infographics)*
Reproducible prompts are in
[`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl).
<table align="center">
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0004.webp"><img width="300" alt="t2i landscape 0001" src="./assets/showcases/t2i_infographic/0004.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0012.webp"><img width="300" alt="t2i landscape 0002" src="./assets/showcases/t2i_infographic/0012.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0005.webp"><img width="300" alt="t2i landscape 0003" src="./assets/showcases/t2i_infographic/0005.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0018.webp"><img width="300" alt="t2i landscape 0004" src="./assets/showcases/t2i_infographic/0018.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0024.webp"><img width="300" alt="t2i landscape 0005" src="./assets/showcases/t2i_infographic/0024.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0013.webp"><img width="300" alt="t2i landscape 0006" src="./assets/showcases/t2i_infographic/0013.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0006.webp"><img width="300" alt="t2i landscape 0007" src="./assets/showcases/t2i_infographic/0006.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0015.webp"><img width="300" alt="t2i landscape 0008" src="./assets/showcases/t2i_infographic/0015.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0025.webp"><img width="300" alt="t2i landscape 0009" src="./assets/showcases/t2i_infographic/0025.webp"></a></td>
</tr>
</table>
<table align="center">
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0000.webp"><img width="220" alt="t2i landscape 0010" src="./assets/showcases/t2i_infographic/0000.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0003.webp"><img width="220" alt="t2i landscape 0011" src="./assets/showcases/t2i_infographic/0003.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0001.webp"><img width="220" alt="t2i landscape 0012" src="./assets/showcases/t2i_infographic/0001.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0022.webp"><img width="220" alt="t2i landscape 0013" src="./assets/showcases/t2i_infographic/0022.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0016.webp"><img width="220" alt="t2i image 0014" src="./assets/showcases/t2i_infographic/0016.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0010.webp"><img width="220" alt="t2i image 0015" src="./assets/showcases/t2i_infographic/0010.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0007.webp"><img width="220" alt="t2i image 0016" src="./assets/showcases/t2i_infographic/0007.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0021.webp"><img width="220" alt="t2i image 0017" src="./assets/showcases/t2i_infographic/0021.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0009.webp"><img width="220" alt="t2i image 0018" src="./assets/showcases/t2i_infographic/0009.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0020.webp"><img width="220" alt="t2i image 0019" src="./assets/showcases/t2i_infographic/0020.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0008.webp"><img width="220" alt="t2i image 0020" src="./assets/showcases/t2i_infographic/0008.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0002.webp"><img width="220" alt="t2i image 0021" src="./assets/showcases/t2i_infographic/0002.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0011.webp"><img width="230" alt="t2i image 0022" src="./assets/showcases/t2i_infographic/0011.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0023.webp"><img width="230" alt="t2i image 0023" src="./assets/showcases/t2i_infographic/0023.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0027.webp"><img width="230" alt="t2i image 0024" src="./assets/showcases/t2i_infographic/0027.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0026.webp"><img width="230" alt="t2i image 0025" src="./assets/showcases/t2i_infographic/0026.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0029.webp"><img width="230" alt="t2i image 0022" src="./assets/showcases/t2i_infographic/0029.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0030.webp"><img width="230" alt="t2i image 0023" src="./assets/showcases/t2i_infographic/0030.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0031.webp"><img width="230" alt="t2i image 0024" src="./assets/showcases/t2i_infographic/0031.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0032.webp"><img width="230" alt="t2i image 0025" src="./assets/showcases/t2i_infographic/0032.webp"></a></td>
</tr>
</table>
## Image Editing
Side-by-side compare montages below show `input(s) | output`, with the
edit instruction rendered along the bottom. The same
unified model handles single-image attribute / style / relighting edits
and multi-reference (subject + accessory + pose) composition.
#### ✏️ *Image Editing (General)*
Reproducible prompts are in
[`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl).
| | |
| :---: | :---: |
| <div align="center"><a href="../examples/editing/data/images/1.webp"><img width="180" alt="editing input 1" src="../examples/editing/data/images/1.webp"></a> <a href="../docs/assets/showcases/editing/1_out.webp"><img width="180" alt="editing output 1" src="../docs/assets/showcases/editing/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <div align="center"><a href="../examples/editing/data/images/3.webp"><img width="180" alt="editing input 3" src="../examples/editing/data/images/3.webp"></a> <a href="../docs/assets/showcases/editing/3_out.webp"><img width="180" alt="editing output 3" src="../docs/assets/showcases/editing/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/2.webp"><img width="180" alt="editing input 2" src="../examples/editing/data/images/2.webp"></a> <a href="../docs/assets/showcases/editing/2_out.webp"><img width="180" alt="editing output 2" src="../docs/assets/showcases/editing/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <div align="center"><a href="../examples/editing/data/images/4.webp"><img width="180" alt="editing input 4" src="../examples/editing/data/images/4.webp"></a> <a href="../docs/assets/showcases/editing/4_out.webp"><img width="180" alt="editing output 4" src="../docs/assets/showcases/editing/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/5.webp"><img width="180" alt="editing input 5" src="../examples/editing/data/images/5.webp"></a> <a href="../docs/assets/showcases/editing/5_out.webp"><img width="180" alt="editing output 5" src="../docs/assets/showcases/editing/5_out.webp"></a><br><sub>Turn the image into an American comic style.</sub></div> | <div align="center"><a href="../examples/editing/data/images/8.webp"><img width="180" alt="editing input 8" src="../examples/editing/data/images/8.webp"></a> <a href="../docs/assets/showcases/editing/8_out.webp"><img width="180" alt="editing output 8" src="../docs/assets/showcases/editing/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/6.webp"><img width="180" alt="editing input 6" src="../examples/editing/data/images/6.webp"></a> <a href="../docs/assets/showcases/editing/6_out.webp"><img width="180" alt="editing output 6" src="../docs/assets/showcases/editing/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> | <div align="center"><a href="../examples/editing/data/images/7.webp"><img width="180" alt="editing input 7" src="../examples/editing/data/images/7.webp"></a> <a href="../docs/assets/showcases/editing/7_out.webp"><img width="180" alt="editing output 7" src="../docs/assets/showcases/editing/7_out.webp"></a><br><sub>Remove the person on the far right wearing a green skirt and a green top.</sub></div> |
#### ✏️ *Image Editing (Reasoning)*
Reproducible prompts are in
[`examples/editing/data/samples_reasoning.jsonl`](../examples/editing/data/samples_reasoning.jsonl).
<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 30%">Original Image</th>
<th style="width: 20%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like one hour later.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/034_temporal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like immediately after someone stands up from sitting on it for a long time.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag’s shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/036_causal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw an image showing the side view of the provided traffic cone.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 3D perspective view of a traffic cone. 2. <b>Instruction Understanding:</b> The instruction asks for a side view. 3. <b>Reasoning Process:</b> A side view of a standard traffic cone results in a triangular silhouette with a flat rectangular base. 4. <b>Expected Visual Changes:</b> The perspective is flattened into this 2D-like geometric profile. 5. <b>Elements to Preserve:</b> The cone's height and color should remain consistent with the original object. 6. <b>Explicit Edit Prompt:</b> Edit the perspective view into a flat side-profile silhouette of a triangle with a rectangular base, keeping the red color and proportions unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/039_spatial_reasoning_draw_an_image_showing_the_si_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Change the water to high-concentration saltwater</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/042_physics_change_the_water_to_high-con_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">What the fruit looks like when ripe in the picture</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/044_biology_what_the_fruit_looks_like_wh_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Correct the unreasonable part in the image.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a kettle pouring water onto a mug, but the stream is misaligned and missing the cup. 2. <b>Instruction Understanding:</b> The instruction asks to fix the physical inconsistency. 3. <b>Reasoning Process:</b> The water stream must be redirected to connect the spout to the mug, maintaining the trajectory of gravity. 4. <b>Expected Visual Changes:</b> The water stream will be redirected to connect the spout to the mug. 5. <b>Elements to Preserve:</b> The kettle, mug, and background must remain unchanged while the water path is corrected. 6. <b>Explicit Edit Prompt:</b> Draw a continuous water stream connecting the kettle spout to the mug, keeping the kettle, mug, and background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/046_anomaly_correction_correct_the_unreasonable_par_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Modify the matrix in the image to an upper triangular matrix</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 2x2 matrix with values 1, 2, 3, and 4. 2. <b>Instruction Understanding:</b> The instruction asks to convert this to an upper triangular matrix. 3. <b>Reasoning Process:</b> By definition, an upper triangular matrix has zeros below the main diagonal, so the entry '3' must be changed to '0' while keeping '1', '2', and '4' as they are, and this modification satisfies the mathematical property requested. 4. <b>Expected Visual Changes:</b> The entry '3' in the lower-left position will be changed to '0'. 5. <b>Elements to Preserve:</b> The grid lines, the matrix structure, and the other entries must remain unchanged. 6. <b>Explicit Edit Prompt:</b> Change the '3' in the lower-left position to '0', while keeping the matrix structure and other entries unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/047_mathematics_modify_the_matrix_in_the_ima_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
---
## Interleaved Generation
Each case below is a single rendered response from `model.interleave_gen`:
the model first runs a `<think>...</think>` reasoning block that produces
intermediate images, then emits the final interleaved text-and-image
answer.
#### ♻️ *Interleaved Generation (General)*
Reproducible prompts are in
[`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
All examples are generated with think-mode reasoning; the chain-of-thought is omitted in some cases for cleaner visualization.
| |
| :---: |
| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
| [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
| [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
#### ♻️ *Interleaved Generation (Reasoning)*
Reproducible prompts are in
[`examples/interleave/data/samples_reasoning.jsonl`](../examples/interleave/data/samples_reasoning.jsonl).
| |
| :---: |
| [<img alt="interleave case 05" src="./assets/showcases/interleave/reasoning.png">](./assets/showcases/interleave/reasoning.png) |
---
## Visual Understanding
General visual understanding across spatial reasoning, multi-image comparison, OCR, geometry, and knowledge-intensive QA:
#### 📝 *Visual Understanding (General)*
Reproducible prompts are in [`examples/vqa/data/samples.jsonl`](../examples/vqa/data/samples.jsonl).
| |
| :---: |
| [<img alt="vqa general cases" src="./assets/showcases/vqa/general_case_all.webp">](./assets/showcases/vqa/general_case_all.webp) |
#### 📝 *Visual Understanding (Agentic)*
| |
| :---: |
| [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |
| [<img alt="vqa agentic case 2" src="./assets/showcases/vqa/agentic_case_2.webp">](./assets/showcases/vqa/agentic_case_2.webp) |
# SenseNova-U1 效果展示
[← 返回 README](../README_CN.md)
以下所有样例均由 **SenseNova-U1** 生成(可运行命令详见主 README)。图像以有损 WebP 格式存放于 [`docs/assets/showcases/`](./assets/showcases/),点击任意缩略图即可查看完整分辨率的渲染结果。
---
## 文生图
主表格展示完整的 n × 3 网格布局,涵盖不同分辨率下的横版、方版和竖版格式。
#### 🖼️ *文生图(通用)*
可复现的 prompt 位于 [`examples/t2i/data/samples.jsonl`](../examples/t2i/data/samples.jsonl)
| | | |
| :---: | :---: | :---: |
| [<img width="300" alt="t2i general dense face hd 07" src="./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
| [<img width="260" alt="t2i general face hd 13" src="./assets/showcases/t2i_general/1_1_face_hd_13.webp">](./assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./assets/showcases/t2i_general/1_1_face_hd_17.webp">](./assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
| [<img width="260" alt="t2i general landscape 06" src="./assets/showcases/t2i_general/1_1_landscape_06.webp">](./assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./assets/showcases/t2i_general/1_1_landscape_07.webp">](./assets/showcases/t2i_general/1_1_landscape_07.webp) |
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./assets/showcases/t2i_general/9_16_human_pose_11.webp">](./assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./assets/showcases/t2i_general/9_16_artistic_07.webp">](./assets/showcases/t2i_general/9_16_artistic_07.webp) |
| [<img width="200" alt="t2i general portrait text rendering 02" src="./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
#### 🖼️ *文生图(推理)*
可复现的 prompt 位于 [`examples/t2i/data/samples_reasoning.jsonl`](../examples/t2i/data/samples_reasoning.jsonl)
<table>
<tr>
<th style="width: 20%">原始指令</th>
<th style="width: 50%">推理过程</th>
<th style="width: 30%">生成结果</th>
</tr>
<tr>
<td style="vertical-align: top;">The playful craft that embodies Russian cultural charm</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is the matryoshka, identified as a Russian craft. Essential modifiers include playful and Russian cultural charm. The relation is that the craft embodies the charm. 2. <b>Reasoning Process:</b> The prompt identifies the matryoshka, the iconic Russian wooden doll set. 3. <b>Establish the frame:</b> The frame captures the matryoshka set in the foreground. The composition focuses on the Russian craft to show the playful nature of the doll set. 4. <b>Set the lighting and color:</b> Lighting illuminates the matryoshka to reveal the Russian cultural charm. The color palette supports the playful craft aesthetic. 5. <b>Lock the style:</b> The style emphasizes the wooden nature of the Russian craft. The finish reflects the charm of the matryoshka. 6. <b>Explicit Prompt:</b> A set of three colorful, hand-painted wooden matryoshka dolls arranged by size on a rustic wooden table, bright floral patterns, soft natural light.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/1_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A gigantic bubble in the immediate foreground with a small town barely visible inside</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The prompt requires a gigantic bubble positioned in the immediate foreground. A small town must be visible inside the bubble. 2. <b>Reasoning Process:</b> The bubble acts as a transparent lens, potentially distorting the town's appearance due to its curvature. 3. <b>Establish the frame:</b> The shot scale is an extreme close-up, placing the bubble so close it fills the majority of the frame. The focus plane is set on the bubble's surface, rendering the background outside the town soft and out of focus. 4. <b>Set the lighting and color:</b> Natural daylight creates bright specular highlights on the upper surface of the bubble. The color palette is dominated by the cool transparency of the bubble and the muted tones of the town within. 5. <b>Lock the style:</b> The overall image style is photorealistic with a high level of detail on the bubble's texture. The rendering captures the glossy finish and physical properties of the bubble accurately. 6. <b>Explicit Prompt:</b> An extreme close-up of a massive, transparent soap bubble floating in the immediate foreground, with a tiny, detailed miniature town visible inside the curvature, refracted and magnified by the surface tension, soft natural daylight, dreamlike, high detail, photorealistic texture.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/3_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A chocolate bar left in direct sunlight, highlighting the state of the chocolate</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a chocolate bar positioned in direct sunlight. The focus is on the state of the chocolate, specifically how the heat affects it. 2. <b>Reasoning Process:</b> Heat causes chocolate to melt, losing its structured shape and becoming a viscous, glossy liquid. 3. <b>Establish the frame:</b> The composition is a close-up shot that fills the frame with the chocolate bar to emphasize detail. The angle is slightly elevated to show the top surface and the pooling liquid clearly. 4. <b>Build the environment:</b> The chocolate bar rests on a generic surface that supports the object without distracting from the main subject. The background is blurred to keep attention on the foreground elements and the chocolate. 5. <b>Set the lighting and color:</b> Direct sunlight creates bright highlights on the melting chocolate, emphasizing its glossy texture. The lighting is warm and intense, casting distinct shadows and illuminating the rich brown colors of the liquid. 6. <b>Explicit Prompt:</b> A close-up of a melting chocolate bar on a surface, with the edges losing their defined shape and pooling into a glossy, viscous puddle under the heat of the sun.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/6_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
#### 🖼️ *文生图(信息图)*
可复现的 prompt 位于 [`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl)
<table align="center">
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0004.webp"><img width="300" alt="t2i landscape 0001" src="./assets/showcases/t2i_infographic/0004.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0012.webp"><img width="300" alt="t2i landscape 0002" src="./assets/showcases/t2i_infographic/0012.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0005.webp"><img width="300" alt="t2i landscape 0003" src="./assets/showcases/t2i_infographic/0005.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0018.webp"><img width="300" alt="t2i landscape 0004" src="./assets/showcases/t2i_infographic/0018.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0024.webp"><img width="300" alt="t2i landscape 0005" src="./assets/showcases/t2i_infographic/0024.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0013.webp"><img width="300" alt="t2i landscape 0006" src="./assets/showcases/t2i_infographic/0013.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0006.webp"><img width="300" alt="t2i landscape 0007" src="./assets/showcases/t2i_infographic/0006.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0015.webp"><img width="300" alt="t2i landscape 0008" src="./assets/showcases/t2i_infographic/0015.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0025.webp"><img width="300" alt="t2i landscape 0009" src="./assets/showcases/t2i_infographic/0025.webp"></a></td>
</tr>
</table>
<table align="center">
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0000.webp"><img width="220" alt="t2i landscape 0010" src="./assets/showcases/t2i_infographic/0000.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0003.webp"><img width="220" alt="t2i landscape 0011" src="./assets/showcases/t2i_infographic/0003.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0001.webp"><img width="220" alt="t2i landscape 0012" src="./assets/showcases/t2i_infographic/0001.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0022.webp"><img width="220" alt="t2i landscape 0013" src="./assets/showcases/t2i_infographic/0022.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0016.webp"><img width="220" alt="t2i image 0014" src="./assets/showcases/t2i_infographic/0016.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0010.webp"><img width="220" alt="t2i image 0015" src="./assets/showcases/t2i_infographic/0010.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0007.webp"><img width="220" alt="t2i image 0016" src="./assets/showcases/t2i_infographic/0007.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0021.webp"><img width="220" alt="t2i image 0017" src="./assets/showcases/t2i_infographic/0021.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0009.webp"><img width="220" alt="t2i image 0018" src="./assets/showcases/t2i_infographic/0009.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0020.webp"><img width="220" alt="t2i image 0019" src="./assets/showcases/t2i_infographic/0020.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0008.webp"><img width="220" alt="t2i image 0020" src="./assets/showcases/t2i_infographic/0008.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0002.webp"><img width="220" alt="t2i image 0021" src="./assets/showcases/t2i_infographic/0002.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0011.webp"><img width="230" alt="t2i image 0022" src="./assets/showcases/t2i_infographic/0011.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0023.webp"><img width="230" alt="t2i image 0023" src="./assets/showcases/t2i_infographic/0023.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0027.webp"><img width="230" alt="t2i image 0024" src="./assets/showcases/t2i_infographic/0027.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0026.webp"><img width="230" alt="t2i image 0025" src="./assets/showcases/t2i_infographic/0026.webp"></a></td>
</tr>
<tr>
<td align="center"><a href="./assets/showcases/t2i_infographic/0029.webp"><img width="230" alt="t2i image 0022" src="./assets/showcases/t2i_infographic/0029.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0030.webp"><img width="230" alt="t2i image 0023" src="./assets/showcases/t2i_infographic/0030.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0031.webp"><img width="230" alt="t2i image 0024" src="./assets/showcases/t2i_infographic/0031.webp"></a></td>
<td align="center"><a href="./assets/showcases/t2i_infographic/0032.webp"><img width="230" alt="t2i image 0025" src="./assets/showcases/t2i_infographic/0032.webp"></a></td>
</tr>
</table>
## 图像编辑
下方的并排对比图展示 `输入 | 输出`,编辑指令渲染在每个对比图的底部。同一个统一模型既能完成单图的属性、风格及重光照编辑,也能处理多参考图(主体 + 配饰 + 姿态)的合成任务。
#### ✏️ *图像编辑(通用)*
可复现的 prompt 位于 [`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl)
| | |
| :---: | :---: |
| <div align="center"><a href="../examples/editing/data/images/1.webp"><img width="180" alt="editing input 1" src="../examples/editing/data/images/1.webp"></a> <a href="../docs/assets/showcases/editing/1_out.webp"><img width="180" alt="editing output 1" src="../docs/assets/showcases/editing/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <div align="center"><a href="../examples/editing/data/images/3.webp"><img width="180" alt="editing input 3" src="../examples/editing/data/images/3.webp"></a> <a href="../docs/assets/showcases/editing/3_out.webp"><img width="180" alt="editing output 3" src="../docs/assets/showcases/editing/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/2.webp"><img width="180" alt="editing input 2" src="../examples/editing/data/images/2.webp"></a> <a href="../docs/assets/showcases/editing/2_out.webp"><img width="180" alt="editing output 2" src="../docs/assets/showcases/editing/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <div align="center"><a href="../examples/editing/data/images/4.webp"><img width="180" alt="editing input 4" src="../examples/editing/data/images/4.webp"></a> <a href="../docs/assets/showcases/editing/4_out.webp"><img width="180" alt="editing output 4" src="../docs/assets/showcases/editing/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/5.webp"><img width="180" alt="editing input 5" src="../examples/editing/data/images/5.webp"></a> <a href="../docs/assets/showcases/editing/5_out.webp"><img width="180" alt="editing output 5" src="../docs/assets/showcases/editing/5_out.webp"></a><br><sub>Turn the image into an American comic style.</sub></div> | <div align="center"><a href="../examples/editing/data/images/8.webp"><img width="180" alt="editing input 8" src="../examples/editing/data/images/8.webp"></a> <a href="../docs/assets/showcases/editing/8_out.webp"><img width="180" alt="editing output 8" src="../docs/assets/showcases/editing/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/6.webp"><img width="180" alt="editing input 6" src="../examples/editing/data/images/6.webp"></a> <a href="../docs/assets/showcases/editing/6_out.webp"><img width="180" alt="editing output 6" src="../docs/assets/showcases/editing/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> | <div align="center"><a href="../examples/editing/data/images/7.webp"><img width="180" alt="editing input 7" src="../examples/editing/data/images/7.webp"></a> <a href="../docs/assets/showcases/editing/7_out.webp"><img width="180" alt="editing output 7" src="../docs/assets/showcases/editing/7_out.webp"></a><br><sub>Remove the person on the far right wearing a green skirt and a green top.</sub></div> |
#### ✏️ *图像编辑(推理)*
可复现的 prompt 位于 [`examples/editing/data/samples_reasoning.jsonl`](../examples/editing/data/samples_reasoning.jsonl)
<table>
<tr>
<th style="width: 20%">编辑指令</th>
<th style="width: 30%">原始图像</th>
<th style="width: 20%">推理过程</th>
<th style="width: 30%">编辑结果</th>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like one hour later.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/034_temporal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like immediately after someone stands up from sitting on it for a long time.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag's shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/036_causal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw an image showing the side view of the provided traffic cone.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 3D perspective view of a traffic cone. 2. <b>Instruction Understanding:</b> The instruction asks for a side view. 3. <b>Reasoning Process:</b> A side view of a standard traffic cone results in a triangular silhouette with a flat rectangular base. 4. <b>Expected Visual Changes:</b> The perspective is flattened into this 2D-like geometric profile. 5. <b>Elements to Preserve:</b> The cone's height and color should remain consistent with the original object. 6. <b>Explicit Edit Prompt:</b> Edit the perspective view into a flat side-profile silhouette of a triangle with a rectangular base, keeping the red color and proportions unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/039_spatial_reasoning_draw_an_image_showing_the_si_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Change the water to high-concentration saltwater</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/042_physics_change_the_water_to_high-con_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">What the fruit looks like when ripe in the picture</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/044_biology_what_the_fruit_looks_like_wh_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Correct the unreasonable part in the image.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a kettle pouring water onto a mug, but the stream is misaligned and missing the cup. 2. <b>Instruction Understanding:</b> The instruction asks to fix the physical inconsistency. 3. <b>Reasoning Process:</b> The water stream must be redirected to connect the spout to the mug, maintaining the trajectory of gravity. 4. <b>Expected Visual Changes:</b> The water stream will be redirected to connect the spout to the mug. 5. <b>Elements to Preserve:</b> The kettle, mug, and background must remain unchanged while the water path is corrected. 6. <b>Explicit Edit Prompt:</b> Draw a continuous water stream connecting the kettle spout to the mug, keeping the kettle, mug, and background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/046_anomaly_correction_correct_the_unreasonable_par_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Modify the matrix in the image to an upper triangular matrix</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 2x2 matrix with values 1, 2, 3, and 4. 2. <b>Instruction Understanding:</b> The instruction asks to convert this to an upper triangular matrix. 3. <b>Reasoning Process:</b> By definition, an upper triangular matrix has zeros below the main diagonal, so the entry '3' must be changed to '0' while keeping '1', '2', and '4' as they are, and this modification satisfies the mathematical property requested. 4. <b>Expected Visual Changes:</b> The entry '3' in the lower-left position will be changed to '0'. 5. <b>Elements to Preserve:</b> The grid lines, the matrix structure, and the other entries must remain unchanged. 6. <b>Explicit Edit Prompt:</b> Change the '3' in the lower-left position to '0', while keeping the matrix structure and other entries unchanged.</div></td>
<td style="vertical-align: top;"><img src="../docs/assets/showcases/editing/047_mathematics_modify_the_matrix_in_the_ima_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
---
## 图文交错生成
下方每个案例均为 `model.interleave_gen` 的一次完整响应:模型先在 `<think>...</think>` 推理块中生成若干中间图像,再输出最终图文交错的答案。
可复现的 prompt 位于 [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl)。
所有示例均带 think 推理生成;为可视化简洁,部分示例未展示思维链。
| |
| :---: |
| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
| [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
| [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
#### ♻️ *图文交错生成(推理)*
| |
| :---: |
| [<img alt="interleave reasoning case 2" src="./assets/showcases/interleave/reasoning.png">](./assets/showcases/interleave/reasoning.png) |
---
## 视觉理解
涵盖空间推理、多图比较、OCR、几何以及知识密集型问答的通用视觉理解能力:
可复现的 prompt 位于 [`examples/vqa/data/samples.jsonl`](../examples/vqa/data/samples.jsonl)
| |
| :---: |
| [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |
| [<img alt="vqa agentic case 2" src="./assets/showcases/vqa/agentic_case_2.webp">](./assets/showcases/vqa/agentic_case_2.webp) |
| [<img alt="vqa general cases" src="./assets/showcases/vqa/general_case_all.webp">](./assets/showcases/vqa/general_case_all.webp) |
# SenseNova-U1-8B-MoT-Infographic 📊
**SenseNova-U1-8B-MoT-Infographic** is built upon the base U1-8B-MoT model with an extended MT training phase. During both the MT and SFT stages, we optimized the data distribution between visual understanding and generation tasks. In the RL stage, we further refined the reward recipe to reduce the occurrence of unintended black backgrounds in generated infographics.
- **Benchmark Performance:** Compared with the base **SenseNova-U1-8B-MoT** model, BizGenEval hard/easy increased from **39.8 / 61.1** to **46.6 / 65.4** (**+6.8 / +4.3 points**), and IGenBench Q-ACC/I-ACC increased from **51.3 / 4.2** to **69.5 / 17.0** (**+18.2 / +12.8 points**), while maintaining robust visual understanding capabilities without substantial degradation.
- **Generation Quality:** The model produces complex infographics across 100+ styles and layouts, with improved visual aesthetics and text rendering — including dense small text such as arXiv-style pages.
## Benchmark Highlights
| Model | BizGenEval Avg. (hard / easy) ↑ | IGenBench Q-ACC↑ | IGenBench I-ACC ↑ | OneIG(EN) ↑ | OneIG(ZH) ↑ |
| --- | ---: | ---: | ---: | ---: | ---: |
| ***Commercial Models*** | |
| Nano-Banana-Pro | 76.7 / 93.7 | 90.6 | 48.8 | 58.1 | 56.8 |
| Nano-Banana-2.0 | 68.5 / 92.5 | 85.6 | 34.4 | 54.0 | 54.9 |
| GPT-Image-1.5 | 35.9 / 81.6 | 55.0 | 12.0 | - | - |
| Qwen-Image-2.0 | 45.5 / 65.8 | 50.0 | 3.0 | 54.1 | 50.9 |
| Seedream-4.5 | 30.1 / 66.2 | 61.0 | 6.0 | 56.4 | 55.0 |
| ***Open-source Models*** | |
| **SenseNova-U1-8B-MoT-Infographic** | **46.6 / 65.4** | **69.5** | **17.0** | **55.6** | **53.3** |
| **SenseNova-U1-8B-MoT** | 39.8 / 61.1 | 51.3 | 4.2 | 54.5 | 53.8 |
| Z-Image | 8.2 / 43.8 | 30.0 | 1.0 | 54.6 | 53.5 |
| Qwen-Image-2512 | 6.3 / 41.0 | 32.2 | 1.0 | 53.0 | 51.5 |
| Qwen-Image | 2.8 / 23.8 | 36.0 | 0.0 | 53.9 | 54.8 |
| Bagel | 2.0 / 3.7 | 4.9 | 0.0 | 36.1 | 37.0 |
<sub>IGenBench scores are reported as percentages. Models are ordered by the arithmetic mean of BizGenEval hard, BizGenEval easy, IGenBench Q-ACC, and IGenBench I-ACC within the commercial and open-source groups separately. OneIG is included as a general generation reference. Full per-category results are intended for the Hugging Face model card.</sub>
## Qualitative Comparison
We present a qualitative comparison between the base **SenseNova-U1-8B-MoT** and the fine-tuned **SenseNova-U1-8B-MoT-Infographic** model across five key dimensions: background stability, chart accuracy, text rendering accuracy, arXiv paper rendering quality, and overall visual aesthetics.
### Background Stability
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/b2e4a010-921b-4d34-9ded-2f224a18ad16" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/318743df-6ee8-4ac8-960c-a4f75d2c51f3" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1804c1c1-6a5d-4608-b39d-2eea01c5fa6b" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/746f3604-c81a-4e36-8f81-a58606575ac5" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“版权视觉概览”,整体采用横向分栏布局,分为上下两个主要部分。上半部分为视觉化概览区,由四个彩色矩形区块并列组成,每个区块通过图标和简短标题传达一个核心概念;下半部分为“【版权基础常识】”详细解释区,包含四个编号条目,对应上半部分的四个主题,提供更详尽的文字说明。<br><br>**上半部分:版权视觉概览**<br><br>此区域由四个水平排列的彩色方块构成,从左至右依次为浅蓝色、浅黄色、浅绿色和浅紫色,每个方块内含一组图标和下方的中文标题。<br><br>1. **第一块(浅蓝色):创作即产生**<br> * **图标**:左侧是一个发光的灯泡,中间是一个带有笔的文档图标,右侧是一个锁头图标,三者之间用箭头连接,表示“创意 → 创作 → 保护”的流程。<br> * **文字**<br> * 图标下方有小字“自动保护”。<br> * 方块底部有大字标题“创作即产生”。<br><br>2. **第二块(浅黄色):核心权利**<br> * **图标**:中心是一只手掌向上托举,上方有多个元素围绕:一个带©符号的圆圈、一个喇叭、一堆金币和美元符号、以及多个指向不同方向的箭头,象征权利的多种表现形式和收益。<br> * **文字**<br> * 图标下方无额外小字。<br> * 方块底部有大字标题“核心权利”。<br><br>3. **第三块(浅绿色):特定条件平衡**<br> * **图标**:一个天平,左侧托盘上有打开的书本和标有“NEWS”的麦克风,代表“合理使用”;右侧托盘上有一个带锁的文件夹,代表“受控作品”。天平向右侧倾斜。<br> * **文字**<br> * 左侧托盘下方标注“合理使用”。<br> * 右侧托盘下方标注“受控作品”。<br> * 方块底部有大字标题“特定条件平衡”。<br><br>4. **第四块(浅紫色):保护期限**<br> * **图标**:左侧是一个沙漏,中间是一个向右的粗箭头,右侧是一个墓碑(顶部有十字架)。沙漏下方还有一个时钟图标。<br> * **文字**:<br> * 墓碑旁标注“作者有生之年 + X年”。<br> * 方块底部有大字标题“保护期限”。<br><br>**下半部分:【版权基础常识】**<br><br>此区域位于上半部分下方,背景为白色,包含四个独立的文本框,每个文本框都有一个彩色标题栏和下方的详细说明文字,颜色与上半部分对应。<br><br>1. **1. 自动获得保护**<br> * **标题栏**:蓝色背景,白色文字“1. 自动获得保护”。<br> * **正文**:“作品创作完成之时起,即自动享有版权,无需登记(登记主要是举证)。”<br><br>2. **2. 核心权利**<br> * **标题栏**:橙黄色背景,白色文字“2. 核心权利”。<br> * **正文**:“包括人身权(如署名权、修改权)和财产权(如复制权、发行权、信息网络传播权,可许可或转让获利)。”<br><br>3. **3. 合理使用**<br> * **标题栏**:绿色背景,白色文字“3. 合理使用”。<br> * **正文**:“在特定条件下(如教学、新闻报道、个人学习等),可以不经许可、不支付报酬使用,但需指明作者和出处,且不得侵犯其他权利。”<br><br>4. **4. 保护期限**<br> * **标题栏**:紫色背景,白色文字“4. 保护期限”。<br> * **正文**:“一般为作者有生之年加死后50年(中国大陆等多数地区),期限届满后进入公有领域。”<br><br>**整体风格与数据编码**<br>该信息图采用扁平化设计风格,色彩鲜明且分区清晰。通过颜色编码(蓝、黄、绿、紫)将四个主题进行视觉区分,并在上下两部分保持一致。图标作为主要的数据可视化手段,直观地表达了抽象概念。所有文字均为简体中文,内容结构严谨,逻辑清晰,旨在以图文结合的方式普及版权基础知识。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以中文为主要语言,采用横向四格布局,清晰呈现一个品牌从衰落到复兴的四个关键阶段。整体风格为手绘卡通插画,色彩柔和,线条简洁,具有亲和力和叙事性。每个阶段由上方的标题、中间的插图和下方的文字说明三部分构成,通过虚线分隔,结构分明。<br><br>第一阶段标题为“1. 曾经的辉煌与没落”,插图描绘了一座破败的城堡,城堡上挂着悲伤的表情,周围散落着皇冠,象征昔日荣耀的消逝;旁边立有标牌“OLD BRAND”,背景中可见大本钟,暗示传统或历史品牌。下方文字说明:“曾经是市场领导者,但未能跟上时代步伐,逐渐被遗忘,面临生存危机。”<br><br>第二阶段标题为“2. 创新与重塑”,插图展示四人团队围坐讨论,其中一人指向白板上的绿色叶子标志设计,周围环绕齿轮、灯泡(代表创意)和标牌“NEW IDEAS”。下方文字说明:“进行深度市场调研,重新定位品牌,引入创新设计和数字化策略,重塑核心价值。”<br><br>第三阶段标题为“3. 成功翻盘”,插图包含一只浴火重生的凤凰,象征涅槃;右侧是上升趋势的柱状图,下方是一个带有爱心的包裹,代表产品交付;一群欢呼的人群表达喜悦。下方文字说明:“凭借新产品和新形象重获消费者信任,业绩逆势上扬,重新赢得市场份额。”<br><br>第四阶段标题为“4. 未来展望”,插图描绘一枚火箭从地球轨道发射升空,周围有星星、云朵和一片绿叶,象征可持续发展;下方横幅写着“FUTURE READY”。下方文字说明:“持续创新,关注可持续发展和用户连接,立志成为更具影响力的未来品牌。”<br><br>整个信息图通过视觉隐喻(如城堡、凤凰、火箭)和数据图表(柱状图)结合,生动讲述了一个品牌从危机到复兴的完整故事,强调创新、用户信任和可持续发展的重要性。所有文本均为简体中文,无英文以外的其他语言。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/c07fe790-3d5f-4842-8515-e791504b86ca" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/7b243a49-5c4f-44af-9c10-8f0402bbadf9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ee66732e-7b2d-4e5e-b4f7-7ab3b326cff1" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/d672d362-709c-47aa-b12e-f32a79c07434" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "College Entrance Pathway Reforce Comparison" presents a structured comparison of key aspects for prospective students in Guangdong, China, aiming to enter college through a specialized entrance examination. The layout is organized as a multi-column table with four main columns: "Content Item / Evaluation Criteria", "Statistics", "Quotes", and "Key Terms". Each row corresponds to a distinct evaluation criterion or step in the preparation process, with visual icons, text, and data points enhancing clarity.<br><br>The infographic uses a clean, minimalist design with black line art icons on a light beige background. Text is primarily in bold sans-serif font, with headings emphasized for readability. Data is encoded using icons (e.g., graduation cap, calendar, books, target, rocket) to visually represent concepts, while numerical values are explicitly labeled for precision.<br><br>The first row addresses **Eligibility Criteria**:<br>- In the "Statistics" column, it features an icon of a person checking a map of Guangdong with the text: "Official Eligibility Requirements Confirm if you qualify to register".<br>- The "Quotes" column lists three eligible groups with corresponding icons: "Final-Year Guangdong Junior College Student", "Guangdong Resident <2 Years Post Graduation", and "Eligible Retired Military Personnel".<br>- The "Key Terms" column shows a magnifying glass over a document with the label: "Eligibility Verification".<br><br>The second row covers **Exam Structure & Scoring Breakdown**:<br>- "Statistics" displays icons representing different test types and scores: 100 pts (graduation cap), 200 pts (person at desk), 1000 pts (document with pen), 150 pts (document with pen). Below: "Total 500 points across 4 test papers".<br>- "Quotes" lists four subject components in document-shaped boxes: "Political Theory (100 pts)", "Major-Aligned Public Subject (100 pts)", "Professional Subject 1 (150 pts)", "Professional Subject 2 (150 pts)".<br>- "Key Terms" includes a balance scale icon with "Score Distribution".<br><br>The third row details the **Official Annual Exam Timeline**:<br>- "Statistics" contains a horizontal timeline with icons of a calendar and clock, labeled "Annual Key Timeline".<br>- "Quotes" provides a detailed timeline: Jan: Registration Open → Jan: Admission Open → Mid-Mar: Exam Date → Mid-Apr: Score Release → May-Jun: Admission Offers.<br>- "Key Terms" shows a calendar and clock with "Critical Dates".<br><br>The next three rows outline a three-step preparation strategy:<br><br>**Step 1 - Confirm Target Major & Institution**:<br>- "Statistics": Icon of a person holding a map with a target, text: "Confirm your target 6 months in advance".<br>- "Quotes": Two bullet points: "Download official exam syllabi and past professional subject papers from the target institution’s admission portal" and "Cross-verify that your junior college major meets the target major’s prerequisite requirements".<br>- "Key Terms": Clock and books with "Target Selection".<br><br>**Step 2 - Public Subject Foundation Building**:<br>- "Statistics": Icon of a person studying with books and a coffee cup, text: "Complete 3 months of structured public subject study".<br>- "Quotes": Two bullet points: "Complete 5+ years of past public subject exam papers to identify recurring test points" and "Political Theory allocates 30% of total score to current affairs from the past calendar year".<br>- "Key Terms": Box with lightbulb and "Core Knowledge".<br><br>**Step 3 - Professional Subject Sprint Revision**:<br>- "Statistics": Icon of a running person with a book and clock, text: "Focus on high-weight professional subjects in the final 2 months".<br>- "Quotes": Two bullet points: "Practice past professional subject papers from your target institution and review core major textbooks" and "60% of professional subject questions are repeated or adapted from past 3 years of papers for most institutions".<br>- "Key Terms": Trophy and gears with "Intensive Review".<br><br>Red horizontal lines separate the first three criteria from the three-step strategy, while a blue line separates Step 1 from Steps 2 and 3, visually grouping related content. All textual information is preserved exactly as presented, including spelling variations like "Oficial" (likely intended as "Official"). The infographic serves as a strategic roadmap combining official requirements, scoring details, timelines, and actionable preparation steps for candidates.</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "12-Month Market Performance: US vs. Asia" presents a structured, puzzle-piece-based visual analysis comparing the performance of US and Asian equity markets over a 12-month period. The layout is organized into three main steps, arranged in a central vertical flow with interconnected puzzle pieces, emphasizing a modular, analytical approach to market comparison. The design uses clean black-and-white line art with light blue accents for key sections, icons for visual representation, and clear typography for readability.<br><br>**Step 1** (top center) introduces the scope of the analysis. It features an illustration of four people examining charts, symbolizing data analysis. To the right, it defines the market indices being compared:<br>- **US Markets**: S&P 500, NASDAQ<br>- **Asian Markets**: Nikkei 225, Hang Seng, KOSPI, CSI 300<br><br>It also lists the types of data analyzed:<br>- Trailing Return (represented by a rising bar chart icon)<br>- Average Daily Volume (represented by a stacked bar chart icon)<br>- Top Sector Return (represented by a pie chart icon)<br><br>**Step 2** (left side, labeled "Metrics that account for 72% of short-term S&P 500 volatility") focuses on US Market Core Driving Indicators. This section contains icons representing industry (factory), finance (bank building), money (hand holding dollar sign), and labor (worker in hard hat). Below these icons, a light blue banner reads "US Market Core Driving Indicators". Specific metrics are listed with red warning triangle icons:<br>- CPI YoY: 3.2%<br>- Federal Funds Rate: 5.25–5.5%<br>- Non-farm Payrolls: +187k July 2024<br><br>**Step 3** (right side, labeled "Metrics that predict 68% of MSCI Asia Ex-Japan 3-month forward returns") focuses on Asian Market Core Leading Indicators. This section includes icons for shipping (container), manufacturing (gears), and calculation (calculator). A light blue banner below reads "Asian Market Core Leading Indicators". Specific metrics are listed:<br>- Manufacturing PMI: 51.2 (with red warning triangle)<br>- Q2 Export Growth: +6.8% YoY (with red warning triangle)<br>- Avg Policy Rate: 3.1% (with information circle icon)<br><br>At the bottom center, a large puzzle piece titled "Policy Shifts & Market Volatility Correlation" displays a line graph with two fluctuating lines:<br>- **US VIX (navy line)** — representing US market volatility<br>- **Asian Avg Volatility (green line)** — representing average Asian market volatility<br><br>Arrows connect the two lines, indicating correlation. Below the graph, key insights are provided with red warning triangles:<br>- Rate hike impact: +27% US VIX<br>- Trade policy impact: +34% Asian VIX<br>- Cross-regional sell-off correlation: 0.68<br><br>The overall structure visually represents how US and Asian market performances are driven by distinct but interrelated economic indicators, with a central focus on their volatility dynamics and policy impacts. The use of puzzle pieces metaphorically suggests that these components fit together to form a complete picture of global market trends. The infographic employs consistent iconography, color-coding (red for warnings, blue for core sections), and clear textual labeling to convey complex financial data in an accessible format.</div></details></td>
</tr>
</table>
### Chart Accuracy
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/43b9d92d-8c06-4eb6-944f-6a822403182c" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/88db7fe6-2b8b-494f-8eb8-979d38d6edbd" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/f2528515-c414-4db5-9c63-36fd2400021f" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/269e777a-2a21-49f2-82a2-fbceedb08027" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create a vintage-styled ornithology infographic with a light beige background and dark tree silhouettes framing the left and right sides. At the top, center-aligned, place the main title 'Alter bringt Vorsicht' in a bold, serif font, followed by the subtitle 'Sekunden pro Minute für Suchen vs. Picken bei Staren verschiedener Altersgruppen'. The central visualization is a diverging horizontal bar chart featuring age groups vertically listed in the middle: 'Jungvogel', 'Einjähriger Jungvogel', 'Zweijähriger Altvogel', 'Ausgewachsener Altvogel', and 'Senior'. For each group, a Rust Red bar extending to the right represents 'Picken', while a Sepia Tone bar extending to the left represents 'Suchen'. Numerical labels are placed inside each bar near the outer edge. Below the chart, two x-axes ranging from 0 to 40 reflect the time in seconds for both directions. A legend in the bottom right corner displays square markers for 'Picken' (Rust Red) and 'Suchen' (Sepia Tone). The given data is : [{"age_group": "Jungvogel", "pecking": 20, "scanning": 10}, {"age_group": "Einjähriger Jungvogel", "pecking": 30, "scanning": 20}, {"age_group": "Zweijähriger Altvogel", "pecking": 35, "scanning": 25}, {"age_group": "Ausgewachsener Altvogel", "pecking": 28, "scanning": 30}, {"age_group": "Senior", "pecking": 20, "scanning": 35}]</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: right; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic with the title 'Ajustes de Velocidad del Sprint' and the subtitle 'Desviación de alcance del equipo backend: Trabajo planificado vs. Corrección de errores no planificados' centered at the top. The central visual is a vertical bar chart set against a light gray background, featuring a horizontal baseline at the zero-point mark. Positive values (planned work) are represented by green bars extending upwards, while negative values (unplanned bug fixes) are represented by yellow bars extending downwards. Each bar is topped or bottomed with multi-line text containing the module name and its value in 'pts'. Specifically, the bars are labeled 'Autenticación de usuario' (50 pts), 'API de pago' (35 pts), 'Sincronización de inventario' (22 pts), 'BD de analítica' (14 pts), 'Servicio de correo' (-10 pts), 'Webhooks' (-28 pts), 'Carga de multimedia' (-42 pts), and 'Pasarela heredada' (-70 pts). An illustration of mechanical gears is positioned in the upper-right area, and a cluster of small bug icons is placed in the lower-left area. The vertical axis shows tick labels ranging from +60 pts to -80 pts. The given data is : [{"category": "Autenticación de usuario", "unit": "pts", "value": 50}, {"category": "API de pago", "unit": "pts", "value": 35}, {"category": "Sincronización de inventario", "unit": "pts", "value": 22}, {"category": "BD de analítica", "unit": "pts", "value": 14}, {"category": "Servicio de correo", "unit": "pts", "value": -10}, {"category": "Webhooks", "unit": "pts", "value": -28}, {"category": "Carga de multimedia", "unit": "pts", "value": -42}, {"category": "Pasarela heredada", "unit": "pts", "value": -70}]</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/bec1bd39-72bf-4bed-abdc-df02ab2c0fec" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/5ac254e1-b1c4-4530-9070-4fb4774451b4" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/38ae1611-e9c8-4d97-a867-e2cc43ced2ad" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/7f0dab24-e67a-4ee9-b9c0-ee97e283ac9f" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that features a title and a subtitle centered at the top, reading 'Fastest Cuisines to Prepare' and 'Average Ghost Kitchen Handover Time by Item Type (Minutes)' respectively. The main visual is a horizontal grouped bar chart combining a Fast-food neon visual style with checkerboard borders along the edges, featuring a centered legend above the chart area for 'QuickEats' (cyan neon border) and 'DashNow' (orange neon border). To the bottom right of the bar chart, there is a simple illustration of two mopeds waiting for orders. The chart's vertical axis lists four categories, each preceded by a simple icon, while the horizontal axis represents handover time in minutes with numerical labels at 0, 5, 10, 15, and 20, supplemented by dotted vertical gridlines. Each category features a pair of black bars representing the two platforms, with exact values displayed directly inside the right end of each bar. For 'Classic Tacos', QuickEats takes 10.0 minutes while DashNow takes 11.5 minutes. 'Supreme Burritos' require the longest preparation, with 17.5 minutes for QuickEats and 19.0 minutes for DashNow. 'Spicy Nachos' take 9.5 minutes on QuickEats and 10.0 minutes on DashNow. Finally, 'Mini Quesadillas' are the fastest, taking 8.0 minutes for QuickEats and 8.5 minutes for DashNow. The given data is : [{"category": "Classic Tacos", "platform": "QuickEats", "unit": "Minutes", "value": 10.0}, {"category": "Classic Tacos", "platform": "DashNow", "unit": "Minutes", "value": 11.5}, {"category": "Supreme Burritos", "platform": "QuickEats", "unit": "Minutes", "value": 17.5}, {"category": "Supreme Burritos", "platform": "DashNow", "unit": "Minutes", "value": 19.0}, {"category": "Spicy Nachos", "platform": "QuickEats", "unit": "Minutes", "value": 9.5}, {"category": "Spicy Nachos", "platform": "DashNow", "unit": "Minutes", "value": 10.0}, {"category": "Mini Quesadillas", "platform": "QuickEats", "unit": "Minutes", "value": 8.0}, {"category": "Mini Quesadillas", "platform": "DashNow", "unit": "Minutes", "value": 8.5}]</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that presents a centered title at the top, stating "Übertaktet vs. Standard-Takt", with the subtitle "Temperaturanstieg bei langen Gaming-Sessions" directly below it. The main visual is a line chart spanning the width of the infographic on a dark background, embodying a Gamer Aesthetic with vibrant RGB neon accents. This chart has a vertical axis on the left labeled with numerical values in increments of 10 from 30°C to 100°C, and a horizontal axis at the bottom with time labels: '0m', '15m', '30m', '45m', '60m', '75m', '90m', '105m', and '120m'. Horizontal grid lines mark each 10°C increment. A horizontal legend is positioned under the subtitle, containing a cyan circular marker and line for "Standard-Takt" and a magenta circular marker and line for "Übertaktet (+150MHz)". Two data series are plotted as glowing neon lines with hollow circular markers at each data point, accompanied by gradient shading below each line. The cyan "Standard-Takt" line shows a steep rise from 38°C at 0m to 68°C at 15m, followed by a flat plateau reaching 73.5°C at 120m. The magenta "Übertaktet" line displays a similar initial spike from 42°C to 75°C, but continues with a gradual linear creep up to 93°C at 120m. Spike annotations (callout boxes) point to the final data points on the right, highlighting the peak temperatures: a magenta box reads "Peak: 93°C" and a cyan box reads "Peak: 73.5°C". A stylized thermometer line-art icon is subtly placed in the center of the chart's background. The given data is : [{"profile": "Standard-Takt", "temperature": 38, "time": "0m"}, {"profile": "Übertaktet", "temperature": 42, "time": "0m"}, {"profile": "Standard-Takt", "temperature": 68, "time": "15m"}, {"profile": "Übertaktet", "temperature": 75, "time": "15m"}, {"profile": "Standard-Takt", "temperature": 71, "time": "30m"}, {"profile": "Übertaktet", "temperature": 79, "time": "30m"}, {"profile": "Standard-Takt", "temperature": 72, "time": "45m"}, {"profile": "Übertaktet", "temperature": 82, "time": "45m"}, {"profile": "Standard-Takt", "temperature": 72.5, "time": "60m"}, {"profile": "Übertaktet", "temperature": 85, "time": "60m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "75m"}, {"profile": "Übertaktet", "temperature": 87, "time": "75m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "90m"}, {"profile": "Übertaktet", "temperature": 89, "time": "90m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "105m"}, {"profile": "Übertaktet", "temperature": 91, "time": "105m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "120m"}, {"profile": "Übertaktet", "temperature": 93, "time": "120m"}]</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/2387f67f-2a65-4001-b579-9c1264f4796f" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/5d8c92b4-4066-4f5d-af77-0f71fd86e51a" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1b0ea5e1-b241-4902-9bc0-485229f60592" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/d50977de-d5ab-42fd-bcfe-a2001b18fbcb" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that displays data in a vertical diverging bar chart format. At the top left of the visualization, there is a title: 'Anomalie de l'Atlantique Sud : Dérive magnétique', and a subtitle: 'Vecteurs de dérive vers l'est et l'ouest en kilomètres par rapport à la ligne de base historique'. In the upper left area below the text, an icon of a compass rose is placed within a magnetic field line curve. The main chart features a horizontal zero-axis line, labeled with a '0' on the far left, representing the historical coordinate baseline. The x-axis at the bottom displays the decades '1980', '1990', '2000', '2010', and '2020', each marked with a small vertical tick. For each decade, a vertical bar extends from the zero-axis, with its corresponding data label positioned directly at the end of the bar. The data shows westward drift represented by blue bars extending below the axis for '1980' with a value of '-15 km' and '1990' with a value of '-32 km'. Eastward drift is represented by red bars extending above the axis for '2000' with a value of '+10 km', '2010' with a value of '+45 km', and '2020' with a value of '+68 km'. The overall visual style mimics a geophysical science journal, utilizing compass red and blue color tones. The given data is : [{"decade": "1980", "drift_km": -15}, {"decade": "1990", "drift_km": -32}, {"decade": "2000", "drift_km": 10}, {"decade": "2010", "drift_km": 45}, {"decade": "2020", "drift_km": 68}]</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic in a corporate report minimalism style with muted corporate grays and blues, featuring a large title, 'Seasonal Fluctuations in 15-Year Mortgages', at the top. Directly below it is a subtitle, 'Historical prepayment velocities showing seasonal housing market trends'. Underneath the subtitle, a horizontal legend identifies two categories with small square icons: 'Spring/Summer Originations' in lighter gray-blue and 'Fall/Winter Originations' in darker gray-blue. The main visual is a multi-line chart in a wide landscape orientation. The vertical axis has numeric labels at 0.0, 5.0, 10.0, 15.0, and 20.0, with horizontal grid lines extending across the plot. The horizontal axis features labels: 'Jan 2018', 'Apr', 'Jul', 'Oct', 'Jan 2019', 'Apr', and 'Jul'. An icon depicting a sleek house silhouette is positioned in the upper left corner of the chart's plotting area. Two distinct lines represent the categories, characterized by cyclical seasonal bumps in the summer months. Both lines have square markers at each data point, with numerical values displayed near them. The lighter line for 'Spring/Summer Originations' plots a value of 8.0 in Jan 2018, rising to 12.5 in Apr, peaking at 16.0 in Jul, dipping to 11.0 in Oct, dropping further to 7.5 in Jan 2019, climbing to 13.0 in Apr, and reaching 17.5 in Jul. The darker line for 'Fall/Winter Originations' mirrors this pattern, starting at 6.5 in Jan 2018, increasing to 9.0 in Apr, hitting 14.5 in Jul, falling to 10.0 in Oct, bottoming out at 6.0 in Jan 2019, rising to 10.5 in Apr, and ending at 15.0 in Jul. The given data is : [{"category": "Spring/Summer Originations", "date": "2018-01", "value": 8.0}, {"category": "Fall/Winter Originations", "date": "2018-01", "value": 6.5}, {"category": "Spring/Summer Originations", "date": "2018-04", "value": 12.5}, {"category": "Fall/Winter Originations", "date": "2018-04", "value": 9.0}, {"category": "Spring/Summer Originations", "date": "2018-07", "value": 16.0}, {"category": "Fall/Winter Originations", "date": "2018-07", "value": 14.5}, {"category": "Spring/Summer Originations", "date": "2018-10", "value": 11.0}, {"category": "Fall/Winter Originations", "date": "2018-10", "value": 10.0}, {"category": "Spring/Summer Originations", "date": "2019-01", "value": 7.5}, {"category": "Fall/Winter Originations", "date": "2019-01", "value": 6.0}, {"category": "Spring/Summer Originations", "date": "2019-04", "value": 13.0}, {"category": "Fall/Winter Originations", "date": "2019-04", "value": 10.5}, {"category": "Spring/Summer Originations", "date": "2019-07", "value": 17.5}, {"category": "Fall/Winter Originations", "date": "2019-07", "value": 15.0}]</div></details></td>
</tr>
</table>
### Text Rendering Accuracy and Size Appropriateness
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/e293eea7-6ec3-44a0-80a8-8ab2760bf6a7" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1ce99b25-eb09-4f4c-b8a5-76f27571e1a3" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/24a21e2e-4974-4b7a-98f7-e3ae49f750fa" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ebe69047-e668-4e14-83b3-df6261a92ca7" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以手绘笔记本风格呈现,标题为“吉伊卡哇带你游:加泰罗尼亚国家艺术博物馆(MNAC)三天两夜不绕路攻略”,副标题为“行程路线与时间安排(中文清晰版)”。整体采用暖黄色调背景,搭配棕色边框和螺旋装订线设计,营造出温馨可爱的旅行手册氛围。内容分为三个主要垂直区块,分别对应DAY 1、DAY 2、DAY 3,每个区块顶部有圆形时钟图标和“DAY X”标签,结构清晰。<br><br>每个日期区块内均以时间轴形式列出具体行程,使用圆点连接时间点与活动描述,右侧配有吉伊卡哇系列的可爱卡通形象插画(如白熊、蓝猫、兔子等),增强趣味性。所有文字均为简体中文,字体清晰易读,视觉层次分明。<br><br>---<br><br>**DAY 1:抵达与初探**<br>- **10:00** 抵达巴塞罗那,酒店办理入住 (Poble Sec区) —— 配有白熊拖着行李箱的插画。<br>- **12:00** 午餐:西班牙Tapas —— 插画未显示。<br>- **14:00** 前往西班牙广场 (Plaza de España),远眺MNAC全景 —— 配有西班牙广场建筑插画及地图箭头。<br>- **16:00** 参观MNAC外部建筑与周围花园 —— 配有蓝猫在花丛中跳跃的插画。<br>- **19:00** 欣赏魔幻喷泉表演 (Magic Fountain) —— 配有带闪光效果的白熊插画。<br>- **20:30** 晚餐:附近餐厅 —— 插画未显示。<br><br>---<br><br>**DAY 2:MNAC深度艺术之旅**<br>- **09:30** 早餐,步行至MNAC入口 —— 配有白熊吃面包的插画。<br>- **10:00** 进入MNAC (建议提前购票),参观罗马式艺术馆藏 —— 配有古典油画插画。<br>- **12:30** 馆内简餐或附近午休 —— 插画未显示。<br>- **14:00** 参观哥特式、文艺复兴及巴洛克艺术馆藏 —— 配有蒙娜丽莎风格肖像画插画及蓝猫形象。<br>- **16:30** 探索现代艺术馆藏 (加泰罗尼亚现代主义) —— 配有抽象艺术风格插画。<br>- **18:30** 前往MNAC屋顶观景台,俯瞰城市日落 —— 配有兔子举手机拍照的插画。<br>- **20:00** 晚餐:Arenas商场附近 —— 插画未显示。<br><br>---<br><br>**DAY 3:蒙特惠奇山周边与返程**<br>- **09:00** 早餐,退房寄存行李 —— 插画未显示。<br>- **10:00** 乘坐缆车前往蒙特惠奇城堡 (Montjuïc Castle) —— 配有缆车插画,内含三只卡通动物。<br>- **12:00** 参观米罗基金会 (Joan Miró Foundation) —— 配有米罗风格抽象雕塑插画。<br>- **13:30** 午餐:奥林匹克港附近海鲜饭 —— 插画未显示。<br>- **15:00** 漫步奥林匹克公园 —— 插画未显示。<br>- **16:30** 提取行李,前往机场/车站返程 —— 配有开心挥手的白熊插画。<br><br>---<br><br>**底部交通贴士栏**<br>配有公交车、地铁、步行鞋图标,文字为:“交通贴士:善用T-casual交通卡,步行探索更佳!”<br><br>---<br><br>整体图表类型为时间序列流程图,通过垂直分栏与水平时间轴结合的方式组织信息。数据编码方式包括时间点(精确到分钟)、地点名称、活动描述及配套插画,所有信息均按逻辑顺序排列,便于用户快速理解并执行三天行程计划。视觉元素丰富,兼具实用性和趣味性,适合旅游攻略类内容传播。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents a comprehensive architectural and structural analysis of the Temple of Kom Ombo, an ancient Egyptian temple located on the west bank of the Nile River. The title "TEMPLE OF KOM OMBO" is prominently displayed in a hand-drawn, white-bordered box in the lower-right corner of the image, set against a brown background that mimics sandstone or earth tones. The overall layout is divided into multiple sections: a central photographic image of the temple ruins under a clear blue sky, surrounded by illustrative technical diagrams, annotated floor plans, and textual data blocks, all rendered in white line art and text for high contrast.<br><br>The central photograph shows the main hypostyle hall and surrounding structures of the temple, with visitors walking among the columns and courtyards, providing a sense of scale. In the background, the Nile River and palm trees are visible, situating the temple in its natural environment. The ruins are constructed from light-colored sandstone blocks, consistent with the material noted in the text.<br><br>In the upper-left quadrant, a 3D axonometric diagram illustrates the overall dimensions of the temple complex: approximately 62 meters by 51 meters, labeled along the axes. Adjacent to this, a list of key structural facts is presented in bullet points:<br>- TEMPLE AXIS: DOUBLE SANCTUARY FOR SOBEK & HORUS<br>- OVERALL DIMENSIONS (APPROX. 62M x 51M)<br>- CONSTRUCTION MATERIAL: SANDSTONE BLOCKS<br>- COLUMN HEIGHTS: UP TO 12 METERS<br><br>Above the central photo, two schematic diagrams illustrate architectural details:<br>- A top-down view of the hypostyle hall showing 30 columns arranged in a grid, labeled “HYPOSTYLE HALL (30 COLUMNS)” and pointing to “TWO SANCTUARIES.”<br>- A cross-section labeled “PYLON AND HYPOSTYLE SECTION,” which includes a detailed vertical cutaway showing the roofing system supported by columns, with arrows indicating load paths down to foundations.<br><br>To the right of the central image, text notes “TWO ENTRANCES SYMBOLIZING DUALITY,” emphasizing the temple’s unique dual dedication. This concept is reinforced in the lower section of the infographic, where a detailed floor plan is overlaid on the brown ground area.<br><br>The floor plan, drawn in white lines, is annotated with various features:<br>- INNER TEMPLE (FOR SOBEK) — marked with a rectangular inner sanctum.<br>- INNER TEMPLE (FOR HAROERIS) — another distinct inner sanctum, indicating the dual religious function.<br>- NILOMETER — a structure used to measure the Nile’s water level.<br>- BIRTH HOUSE (MAMMISHI) — a smaller chamber associated with fertility rituals.<br>- MUMMIFIED CROCODILE MUSEUM SITE — indicating a location within the temple complex for sacred crocodile mummies.<br>- TWO ENTRANCES SYMBOLIZING DUALITY — shown as two separate entryways on the plan.<br><br>Surrounding the floor plan are inset images of relief carvings, each labeled:<br>- MEDICAL INSTRUMENT RELIEFS — depicting figures with tools.<br>- TWO ENTRANCES RELIEFS — showing doorways flanked by deities.<br>- CALENDAR RELIEFS — illustrating scenes related to timekeeping or agricultural cycles.<br><br>Additional annotations point to structural aspects:<br>- “STRUCTURAL LOAD PATHS FROM COLUMNS TO FOUNDATIONS” — illustrated with curved arrows tracing the force transfer from columns through the walls to the ground.<br>- The pylon and hypostyle section diagram also labels “ROOFING SYSTEM” and shows how the roof beams rest on column capitals.<br><br>All textual content is in English, using a clean, sans-serif font that enhances readability. The visual style blends real photography with technical illustrations and hand-drawn elements, creating an educational and engaging format suitable for tourists, students, or archaeologists. The infographic effectively communicates both the physical characteristics and symbolic significance of the Temple of Kom Ombo, highlighting its duality, engineering, and cultural importance.</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/7fef7b78-f696-40cc-bdb6-3f712103f516" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/a005796d-34be-45ba-85ad-b591e29bb664" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/f7762af7-6989-4362-8d37-1c5e5e84d061" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ee00a7df-3158-4331-a14c-bf53a1bc8b85" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“面试成功技巧图解”,采用左右分栏对比结构,左侧为“成功”部分(白色背景,红色主题色),右侧为“失败”部分(灰色背景,黑色主题色),通过鲜明的视觉对比突出两种截然不同的面试表现。整体布局清晰,分为三个主要横向模块,每个模块对应一个核心维度,并配有图标、标题和要点列表。<br><br>标题“面试成功技巧图解”位于顶部中央,使用加粗黑体字,字号较大,居中对齐,作为整个图表的总纲。<br><br>左侧“成功”栏目上方有红色标题“成功”,下方副标题为“积极、准备、专业”,用黑色字体书写,强调成功者的特质。右侧“失败”栏目上方有黑色标题“失败”,下方副标题为“被动、随意、消极”,同样用黑色字体,形成对立。<br><br>第一模块:准备<br>- 成功侧:<br> - 图标:红色文件夹叠加放大镜,象征研究与准备。<br> - 标题:“充分准备”<br> - 要点列表:<br> • 深入研究公司<br> • 模拟问答练习<br> • 携带作品集<br>- 失败侧:<br> - 图标:黑色文件夹内含问号,象征无知或未准备。<br> - 标题:“准备不足”<br> - 要点列表:<br> • 对公司一无所知<br> • 回答模糊不清<br> • 两手空空赴约<br><br>第二模块:沟通<br>- 成功侧:<br> - 图标:红色对话气泡(带声波)连接两人,中间有眼睛符号,象征倾听与交流。<br> - 标题:“有效沟通”<br> - 要点列表:<br> • 清晰表达观点<br> • 积极倾听反馈<br> • 自信肢体语言<br>- 失败侧:<br> - 图标:黑色对话气泡内含红色叉号,下方两人表情沮丧,象征沟通中断。<br> - 标题:“沟通障碍”<br> - 要点列表:<br> • 表达混乱无序<br> • 打断对方说话<br> • 眼神闪躲回避<br><br>第三模块:态度<br>- 成功侧:<br> - 图标:红色笑脸配向上箭头,象征积极向上的态度。<br> - 标题:“积极态度”<br> - 要点列表:<br> • 展现热情兴趣<br> • 提出深度问题<br> • 表达真诚感谢<br>- 失败侧:<br> - 图标:黑色哭脸配向下箭头,象征消极情绪。<br> - 标题:“消极态度”<br> - 要点列表:<br> • 缺乏热情活力<br> • 只关注薪资待遇<br> • 抱怨前雇主<br><br>所有模块之间由细线分隔,成功侧各模块间以红色横线分隔,失败侧以黑色横线分隔。左右两栏由一条粗红黑双色竖线分隔,强化对比效果。整体设计简洁明了,通过颜色、图标和文字三重编码,直观传达面试成功与失败的关键差异。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以“相对论时间膨胀图解”为标题,采用几何抽象风格设计,整体布局清晰,色彩鲜明,主要使用红、黄、蓝、黑、白五种颜色构成蒙德里安风格的几何图形背景。图中通过视觉化元素和文字说明,系统解释了狭义相对论中的时间膨胀效应。<br><br>**标题部分**<br>位于图像顶部中央,黑色粗体字“相对论时间膨胀图解”,字体醒目,占据显著位置,明确点出主题。<br><br>**主体图示部分**<br>- **左侧圆形区域代表“地球”**:蓝色圆圈内嵌红色方块,方块中有一个黄色边框的白色时钟,指针指向约12:05(或12:06),时钟下方标注“速度慢,时间快”。此区域象征静止或低速参考系。<br>- **右侧三角形区域代表“太空飞船”**:黄色三角形内嵌蓝色圆圈,圆圈中有一个蓝色边框的白色时钟,指针指向约12:10,时钟下方标注“速度快,时间慢”。此区域象征高速运动参考系。<br>- **连接箭头与标签**:一条粗大的黄色箭头从地球指向太空飞船,表示运动方向;箭头中部标有“时间膨胀效应”,并配有一个向上弯曲的黄色弧形箭头,强调时间流逝差异。<br>- **背景装饰**:由多种几何形状(矩形、三角形、半圆、四分之一圆)拼接而成,形成类似蒙德里安艺术风格的抽象图案,增强视觉吸引力,但不干扰核心信息传递。<br><br>**底部三栏文本说明部分**<br>位于图像下半部分,分为三个垂直栏目,分别用不同颜色标题区分:<br><br>1. **基本概念**(红色标题):<br> - 文本内容:“运动物体的速度越快,其自身的时间流逝相对于静止观察者就越慢。这是狭义相对论的推论。”<br> - 说明时间膨胀的基本定义及其理论来源。<br><br>2. **核心原理**(黄色标题):<br> - 文本内容:“光速在真空中对于所有观察者都是不变的,导致空间和时间必须改变以维持这一恒定。”<br> - 解释时间膨胀的根本原因——光速不变原理。<br><br>3. **应用实例**(蓝色标题):<br> - 文本内容:“GPS卫星因高速运动和地球引力,需进行时间校正以确保定位准确。粒子加速器中的高能粒子寿命延长。”<br> - 列举两个实际应用场景,体现理论的现实意义。<br><br>**整体结构与设计特点**<br>- 图表类型:概念性示意图 + 文本说明组合。<br>- 数据编码方式:通过时钟指针位置的视觉差异(地球时钟较慢,飞船时钟较快)直观展示时间流逝速率的不同,配合文字标签强化理解。<br>- 视觉元素:大量使用几何图形、对比色块、箭头和图标,使复杂物理概念易于理解。<br>- 信息层级:从宏观图示到微观文字,由形象到抽象,逻辑递进,层次分明。<br><br>该信息图成功将抽象的物理学概念转化为直观、易懂的视觉语言,兼具科学性和艺术性。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/bd227342-bed9-4c03-85c9-499b2a1dc97e" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/7dbed837-ac18-4d34-9db3-30e69ce37431" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/08be1f31-3328-4883-88c3-804a3a47ae93" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/84f176e5-4fa5-4514-9f74-da123d04ed73" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以黑板风格设计,标题为“地方特色&活动微信公众号推广全指南”,整体采用手绘粉笔字效果,配以彩色图标和箭头,视觉上模拟真实黑板书写场景。内容结构清晰,分为三个主要部分,通过灰色弧形箭头连接,形成逻辑递进关系:从推广内容核心方向 → 高转化活动推广玩法 → 微信公众号生态适配推广技巧。<br><br>第一部分:“推广内容核心方向:深挖本地特色记忆点”,强调通过三类高流量本地内容吸引用户共鸣并吸引外地游客打卡:<br>- **本土美食**(黄色椭圆标签):包含老字号小吃、季节性特色食俗、社区隐藏小店探店内容,配有热汤碗与筷子图标。<br>- **人文风物**(棕色椭圆标签):涵盖非遗技艺传承故事、老街老巷历史、本地名人旧居探访内容,配有传统建筑与布鞋图标。<br>- **便民福利**(粉色椭圆标签):包括本地专属消费券、景区免票政策、节庆活动预告等内容,配有优惠券与礼盒图标。<br><br>第二部分:“高转化活动推广3种实用玩法”,旨在拉满参与转化率:<br>- **节庆市集玩法**(橙色椭圆标签):公众号预热发早鸟票+留言抽免费参与名额+现场打卡返现,配有灯笼与摊位图标。<br>- **非遗体验玩法**(绿色椭圆标签):开放公众号专属报名通道+提前发布体验官预告内容+活动后用户投稿返现,配有陶艺与织布机图标。<br>- **消费促进玩法**(紫色椭圆标签):联合本地商家推出公众号专属消费券包+到店核销送定制周边,配有购物袋与银行卡图标。<br><br>第三部分:“微信公众号生态适配推广技巧”,聚焦降低推广成本:<br>- **内容呈现技巧**(蓝色椭圆标签):封面图用本地标志性建筑/美食做视觉符号,首图放置活动倒计时海报,文末加一键报名跳转链接,配有手机图标。<br>- **渠道联动技巧**(黄色椭圆标签):视频号发布活动花絮挂载公众号链接,朋友圈广告定向推送给本地18-60岁人群,本地社群转发带专属抽奖码,配有三人社交网络图标。<br>- **私域留存技巧**(绿色椭圆标签):活动参与者引导添加企业微信,拉入本地福利群后续持续推送活动信息,配有微信对话气泡图标。<br><br>整个信息图布局呈垂直流线型,各模块之间以曲线箭头连接,右侧点缀有简笔小人和感叹号等装饰元素,增强趣味性和可读性。文字排版层次分明,主标题白色粗体,副标题与核心概念使用黄色或彩色突出,细节说明则为白色常规字体。所有文本均为中文,无英文或其他语言内容。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为《儿童营养补充全指南:科学建议+产品选购要点》,采用漫画风格设计,色彩鲜明,以红、黄、蓝为主色调,布局清晰分为左右两大板块,每个板块又细分为多个模块,图文并茂地呈现了儿童营养补充的科学指导与实用建议。<br><br>整体结构分为“科学参考指引”和“实操应用指南”两大核心部分,通过卡通插图、图标、爆炸式对话框、标签等视觉元素增强可读性与吸引力。<br><br>---<br><br>**第一部分:科学参考指引**<br><br>1. **分龄营养补充重点清单**<br> - 标题:“分龄营养补充重点清单”,副标题:“分龄补营养,精准更高效;对应年龄段按需补充,避免过度摄入”<br> - 内容按年龄分三个阶段:<br> - **0-6月龄**:每日常规补充维生素D 400IU,纯母乳喂养宝宝需额外补充维生素K。配图:婴儿头像、Vit D注射器、Vit K胶囊。<br> - **7月龄-3岁**:重点补充铁(Fe)、锌(Zn)、DHA,每日维生素D补充量维持在400-600IU。配图:幼儿头像、放大镜观察胶囊、Fe和Zn符号。<br> - **4-12岁**:重点补充钙(Ca)、维生素A、B族维生素(B_B),保证每日蛋白质摄入量达标。配图:男孩头像、Ca气泡、B_B气泡、鸡蛋、牛奶瓶、眼睛图标。<br><br>2. **营养补充原则&常见避坑指南**<br> - 标题:“营养补充原则&常见避坑指南”,副标题:“科学补营养,这些坑要避开”<br> - 包含两个核心原则:<br> - **优先膳食摄入**(绿色对勾):核心原则1:日常均衡饮食是营养摄入的首要来源,不可用补充剂代替正常三餐。配图:孩子用餐场景,盘中有蔬菜、水果、肉类。<br> - **按需适量补充**(红色STOP标志):核心原则2:营养素补充并非越多越好,过量摄入维生素A、钙等可能引发中毒或代谢负担。配图:多瓶补剂被红色叉号覆盖。<br> - **避坑指南**(黄色标签):<br> - ① 不做体检评估盲目跟风补 ❌<br> - ② 把网红补剂当零食给孩子吃 ❌<br> - ③ 用成人补充剂减量给儿童服用 ❌<br> - 配图:红色“避坑”爆炸框,带有闪电效果。<br><br>---<br><br>**第二部分:实操应用指南**<br><br>1. **儿童营养补充产品3步选购法**<br> - 标题:“儿童营养补充产品3步选购法”,副标题:“儿童补剂选购3步判断法”<br> - 三步法分别由放大镜图标引导:<br> - **看合规标识**:优先选择带蓝帽标识的保健食品,或有婴幼儿/儿童专用备案标识的正规产品,拒绝三无产品。配图:放大镜聚焦“蓝帽”标志。<br> - **看配料成分**:优先选择无额外添加蔗糖、香精、人工色素、防腐剂的产品,致敏原标注清晰明确。配图:文件上贴有“无添加”印章,绿色对勾。<br> - **看适配年龄**:选择标注对应适用年龄段的儿童专用产品,不要自行将成人补充剂减量给孩子服用。配图:药瓶标签上“年龄”被红圈突出。<br><br>2. **常见儿童补剂适用场景对照表**<br> - 标题:“常见儿童补剂适用场景对照表”<br> - 表格形式,两列:左侧“补剂类型”,右侧“适用场景”,背景色交替为红、蓝。<br> - 具体内容:<br> - **维生素D滴剂** → 全年龄段儿童日常常规补充,预防佝偻病、促进钙吸收。配图:滴管瓶、骨头图标。<br> - **铁剂** → 体检确诊缺铁性贫血,或日常红肉、动物肝脏摄入不足的儿童。配图:滴管瓶、儿童头像。<br> - **DHA藻油** → 日常深海鱼摄入不足的儿童,辅助促进视网膜和大脑发育。配图:鱼形胶囊、大脑与眼睛图标。<br> - **钙剂** → 日常奶量不足、身高增长偏缓,经体检确认缺钙的儿童。配图:白色药片、儿童测量身高图。<br><br>---<br><br>**视觉与排版特征:**<br>- 整体采用网格化布局,四个主要模块分布在2x2的象限中。<br>- 使用大量漫画元素:如爆炸框、对话气泡、箭头、感叹号、禁止符号等。<br>- 图标系统丰富:Vit D、Fe、Zn、Ca、B_B、蓝帽、无添加、年龄、STOP等均有专属图形标识。<br>- 字体加粗、阴影、边框强调关键信息,如标题、数字、警示语。<br>- 色彩编码明确:黄色用于提示重点,蓝色用于说明步骤,红色用于警示或禁止。<br><br>该信息图内容全面,逻辑清晰,兼具科学性和实用性,适合家长快速掌握儿童营养补充的核心知识与选购技巧。</div></details></td>
</tr>
</table>
### Paper Rendering Quality
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/cf47774c-9724-4140-98d6-32c1eee1b6b7" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/bbe55788-6e19-46f6-8f3b-b0098e89d210" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/90c231fb-fef5-4559-b0a7-b994bcc65386" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/b95f8a67-d7ef-40a1-be13-51890e761c4c" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is laid out with two tables at the top, followed by a two-column text layout. The tables span the full width of the text area. The text includes a section heading.<br><br>[paragraphs]<br><br>the TOPIC MODELER, the GENDER SEGMENTER, and an OTHER module (transcript length and duration). We test for a linear relationship between each pair of variables: $H_O : r = 0$, $H_A : r \neq 0$, where $H_O$ is the origi-nal hypothesis, $H_A$ is the alternate hypothesis, and $r$ is the Pearson’s correlation coefficient. We follow Reddy et al. (2021) and Yang et al. (2019) and apply a Bonferroni cor-rection to our $\alpha$ value of $0.05$, setting $\alpha = 0.05/z$, where $z = \binom{124}{2} = 7,626$ for LDA, representing the number of feature relationships we consider. Hence, we reject $H_O$ in favor of $H_A$ if $p \leq \alpha$. Given the largeness of $z$, our $\alpha$ value becomes small, making our criteria for significance strict and thus suitable for investigating our research ques-tions. Furthermore, we filter our correlations $r$, such that $\Vert r\Vert > 0.1$ for our LDA experiments, and $\Vert r\Vert > 0.05$ for our BERTopic experiments (due to the smaller sample size of 10,000 podcasts, and fewer samples may have higher vari-ance). Our results focus on a selection of these significant correlations; the full results are available on the project web-site: https://www.gendered-discourse.net/extended-results.<br><br>### RQ0: How Are Women and Men’s Discourse Different?<br><br>Using GDCF, our Gendered Discourse Correlation Frame-work shown in Figure 2, we then analyze significant corre-lations between between the gender features from the GEN-DER SEGMENTER module (Doukhan et al. 2018a), and the topic features from the TOPIC MODELER module (Blei, Ng, and Jordan 2003). We use the discourse topics to automati-cally form gendered discourse word lists via their significant correlations.<br><br>Starting with the first row of Table 1, we see that Topic 3’s word list returned by LDA with Non-Contextual Embed-dings (Bag-Of-Words) (via the TOPIC MODELER module) contains the words women, woman, men, baby, pregnant, girls, men, doctor, health, birth (in descending weighted or-der). Based on this word list, we manually interpret this topic as being a content topic, specifically about pregnancy, as noted in the column “Topic N Categories.” Then, we look to the gender correlations in the columns “Gender” and “$r$,” and see that $r(\text{Topic 3, Women}) = +0.15$ and $r(\text{Topic 3, Men}) = -0.14$. This indicates that the topic of pregnancy positively correlates with women (identified via the GENDER SEGMENTER module), and negatively corre-lates with men. Therefore, we associate Topic 3 (Content - Pregnancy) with Women, as noted in the “Topic N Gender” column. Similarly, we make these associations in the “Topic N Gender” column for Topics 10, 49, and 71.<br><br>Next, we focus on the Topic 54 row. This topic is inter-preted using the word list get, like, know, right, people, go-ing, podcast, make, want, one. This word list does not refer to any content, hence, we manually interpret this topic as being a discourse topic. Moving to the gender correlations, we see that $r(\text{Topic 54, Women}) = \emptyset$ and $r(\text{Topic 3, Men}) = +0.12$. The reason for $r(\text{Topic 54, Women}) = \emptyset$ is because the correlation between the features Topic 54 and Women did not come back as significant. However, due to the positive correlation of $0.12$ for Topic 3 and Men, we manually as-sociate Topic 3 with Men in the “Topic N Gender” column.<br><br>[tables]<br><br>Table 1: LDA with Non-Contextual Embeddings (Bag-Of-Words): The complete set of significant correlations between gender features and topic features – both content topics and discourse topics. Based on $r$, the Topic N Gender forms the gendered (discourse) word lists via Topics 54 and 60 (the masculine word lists) and Topic 62 (the feminine word list).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 3 | Women <br> Men | 0.15 <br> -0.14 | women, woman, men, baby, pregnant, girls, men, doctor, health, birth | Content - Pregnancy | Women |<br>| Topic 10 | Women <br> Men | 0.10 <br> -0.12 | energy, body, feel, mind, space, yoga, love, beautiful, feeling, meditation | Content - Yoga | Women |<br>| Topic 49 | Women <br> Men | -0.21 <br> 0.17 | game, know, think, team, going, mean, play, year, one, good | Content - Sports | Men |<br>| Topic 71 | Women <br> Men | 0.14 <br> -0.14 | christmas, sex, girl, hair, love, get, date, girls, let, wear | Content - Dating | Women |<br>| Topic 54 | Women <br> Men | – <br> 0.12 | get, like, know, right, people, going, podcast, make, want, one | Discourse | Men |<br>| Topic 60 | Women <br> Men | -0.27 <br> 0.20 | going, know, think, get, got, one, really, good, well, yeah | Discourse | Men |<br>| Topic 62 | Women <br> Men | 0.33 <br> -0.28 | like, know, really, going, people, want, think, get, things, life | Discourse | Women |<br><br>Table 2: BERTopic with Contextual Embeddings (BERT, ChatGPT, Llama): The complete set of significant correlations between gender features and topic features for discourse topics only (content topics are omitted).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 0 | Women <br> Men | -0.08 <br> 0.10 | like, yeah, know, oh, right, podcast, got, going, think, really | Discourse | Men |<br>| Topic 2 | Women <br> Men | 0.08 <br> -0.08 | life, know, things, really, people, feel, like, want, love, going | Discourse | Women |<br>| Topic 5 | Women <br> Men | 0.08 <br> – | like, know, think, yeah, episode, really, going, anchor, kind, right | Discourse | Women |</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout with a single column. The text is justified and divided into sections and subsections, indicated by numbered headings. Important terms at the beginning of some paragraphs are bolded. A horizontal rule separates the header from the main content, and another rule separates the main content from the footnote at the bottom.<br><br>[paragraphs]<br><br>Preprint Version.<br><br>**Figure–Table Integration.** In addition to textual refinement, we extend the refinement process to include multimodal elements, to further enhance readability. For each section, the model first generates visualization requirements, such as tables with structured comparisons or figures with explanatory diagrams, together with natural language descriptions. Based on these descriptions, candidate figures and tables are synthesized. The compiled outputs are then fed back to an LLM for quality assessment, enabling automatic detection of issues such as oversized layouts or unreadable text. The LLM provides corrective suggestions, which are applied to improve the final visualizations. Finally, the text is refined again to ensure that all generated figures and tables are properly referenced within the survey.<br><br># 4 EXPERIMENTS<br><br>## 4.1 EXPERIMENTAL SETTINGS<br><br>**Implementation Details.** Following Wang et al. (2024b), we adopt **GPT-4o-mini** as our genera-tion model for its balance of responsiveness and cost. Our retrieval database contains 680K computer science papers from arXiv, with PDFs converted into structured Markdown using MinerU (Wang et al., 2024a) for consistent formatting. The details of the retrieval process are provided in App. A.1. In outline generation, the system consults 1000–1200 papers, with a maximum of 8 sections. For section drafting, each subsection retrieves up to 60 additional relevant papers, combined with those linked during outline generation. Finally, we apply two iterations of the review-and-refine loop to enhance coherence across sections and improve overall readability. Illustrative outputs compared with AutoSurvey are provided in App. A.8.<br><br>**Baselines.** We compare IterSurvey with a set of baselines, ranging from simple retrieval-augmented generation (Naive RAG), which directly drafts from retrieved documents, to more ad-vanced state-of-the-art systems. Specifically, we evaluate against AutoSurvey (Wang et al., 2024b), the first systematic framework for this task; SurveyForge (Yan et al., 2025), which combines heuris-tic outline generation based on the logical structures of human-written surveys with a memory-driven scholar navigation agent for high-quality retrieval; and SurveyGo (Wang et al., 2025), which em-ploys the LLM×MapReduce-V2 algorithm to address the long-context challenge. We also compare with SurveyX (Liang et al., 2025), which introduces an Attribute Tree-based outlining mechanism; however, due to access restrictions, we include SurveyX only in arena experiments. All methods are evaluated on the same retrieval database with generation hyperparameters aligned to their original settings for fairness.<br><br>## 4.2 AUTOMATIC EVALUATION RESULTS<br><br>**Evaluation Setup.** We employ multiple complementary protocols to evaluate the quality of gen-erated surveys. On the 20-topic suite from Wang et al. (2024b), we adopt multi-dimensional scoring with LLM-as-a-judge. Content quality is assessed along three dimensions: coverage, structure, and relevance followed from Wang et al. (2024b). Besides, citation quality is evaluated using the NLI-based protocol of Gao et al. (2023), reporting both recall and precision: _Citation Recall_ measures whether all statements in the generated text are fully supported by the cited passages, while _Citation Precision_ identifies irrelevant citations to ensure that references are pertinent and directly support the claims. To improve scoring stability and reliability, prompts are standardized and judges must pro-vide a rationale before assigning scores. For additional robustness, we aggregate outputs from three judge models: GPT-4o, Claude-3.5-Haiku, and GLM-4.5V.1 Full prompts are provided in App. A.7.<br><br>**Results.** The results on the 20 topics from Wang et al. (2024b) are reported in Tab. 1. Statistical significance was confirmed via paired t-tests, indicating that IterSurvey consistently outperforms baseline models ($p < 0.05$). We summarize the main observations below.<br><br>- **Overall superiority.** IterSurvey consistently outperforms all baselines across both content and citation quality, achieving the highest overall average score (4.75). This demonstrates that the proposed framework is effective and robust across multiple evaluation dimensions.<br><br>[page_number]<br><br>6<br><br>[footnotes]<br><br>1Specifically, we use `chatgpt-4o-latest`, `claude-3-5-haiku-20241022`, and `glm-4.5v`.</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/da32110a-c025-42cc-9483-3416b406d23e" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/cd54149f-bf28-432c-83c4-0ea461aadab8" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/9de005fa-fb69-40c2-a1b3-6569e2e6a5c9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/acb85d5a-1a20-4c71-a2fc-07760bab0c27" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>This is a single-column page containing mostly text, structured with section headings and bold inline subheadings. URLs are formatted in a monospaced font and hyperlinked.<br><br>[paragraphs]<br><br># A Image generation models<br><br>This section details the two diffusion image generation models used in this work, namely Stable Diffusion 1.4 and 1.5.<br><br>**Stable Diffusion 1.4** The Stable Diffusion model is a text-conditioned image generator model that combines an autoencoder with a diffusion model to create a latent diffusion model. The autoencoder encodes images into latent representations with a reduced dimensionality when compared to the input image, reducing the computational needs during the training phase. Text prompts, on the other hand, are encoded using a text encoder and are then cross-attended by the UNet backbone of the latent diffusion model. Finally, the loss is computed using a reconstruction objective between the noise added to the latent representation and the prediction made by the UNet.<br>Stable Diffusion 1.4 (https://huggingface.co/CompVis/stable-diffusion-v1-4) had several rounds of training on the LAION dataset (https://laion.ai/), with each round changing the input image dimension, aesthetic score, and the probability of dropping the text-conditioning to improve classifier-free guidance.<br><br>**Stable Diffusion 1.5** SD 1.5, in turn, has the same architecture and even the same starting point as 1.4, with the difference being how long the model was fine-tuned on top of SD 1.2. The 1.4 version is fine-tuned for 225 thousand steps at resolution 512x512 on “laion-aesthetics v2 5+” with a 10% probability of dropping the text-conditioning, and version 1.5 for 595 thousand steps.<br>As demonstrated in Section D Stable Diffusion 1.4 has better performance than 1.5 in our approach, therefore, we will adopt SD 1.4 for most of the experiments in this paper.<br><br># B Large language models<br><br>Here we give additional details on the large language models that we used in our experiments.<br><br>**Gemma** (Mesnard et al., 2024), trained on a diverse 6 Trillion token dataset comprising web documents, code and mathematical texts. We resorted to the 7 Billion parameter instruction-tuned decoder-only model, named _gemma-7b-it_ (https://huggingface.co/google/gemma-7b-it). This model uses a chat template, which we employ during inference.<br><br>**Llama 2** (Touvron et al., 2023), of which we used the 7 Billion parameter, pre-trained-only model, _Llama-2-7b_ (https://huggingface.co/meta-llama/Llama-2-7b-hf). This model was trained with a mix of publicly available data totalling 2 Trillion tokens. While its chat versions employ supervised fine-tuning and reinforcement learning with human feedback for alignment with human preferences in helpfulness and safety, the pre-trained-only model does not. This results in a less constrained model, but it may also cause it to disperse from the task at hand. Since this model is a pre-trained-only no chat template is needed.<br><br>**Mistral** (Jiang et al., 2023) fine-tuned on various HuggingFace instruction datasets. We resorted to the 7 Billion _Mistral-7B-Instruct-v0.2_ model (https://huggingface.co/mistralai/ Mistral-7B-Instruct-v0.2) and used the respective chat template during inference.<br><br>**Phi-2** (Gunasekar et al., 2023) is a compact 2.7 Billion model (https://huggingface.co/microsoft/ phi-2). Despite its size, it offers a competitive performance with respect to models several times its size. It was trained on 250 Billion tokens, obtained through a combination of NLP synthetic data created by GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by GPT-4. This model was not fine-tuned through reinforcement learning from human feedback and does not have guardrails.<br><br>**Model ranking**<br>A ranking of these models in terms of their performance can be found in the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) which assesses several LLMs that are trained under the same criteria and tested on the same benchmarks, including reasoning</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout, likely from a preprint server like arXiv. It features a title, author list with affiliations, an abstract, and the beginning of the "Introduction" section. A preprint notification ("Preprint. Under review.") is present at the bottom left. The text on the left margin ("arXiv:2502.01522v2 [cs.CV] 30 May 2025") is a vertical stamp typical of arXiv submissions.<br><br>[paragraphs]<br><br>arXiv:2502.01522v2 [cs.CV] 30 May 2025<br><br># Unpaired Deblurring via Decoupled Diffusion Model<br><br>**Junhao Cheng**$^1$, **Wei-Ting Chen**$^2$, **Xi Lu**$^1$, **Ming-Hsuan Yang**$^3$<br>$^1$Sun Yat-sen University $^2$ Microsoft $^3$ University of California, Merced<br>https://github.com/donahowe/UID-Diff<br><br>**Abstract**<br><br>Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. In favor of their ability to supplement missing details and generate aesthetically pleasing contents, recent works have applied them to image deblurring via training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose UID-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as structural features and blur patterns extractors separately. The features extracted by them will be used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task by leveraging unpaired blurred images from the target domain simultaneously. We further introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization capabilities of UID-Diff when encountering unknown blur patterns. Experiments on real-world datasets demonstrate that UID-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation in various challenging scenarios.<br><br># 1 Introduction<br><br>Dynamic blur occurs when the camera and subject move relative to each other during the exposure time, resulting in a smeared and blurred image. Deblurring, the process of removing the blur pattern while preserving the underlying structure of degraded images, is essential for restoring high-quality images for human perception and low-level computer vision applications.<br><br>With the rapid advancement of photographic technology, a wide range of imaging devices are now employed to capture images in real-world scenarios. Due to their diverse lenses and structural designs, these devices may produce distinct blur patterns [1, 2, 3]. This diversity makes it challenging to develop an all-in-one method for deblurring images from arbitrary and varied sources. Consequently, focusing on deblurring algorithms tailored to specific domains has become increasingly significant.<br><br>As deep learning has advanced in recent years, existing deblurring models predominantly build on data-driven approaches that employ neural networks trained via supervised learning on synthetic paired data. Existing works have made efforts to develop deblurring models upon CNN [4, 5], Transformer [6, 7], and GAN [8, 9]. Recently, a new wave of research [10, 11, 12] has begun to investigate the integration of pre-trained generative diffusion models [13], such as Stable Diffusion (SD) [14], with an adapter designed to provide structural guidance for deblurring. These approaches aim to harness the generative capabilities of diffusion models to supplement missing details and generate aesthetically pleasing outputs. However, since paired blurry-sharp training data is limited in<br><br>[footnotes]<br><br>Preprint. Under review.</div></details></td>
</tr>
</table>
### Overall Layout and Content Understanding
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/6b7aaab9-0d28-487f-a7fc-a7969e311bb9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/8324436d-40e4-4ef1-8a87-9d0502305f84" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/cd510868-b859-4d7c-964e-bda7a9b1d24e" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/aeb08dff-0938-4b88-b895-14dc2fa7dad1" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以“曲尼司特”为标题,整体采用浅蓝白色调,布局清晰,分为多个模块化区域,围绕中央的透明胶囊图像展开。右上角展示曲尼司特的化学结构式及其分子式 C₁₈H₁₇NO₄。<br><br>**1. 活性成分数据(左上)**<br>- 以环形图形式展示成分比例:<br> - 曲尼司特 >98%<br> - 辅料 <2%<br>- 图下方标注:“纯度高,临床级标准”<br><br>**2. 适应症(右上)**<br>- 通过三个图标及文字说明:<br> - 鼻部图标:过敏性疾病<br> - 皮肤纹理图标:纤维化<br> - 疤痕图标:瘢痕疙瘩<br><br>**3. 剂量矩阵(中左)**<br>- 表格形式,包含两列:“口服”和“频率”<br> - 成人:100mg / 次;频率:1-3 次 / 天<br> - 儿童:咨询医生;频率:遵医嘱<br><br>**4. 药代动力学时间轴(中右)**<br>- 折线图,横轴为时间(0h 至 24h),纵轴为浓度(无刻度)<br> - 0h:吸收开始(水滴图标)<br> - 1-2h:峰值浓度(山峰图标)<br> - 4-6h:分布/代谢(循环箭头图标)<br> - 24h:排泄(垃圾桶图标)<br>- 图中标注半衰期 ≈ 5-8h<br><br>**5. 警告网格(左下)**<br>- 分为四个象限,每个配有图标和文字:<br> - 相互作用:CYP酶抑制剂/诱导剂(齿轮图标)<br> - 副作用:胃肠道不适,皮疹(胃图标)<br> - 肝功能:定期监测(肝脏图标)<br> - 肾功能:慎用(肾脏图标)<br><br>**6. 患者适用性(中下)**<br>- 两个图标组合:<br> - 成人:人物图标 + 对勾,标注“成人 适用”<br> - 儿童:人物图标 + 问号 + 医生图标,标注“儿童 咨询医生”<br><br>**7. 储存指南(右下)**<br>- 三个图标并列:<br> - 温度计图标:2-25℃ 室温<br> - 密封瓶图标:密闭<br> - 遮光图标(太阳加斜线):避光<br><br>整体设计风格现代、专业,使用大量图标辅助理解,数据可视化清晰,适合医疗或药品宣传场景。所有文本均为中文,语言准确,无冗余描述。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents an augmented reality (AR) shopping experience overlaid on a real-world retail environment. The scene is set in a brightly lit cosmetics aisle of a store, with shelves stocked with beauty products visible in the background. In the foreground, a pair of hands holds a black rectangular compact labeled "ANASTASIA BEVERLY HILLS BROW POWDER DUO" with "EBONY" and "NET WT. 2.5 OZ." printed below. A gold ring is visible on the left hand’s ring finger, and a black wristband is partially seen on the left wrist.<br><br>Overlaid on the image are several semi-transparent, rounded-corner UI elements resembling AR pop-ups or digital cards, providing contextual information about the product and the user’s shopping list.<br><br>On the left side, a vertical panel titled "SHOPPING LIST" lists four items:<br>1. Face Wash — marked with an “X” (completed)<br>2. Shampoo — marked with an “X” (completed)<br>2. Eye Cream — marked with an empty checkbox (not completed; duplicated item number)<br>3. Eye Cream — marked with an empty checkbox (not completed)<br><br>This suggests a possible error or duplication in the list, with two entries for "Eye Cream".<br><br>In the center-right, a speech-bubble-shaped label displays the price: "$23.00".<br><br>To the right of the product, a larger panel titled "PRODUCT DETAILS:" provides information about the "ABH Brow Powder Duo". It features two color swatches:<br>- Left swatch: labeled "DEEP BROWN"<br>- Right swatch: labeled "BLACK"<br><br>Below the swatches, a star rating system shows four and a half filled stars, accompanied by the text "4.5 out of 5 stars".<br><br>Underneath the rating, a section titled "COMMON USES:" states: "DEFINES & FILLS BROWS".<br><br>Further down, a smaller rectangular box labeled "KEY INGREDIENTS" lists:<br>- Vitamin E<br>- Finely Milled Pigments<br><br>At the bottom right, another box titled "APPLICATION TIPS" includes a video icon (a rectangle with a play triangle) and the word "Video", indicating a multimedia tutorial is available.<br><br>The overall layout mimics an immersive AR interface, likely from a smart glasses or smartphone application, designed to enhance in-store shopping by providing instant, interactive product data directly within the user’s field of view. The visual style uses dark gray, translucent backgrounds with white text for high contrast and readability against the busy store backdrop. The design emphasizes usability, with clear categorization of information into distinct panels and intuitive icons. All textual content is in English, and no other languages are present.</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/8745f320-e406-47ab-aeb0-55e65c0322d8" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1b3765bb-bd50-42d7-921b-3819be625cd8" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/c4a703ea-56c0-4fa5-a9c4-012180795539" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/53b829cd-3df4-44a4-a518-52cfef916eee" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以深蓝色科技感背景为主,配以紫色和青色的电路板图案边框,营造出未来数字设备的视觉氛围。标题“谷歌最新血氧仪机型参数对比(社媒版)”位于顶部中央,使用发光白色字体,突出主题。整体布局为横向三栏式对比结构,左侧为参数类别标签列,中间及右侧分别为三款智能穿戴设备的参数详情。<br><br>左侧参数类别列以图标+文字形式垂直排列,包括:<br>- 芯片(图标为芯片符号)<br>- 电池(图标为电池符号)<br>- 功能(图标为心电波形符号)<br>- 重量(图标为秤盘符号)<br>- 价格(图标为价格标签符号)<br>- 发售时间(图标为日历符号)<br><br>中间三栏分别对应三款产品:<br>1. **高亮推荐机型:Google Pixel Pulse(最新推荐)** <br> - 标题上方有金色星形徽章“★ 高亮推荐机型”,并用金色边框高亮显示。<br> - 芯片:Tensor G4定制芯片<br> - 电池:7天续航,快充<br> - 功能:连续血氧监测,睡眠/压力追踪,AI健康指导<br> - 重量:28克(轻盈)<br> - 价格:¥1999<br> - 发售时间:2024年10月<br><br>2. **竞品A(例如:Apple Watch S9)** <br> - 芯片:S9 SiP芯片<br> - 电池:18小时(正常使用)<br> - 功能:按需血氧,心电图APP,摔倒检测<br> - 重量:32克<br> - 价格:¥3199<br> - 发售时间:2023年9月<br><br>3. **竞品B(例如:Garmin Venu 3)** <br> - 芯片:Elevated V5传感器<br> - 电池:14天(智能模式)<br> - 功能:全天候血氧,身体电量,GPS运动<br> - 重量:35克<br> - 价格:¥2499<br> - 发售时间:2023年8月<br><br>所有数据均采用清晰的横向分隔线组织,每项参数内容居中对齐,字体为简洁现代的无衬线体,颜色为浅蓝或白色,确保可读性。高亮推荐机型使用金色边框和更明亮的文字,形成视觉焦点。<br><br>底部有一行注释文字:“注:以上参数仅供参考,具体以官方发布为准。#科技 #健康 #谷歌新品 #血氧仪对比”,字体较小,颜色较暗,作为补充说明。<br><br>整体设计风格现代、科技感强,通过色彩对比、边框高亮和图标辅助,有效传达了各机型在关键性能指标上的差异,尤其突出了Google Pixel Pulse在续航、价格和功能集成方面的优势。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以复古手绘风格呈现,整体布局如一本打开的泛黄书页,背景为米黄色仿旧纸张质感,边缘带有不规则撕裂效果。标题“博物馆游览扩展内容与要点”位于顶部中央,字体为深棕色艺术字,两侧饰有卷曲花纹装饰,视觉上突出主题。<br><br>全图采用六点式结构化布局,围绕中心分布六个核心模块,每个模块均配有独立插画、编号标题和说明文字,通过装饰性边框、花环、丝带等元素进行区分与美化。整体设计风格温馨、文艺,融合了音乐符号、星星、薰衣草、云朵等点缀元素,营造出轻松愉悦的文化探索氛围。<br><br>各模块内容如下:<br><br>1. **沉浸式体验** <br> - 标题:“1. 沉浸式体验” <br> - 说明文字:“参与互动展览,感受历史场景还原,身临其境。” <br> - 视觉元素:左侧描绘一位金发男孩手持放大镜观察一个微缩历史街景模型(包含房屋、摊位和人物),上方有齿轮与灯泡组成的思考气泡,象征探索与发现。右侧配有一个系着粉色蝴蝶结的礼物盒,标签写有“SURPRISE”。<br><br>2. **主题讲座与工作坊** <br> - 标题:“2. 主题讲座与工作坊” <br> - 说明文字:“聆听专家深度解读,亲手制作手工艺品,学习新知。” <br> - 视觉元素:右侧展示一张木桌,桌上摆放陶壶、陶罐、刻刀等手工工具,旁边堆叠书籍与卷轴;周围环绕橄榄枝花环,上方悬挂一串风铃(含月亮、星星与铃铛),背景点缀云朵与星光。<br><br>3. **馆藏珍品探索** <br> - 标题:“3. 馆藏珍品探索” <br> - 说明文字:“寻找镇馆之宝,了解背后的故事与文化价值,深度挖掘。” <br> - 视觉元素:左侧是一个打开的木质宝箱,内有青铜鼎状文物与发光卷轴;旁有绿色玉璧吊坠、散落铜钱,以及一支点燃的白色蜡烛,烛台装饰有薰衣草与小花束。<br><br>4. **特色导览路线** <br> - 标题:“4. 特色导览路线”(置于米色丝带横幅中) <br> - 说明文字:“跟随定制路线,发现隐秘角落与独特视角,别样精彩。” <br> - 视觉元素:下方是一张展开的复古地图,标有拱门、凉亭、佛像、雕塑等景点,以红色虚线连接,并配有指南针图标,体现路径规划概念。<br><br>5. **数字化互动** <br> - 标题:“5. 数字化互动”(置于圆形波点边框内) <br> - 说明文字:“利用AR/VR技术,打破时空限制,体验虚拟现实。” <br> - 视觉元素:右侧描绘一位戴VR眼镜的人正在触控空中悬浮的陶罐图像,周围有Wi-Fi信号、数据图表、声波图等科技元素,体现数字交互场景。<br><br>6. **文化衍生品** <br> - 标题:“6. 文化衍生品” <br> - 说明文字:“选购独特纪念品,将博物馆记忆带回家,延续美好。” <br> - 视觉元素:左下角陈列多种文创商品,包括印有博物馆建筑图案的帆布袋(标有“MUSEUM”)、笔记本、明信片、徽章;右下角则是一盘精致三明治(面包上烙有五角星图案),配蓝莓与卷饼,旁有一只戴派对帽、系蓝色蝴蝶结的白鹅,口中喷出音符,充满童趣。<br><br>整张信息图通过图文结合的方式,系统介绍了博物馆参观的六大延伸活动,既传达实用信息,又兼具美学感染力,适合用于宣传册、教育海报或线上推广材料。所有文本均为中文,无英文或数字编码,语言风格亲切自然,符合大众传播需求。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/32f54d2b-c0cb-4cd0-9513-b17c10f30dab" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/3bdbdb5f-538c-4e2f-971d-7118f9409e96" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/0282904c-fe15-4eef-9bd1-5362beae14f4" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ef0e8194-3ab6-4d12-9b11-ae59068cc0a7" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为《国际货运代理业务流程与注意事项全指南》,旨在系统梳理国际货运代理从接单到交付的全流程操作规范,并重点提示各环节的核心注意事项与风险规避要点,帮助货运代理新人及外贸商家快速掌握业务逻辑,减少操作失误与纠纷。<br><br>整体设计采用“冰山模型”视觉隐喻:上方露出水面部分展示显性业务流程(共6个步骤),下方水下部分揭示隐性但关键的风险管理与合规要求(共3大注意事项)。背景为米白色纸张质感,边缘有撕纸效果,中央以蓝色渐变冰山贯穿上下,象征业务表层与深层的关联。顶部标题使用深棕色粗体字,配以浅橙色衬底条,副标题为黑色小号字体,简洁明了。<br><br>---<br><br>### 一、显性业务流程(冰山水上部分)——共6个步骤<br><br>每个步骤以圆角矩形卡片呈现,配有图标和箭头连接,形成线性流程图。卡片背景为浅米色,文字为深棕色,标题加粗。<br><br>**Step 1: 接单揽货**<br>- 标题:确认需求 签订委托<br>- 内容:<br> - 对接客户确认货物明细、运输时效、目的地<br> - 核对禁运/限运品,确认报价<br> - 签订正式委托合同,明确权责<br>- 图标:握手图标<br><br>**Step 2: 订舱配载**<br>- 标题:预定舱位 确认出运计划<br>- 内容:<br> - 选择承运商提交订舱申请<br> - 确认舱位后发放入仓通知<br> - 易碎品/危险品提前告知,申请特殊配载<br>- 图标:轮船+飞机图标<br><br>**Step 3: 报关报检**<br>- 标题:完成通关申报 确保合法出境<br>- 内容:<br> - 整理报关资料提交申报<br> - 法检货物提前报检获取通关单<br> - 反复核对信息避免错报漏报<br>- 图标:文件+印章图标<br><br>**Step 4: 货物运输**<br>- 标题:跟踪动态 同步异常<br>- 内容:<br> - 获取运输凭证发给客户<br> - 全程跟踪轨迹,异常情况第一时间告知<br> - 提醒客户购买运输保险<br>- 图标:卡车+地球图标<br><br>**Step 5: 到港清关**<br>- 标题:办理入境申报 完成货物放行<br>- 内容:<br> - 通知收货人准备清关资料提交申报<br> - 核算税费通知缴纳办理放行<br> - 核实特定品类特殊准入要求<br>- 图标:起重机+文件图标<br><br>**Step 6: 派送签收**<br>- 标题:完成交付 确认闭环<br>- 内容:<br> - 安排本地配送车辆送达指定<br> - 提醒收货人当场核验签收<br> - 货损少货第一时间留存证据启动理赔<br>- 图标:快递车图标<br><br>---<br><br>### 二、隐性注意事项(冰山水下部分)——共3大项<br><br>水下部分分为三层,由浅蓝到深蓝渐变,每层对应一个注意事项,左侧为总述,右侧为具体细则,通过虚线箭头连接。<br><br>**注意事项1:单证管理要求**<br>- 总述:所有单证需真实准确、留存备查<br>- 细则:<br> - 单单一致 单证一致(信息完全匹配)<br> - 提单信息反复核对(确认无误后出具)<br> - 单证留存≥2年(备查避免纠纷)<br><br>**注意事项2:风险防控要点**<br>- 总述:提前识别风险,减少业务损失<br>- 细则:<br> - 核实客户与货物资质(避免涉嫌违规)<br> - 提前做好特殊货物防护(温控、防震等)<br> - 异常情况同步留证(明确权责)<br><br>**注意事项3:合规操作要求**<br>- 总述:严守法规红线,避免合规风险<br>- 细则:<br> - 禁运品严禁承接(需有资质审批)<br> - 如实申报不得逃税(严禁低报虚报)<br> - 遵守制裁合规要求(不接违规业务)<br><br>---<br><br>### 三、视觉元素与数据编码<br><br>- **图表类型**:流程图 + 分层结构图(冰山模型)<br>- **数据编码方式**<br> - 步骤顺序:通过箭头连接表示时间或逻辑顺序。<br> - 注意事项层级:通过垂直分层与虚线箭头体现“主-次”关系。<br> - 信息密度:每个步骤/注意事项均包含标题、核心动作和子项列表,信息高度结构化。<br>- **图标系统**:每个步骤配有代表性的扁平化图标(如握手、轮船、文件、卡车等),增强可读性。<br>- **色彩方案**:主色调为米白、浅棕、蓝色系,营造专业、冷静、可靠的感觉。<br><br>---<br><br>该信息图结构清晰、内容详尽,既展示了国际货运代理业务的完整操作链路,又深入挖掘了隐藏在流程背后的管理、风控与合规维度,是极具实用价值的行业指导工具。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“支持向量机(SVM)扩展内容与要点”,整体采用柔和的紫色渐变背景,搭配手绘卡通风格的白色圆角矩形模块,视觉清新且富有教育性。图中通过四个主要模块和一个底部总结栏,系统介绍了SVM的核心扩展内容,各模块之间以虚线箭头连接,形成逻辑关联。<br><br>标题位于顶部中央,使用加粗黑体字并带有白色描边,突出显示主题。<br><br>**1. 基础概念模块** <br>位于左上角,标题为“基础概念”。图示展示蓝色圆形点与红色三角形点在二维空间中被一条实线分隔,两侧有虚线表示间隔区域。右侧有一个手持铅笔的卡通机器人,其对话框内写着“找到最佳分隔线!”。下方文字说明:“最大化间隔,分类数据。” 该部分强调SVM的基本目标是寻找能最大化分类间隔的最优超平面。<br><br>**2. 核技巧(Kernel Trick)模块** <br>位于右上角,标题为“核技巧(Kernel Trick)”。左侧为二维平面上无法线性分离的蓝点与红三角点;中间箭头指向右侧三维曲面图,其中数据点被映射到高维空间后可被平面清晰分割。右侧配有一支魔法棒和星星,标注“核函数”,象征核技巧的“魔法”作用。下方文字说明:“低维不可分 → 高维可分!”,直观解释了核技巧的核心思想。<br><br>**3. 软间隔(Soft Margin)模块** <br>位于左下角,标题为“软间隔(Soft Margin)”。图示与基础概念类似,但部分数据点越过了间隔边界,旁边放置一个橡皮擦图标,标注字母“C”,代表正则化参数。下方文字说明:“允许少量错误,提高泛化能力。” 该模块说明SVM在现实数据中引入松弛变量,以平衡分类准确率与模型泛化性能。<br><br>**4. 多分类(Multi-class)模块** <br>位于右下角,标题为“多分类(Multi-class)”。左侧展示两个类别的数据点(蓝色圆点与红色三角),机器人对话框标“1 vs Rest”,表示一对多策略;右侧展示两个类别分别独立处理,机器人对话框标“1 vs 1”,表示一对一策略。图示通过箭头连接至两个圆形区域,分别包含各自类别的数据点,最终合并为分类结果。下方文字说明:“策略:一对多 或 一对一。” 该部分阐明SVM如何扩展到多类别分类问题。<br><br>**5. 应用与优势栏** <br>位于底部横向长条,标题为“应用与优势”,左侧配有灯泡图标。右侧依次列出三个应用场景及优势,每个前均有对应图标:<br>- 📚 文本分类<br>- 🖼️ 图像识别<br>- 👍 小样本表现好<br><br>整体布局结构清晰,从基础到进阶,从二分类到多分类,从线性到非线性,层层递进。图中所有文字均为中文,使用简洁明了的术语,配合生动的插画,使复杂的机器学习概念易于理解。颜色编码用于区分不同类别(蓝色圆点、红色三角、绿色方块),图形元素如机器人、魔法棒、橡皮擦等增强了趣味性和记忆点。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/6cdaaf0c-93cd-42ff-af03-f85c91d369c7" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/6b209397-6f90-40b1-acb8-19053c2999ba" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/d271df25-6fb4-4e33-a6f8-12f65287d1f9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/cb134523-acda-4155-b847-c1da10f0b8f9" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“钢琴&古筝学习参考指南”,采用类似地铁线路图的视觉设计风格,以清晰的分支结构组织内容,通过不同颜色的线条(红色、蓝色、绿色)区分主要主题模块,并使用圆形节点表示关键信息点,整体布局简洁明了,背景点缀有灰色城市轮廓和浅蓝色水波纹装饰元素,增强现代感与层次感。<br><br>图表顶部中央为标题“钢琴&古筝学习参考指南”,字体加粗醒目。从标题节点向下延伸出一条红色主干线路,代表“钢琴vs古筝核心演奏技巧差异”这一核心对比主题。该红色线路在中间分叉为两条平行支线:左侧为“钢琴(西洋键盘乐器)”,右侧为“古筝(中国民族弹拨乐器)”。<br><br>在“钢琴(西洋键盘乐器)”支线下,依次列出四个关键特征节点:<br>- 手指+手腕+手臂协同发力<br>- 标准化指法逻辑<br>- 触键力度均匀、落点准确<br>- 出厂固定音高<br><br>在“古筝(中国民族弹拨乐器)”支线下,依次列出四个对应特征节点:<br>- 佩戴义甲演奏<br>- 按、颤、滑、摇多类技法<br>- 触弦角度30-45度<br>- 左手按弦控制音高<br><br>两条支线在中间交汇处用一个带有双向箭头符号的圆形节点连接,强调两者对比关系。<br><br>从红色主干线路下方引出一条蓝色线路,标记为“两类乐器通用高效练习法”。该蓝色线路包含四个节点:<br>- 热身练习:手指拉伸、音阶爬音<br>- 分节练习:4-8小节片段拆分<br>- 慢练优先:原速50%保证清晰度<br>- 录音复盘:针对性调整问题点<br><br>再从蓝色线路下方引出一条绿色线路,标记为“演奏常见误区避坑指南”。该绿色线路呈发散状,连接四个警告/提示节点,分别用禁止符号(⛔)或警示符号(⚠️)标识:<br>- ⛔ 不要塌腕:手腕与琴键/弦面持平<br>- ⚠️ 不要追求速度优先:音准、节奏、情感是核心<br>- ⛔ 不要忽视乐理学习:帮助理解曲目逻辑<br>- ⚠️ 不要长时间连续练习:每45分钟休息10分钟<br><br>所有文字内容均为中文,字体清晰易读,层级分明,通过颜色编码(红=对比、蓝=方法、绿=避坑)和图标辅助(箭头、禁止、警示),使信息传递高效直观。此图旨在为学习者提供系统性指导,涵盖乐器特性、科学练习方法及常见错误规避,兼具实用性和教育意义。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "Personal Insight & Casual Life Reflection Essay Collections: A Complete Guide" is presented in a tactile, clay-like 3D style with rounded edges and textured surfaces, giving it a handcrafted, approachable aesthetic. The background is a neutral light gray, which makes the colorful clay elements stand out. The title is displayed at the top in two lines: the first line "Personal Insight & Casual Life Reflection" is in a warm terracotta brown, while the second line "Essay Collections: A Complete Guide" uses a bright sky blue for "A Complete Guide," creating visual emphasis.<br><br>The infographic is organized as a grid with three main vertical sections (rows) and seven horizontal columns. The leftmost column contains the main categories, each labeled with a large blue clay tile with yellow text. These are:<br><br>1. **Core Traits of the Genre** – with a smaller brown clay subtitle: "Key traits of the personal insight & life reflection genre"<br>2. **Value for Readers & Writers** – with a smaller brown clay subtitle: "Why These Collections Matter"<br>3. **Actionable Engagement Steps** – with a smaller brown clay subtitle: "Simple steps for new readers and writers"<br><br>The remaining six columns form the header row, each represented by a colored clay tile with white text:<br><br>- **Infographic Multi-factor** (blue)<br>- **Statistics** (teal)<br>- **Quotes** (orange)<br>- **Key Terms** (green)<br>- **Style Preferences** (purple)<br>- **Layout Preferences** (light blue)<br>- **Other Requirements** (dark purple)<br><br>Each of the three main rows then has corresponding content under these headers. For the first two data points under each category (Statistics, Quotes, Key Terms), there is a white clay tile with a brown checkmark (√), indicating inclusion or relevance. For the last three headers (Style Preferences, Layout Preferences, Other Requirements), there is a white tile with a number "3" in the respective color of that column’s header.<br><br>Below each checkmark or number, there is a larger white clay tile with black text describing the specific point. The full structure is as follows:<br><br>---<br><br>### Row 1: Core Traits of the Genre<br><br>- **Statistics**: √ → *Rooted in real lived experience*<br>- **Quotes**: √ → *Inspired by small everyday moments*<br>- **Key Terms**: √ → *No formal structure rules*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 2: Value for Readers & Writers<br><br>- **Statistics**: √ → *Readers: Comfort & relatable validation*<br>- **Quotes**: √ → *Writers: Low-stakes creative outlet*<br>- **Key Terms**: √ → *All audiences: Record of personal growth*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 3: Actionable Engagement Steps<br><br>- **Statistics**: √ → *Step 1: Start with short essays to read*<br>- **Quotes**: √ → *Step 2: Write 100-word daily reflections*<br>- **Key Terms**: √ → *Step 3: Group by theme to build your collection*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (green) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>The design uses consistent visual encoding: checkmarks indicate presence or applicability, while the number "3" likely denotes a level of preference or importance (though no legend is provided). The colors are used both for column headers and for the numbers, creating a visual link between the header and its corresponding data point. All text is rendered in a clean, sans-serif font, and the overall layout is structured and easy to follow, with clear separation between rows and columns via subtle clay dividers. The infographic communicates a comprehensive yet accessible guide to engaging with personal insight and life reflection essays through a visually engaging and tactile design.</div></details></td>
</tr>
</table>
## Showcases (Applications)
> ✨ **Curious to see the model in action? Explore 100 generated examples in our full 👉 [ 🖼️ Infographic Model Showcases ](./u1_infographic_showcases.md) 👈 gallery!**
# SenseNova-U1-8B-MoT-Infographic 📊
**SenseNova-U1-8B-MoT-Infographic** 是在 U1-8B-MoT 模型基础上延长了 MT 阶段训练,并在 MT 与 SFT 阶段调整了理解和生成任务中的数据配比产生的模型。在 RL 阶段,我们进一步优化了 reward recipe,以减少生成信息图中非预期黑色背景的出现。
- **模型性能:** 相比基础版 **SenseNova-U1-8B-MoT** 模型,BizGenEval hard/easy 从 **39.8 / 61.1** 提升至 **46.6 / 65.4****+6.8 / +4.3 points**),IGenBench Q-ACC/I-ACC 从 **51.3 / 4.2** 提升至 **69.5 / 17.0****+18.2 / +12.8 points**),同时保持稳健的视觉理解能力,无明显退化。
- **生成质量:** 模型能够生成涵盖 100+ 种风格与布局的复杂信息图,具备更优的视觉美观度与文字渲染能力 —— 甚至能够渲染如 arXiv 风格页面等高密度小字。
## Benchmark Highlights
| Model | BizGenEval Avg. (hard / easy) ↑ | IGenBench Q-ACC ↑ | IGenBench I-ACC ↑ | OneIG(EN) ↑ | OneIG(ZH) ↑ |
| --- | ---: | ---: | ---: | ---: | ---: |
| ***Commercial Models*** ||
| Nano-Banana-Pro | 76.7 / 93.7 | 90.6 | 48.8 | 58.1 | 56.8 |
| Nano-Banana-2.0 | 68.5 / 92.5 | 85.6 | 34.4 | 54.0 | 54.9 |
| GPT-Image-1.5 | 35.9 / 81.6 | 55.0 | 12.0 | - | - |
| Qwen-Image-2.0 | 45.5 / 65.8 | 50.0 | 3.0 | 54.1 | 50.9 |
| Seedream-4.5 | 30.1 / 66.2 | 61.0 | 6.0 | 56.4 | 55.0 |
| ***Open-source Models*** ||
| **SenseNova-U1-8B-MoT-Infographic** | **46.6 / 65.4** | **69.5** | **17.0** | **55.6** | **53.3** |
| **SenseNova-U1-8B-MoT** | 39.8 / 61.1 | 51.3 | 4.2 | 54.5 | 53.8 |
| Z-Image | 8.2 / 43.8 | 30.0 | 1.0 | 54.6 | 53.5 |
| Qwen-Image-2512 | 6.3 / 41.0 | 32.2 | 1.0 | 53.0 | 51.5 |
| Qwen-Image | 2.8 / 23.8 | 36.0 | 0.0 | 53.9 | 54.8 |
| Bagel | 2.0 / 3.7 | 4.9 | 0.0 | 36.1 | 37.0 |
<sub>IGenBench 分数以百分制展示。Commercial 与 open-source 组内模型按照 BizGenEval hard、BizGenEval easy、IGenBench Q-ACC、IGenBench I-ACC 四项算术平均值排序。OneIG 作为通用生成能力参考。完整分项结果建议放在 Hugging Face model card 中。</sub>
## 生成质量对比
我们在以下五个关键维度上,对基础模型 **SenseNova-U1-8B-MoT** 与微调后的 **SenseNova-U1-8B-MoT-Infographic** 模型进行了定性对比:背景稳定性、图表准确性、文字渲染准确性、论文渲染质量,以及综合美观度。
### 背景稳定性 (Background Stability)
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/b2e4a010-921b-4d34-9ded-2f224a18ad16" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/318743df-6ee8-4ac8-960c-a4f75d2c51f3" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1804c1c1-6a5d-4608-b39d-2eea01c5fa6b" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/746f3604-c81a-4e36-8f81-a58606575ac5" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“版权视觉概览”,整体采用横向分栏布局,分为上下两个主要部分。上半部分为视觉化概览区,由四个彩色矩形区块并列组成,每个区块通过图标和简短标题传达一个核心概念;下半部分为“【版权基础常识】”详细解释区,包含四个编号条目,对应上半部分的四个主题,提供更详尽的文字说明。<br><br>**上半部分:版权视觉概览**<br><br>此区域由四个水平排列的彩色方块构成,从左至右依次为浅蓝色、浅黄色、浅绿色和浅紫色,每个方块内含一组图标和下方的中文标题。<br><br>1. **第一块(浅蓝色):创作即产生**<br> * **图标**:左侧是一个发光的灯泡,中间是一个带有笔的文档图标,右侧是一个锁头图标,三者之间用箭头连接,表示“创意 → 创作 → 保护”的流程。<br> * **文字**<br> * 图标下方有小字“自动保护”。<br> * 方块底部有大字标题“创作即产生”。<br><br>2. **第二块(浅黄色):核心权利**<br> * **图标**:中心是一只手掌向上托举,上方有多个元素围绕:一个带©符号的圆圈、一个喇叭、一堆金币和美元符号、以及多个指向不同方向的箭头,象征权利的多种表现形式和收益。<br> * **文字**<br> * 图标下方无额外小字。<br> * 方块底部有大字标题“核心权利”。<br><br>3. **第三块(浅绿色):特定条件平衡**<br> * **图标**:一个天平,左侧托盘上有打开的书本和标有“NEWS”的麦克风,代表“合理使用”;右侧托盘上有一个带锁的文件夹,代表“受控作品”。天平向右侧倾斜。<br> * **文字**<br> * 左侧托盘下方标注“合理使用”。<br> * 右侧托盘下方标注“受控作品”。<br> * 方块底部有大字标题“特定条件平衡”。<br><br>4. **第四块(浅紫色):保护期限**<br> * **图标**:左侧是一个沙漏,中间是一个向右的粗箭头,右侧是一个墓碑(顶部有十字架)。沙漏下方还有一个时钟图标。<br> * **文字**:<br> * 墓碑旁标注“作者有生之年 + X年”。<br> * 方块底部有大字标题“保护期限”。<br><br>**下半部分:【版权基础常识】**<br><br>此区域位于上半部分下方,背景为白色,包含四个独立的文本框,每个文本框都有一个彩色标题栏和下方的详细说明文字,颜色与上半部分对应。<br><br>1. **1. 自动获得保护**<br> * **标题栏**:蓝色背景,白色文字“1. 自动获得保护”。<br> * **正文**:“作品创作完成之时起,即自动享有版权,无需登记(登记主要是举证)。”<br><br>2. **2. 核心权利**<br> * **标题栏**:橙黄色背景,白色文字“2. 核心权利”。<br> * **正文**:“包括人身权(如署名权、修改权)和财产权(如复制权、发行权、信息网络传播权,可许可或转让获利)。”<br><br>3. **3. 合理使用**<br> * **标题栏**:绿色背景,白色文字“3. 合理使用”。<br> * **正文**:“在特定条件下(如教学、新闻报道、个人学习等),可以不经许可、不支付报酬使用,但需指明作者和出处,且不得侵犯其他权利。”<br><br>4. **4. 保护期限**<br> * **标题栏**:紫色背景,白色文字“4. 保护期限”。<br> * **正文**:“一般为作者有生之年加死后50年(中国大陆等多数地区),期限届满后进入公有领域。”<br><br>**整体风格与数据编码**<br>该信息图采用扁平化设计风格,色彩鲜明且分区清晰。通过颜色编码(蓝、黄、绿、紫)将四个主题进行视觉区分,并在上下两部分保持一致。图标作为主要的数据可视化手段,直观地表达了抽象概念。所有文字均为简体中文,内容结构严谨,逻辑清晰,旨在以图文结合的方式普及版权基础知识。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以中文为主要语言,采用横向四格布局,清晰呈现一个品牌从衰落到复兴的四个关键阶段。整体风格为手绘卡通插画,色彩柔和,线条简洁,具有亲和力和叙事性。每个阶段由上方的标题、中间的插图和下方的文字说明三部分构成,通过虚线分隔,结构分明。<br><br>第一阶段标题为“1. 曾经的辉煌与没落”,插图描绘了一座破败的城堡,城堡上挂着悲伤的表情,周围散落着皇冠,象征昔日荣耀的消逝;旁边立有标牌“OLD BRAND”,背景中可见大本钟,暗示传统或历史品牌。下方文字说明:“曾经是市场领导者,但未能跟上时代步伐,逐渐被遗忘,面临生存危机。”<br><br>第二阶段标题为“2. 创新与重塑”,插图展示四人团队围坐讨论,其中一人指向白板上的绿色叶子标志设计,周围环绕齿轮、灯泡(代表创意)和标牌“NEW IDEAS”。下方文字说明:“进行深度市场调研,重新定位品牌,引入创新设计和数字化策略,重塑核心价值。”<br><br>第三阶段标题为“3. 成功翻盘”,插图包含一只浴火重生的凤凰,象征涅槃;右侧是上升趋势的柱状图,下方是一个带有爱心的包裹,代表产品交付;一群欢呼的人群表达喜悦。下方文字说明:“凭借新产品和新形象重获消费者信任,业绩逆势上扬,重新赢得市场份额。”<br><br>第四阶段标题为“4. 未来展望”,插图描绘一枚火箭从地球轨道发射升空,周围有星星、云朵和一片绿叶,象征可持续发展;下方横幅写着“FUTURE READY”。下方文字说明:“持续创新,关注可持续发展和用户连接,立志成为更具影响力的未来品牌。”<br><br>整个信息图通过视觉隐喻(如城堡、凤凰、火箭)和数据图表(柱状图)结合,生动讲述了一个品牌从危机到复兴的完整故事,强调创新、用户信任和可持续发展的重要性。所有文本均为简体中文,无英文以外的其他语言。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/c07fe790-3d5f-4842-8515-e791504b86ca" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/7b243a49-5c4f-44af-9c10-8f0402bbadf9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ee66732e-7b2d-4e5e-b4f7-7ab3b326cff1" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/d672d362-709c-47aa-b12e-f32a79c07434" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "College Entrance Pathway Reforce Comparison" presents a structured comparison of key aspects for prospective students in Guangdong, China, aiming to enter college through a specialized entrance examination. The layout is organized as a multi-column table with four main columns: "Content Item / Evaluation Criteria", "Statistics", "Quotes", and "Key Terms". Each row corresponds to a distinct evaluation criterion or step in the preparation process, with visual icons, text, and data points enhancing clarity.<br><br>The infographic uses a clean, minimalist design with black line art icons on a light beige background. Text is primarily in bold sans-serif font, with headings emphasized for readability. Data is encoded using icons (e.g., graduation cap, calendar, books, target, rocket) to visually represent concepts, while numerical values are explicitly labeled for precision.<br><br>The first row addresses **Eligibility Criteria**:<br>- In the "Statistics" column, it features an icon of a person checking a map of Guangdong with the text: "Official Eligibility Requirements Confirm if you qualify to register".<br>- The "Quotes" column lists three eligible groups with corresponding icons: "Final-Year Guangdong Junior College Student", "Guangdong Resident <2 Years Post Graduation", and "Eligible Retired Military Personnel".<br>- The "Key Terms" column shows a magnifying glass over a document with the label: "Eligibility Verification".<br><br>The second row covers **Exam Structure & Scoring Breakdown**:<br>- "Statistics" displays icons representing different test types and scores: 100 pts (graduation cap), 200 pts (person at desk), 1000 pts (document with pen), 150 pts (document with pen). Below: "Total 500 points across 4 test papers".<br>- "Quotes" lists four subject components in document-shaped boxes: "Political Theory (100 pts)", "Major-Aligned Public Subject (100 pts)", "Professional Subject 1 (150 pts)", "Professional Subject 2 (150 pts)".<br>- "Key Terms" includes a balance scale icon with "Score Distribution".<br><br>The third row details the **Official Annual Exam Timeline**:<br>- "Statistics" contains a horizontal timeline with icons of a calendar and clock, labeled "Annual Key Timeline".<br>- "Quotes" provides a detailed timeline: Jan: Registration Open → Jan: Admission Open → Mid-Mar: Exam Date → Mid-Apr: Score Release → May-Jun: Admission Offers.<br>- "Key Terms" shows a calendar and clock with "Critical Dates".<br><br>The next three rows outline a three-step preparation strategy:<br><br>**Step 1 - Confirm Target Major & Institution**:<br>- "Statistics": Icon of a person holding a map with a target, text: "Confirm your target 6 months in advance".<br>- "Quotes": Two bullet points: "Download official exam syllabi and past professional subject papers from the target institution’s admission portal" and "Cross-verify that your junior college major meets the target major’s prerequisite requirements".<br>- "Key Terms": Clock and books with "Target Selection".<br><br>**Step 2 - Public Subject Foundation Building**:<br>- "Statistics": Icon of a person studying with books and a coffee cup, text: "Complete 3 months of structured public subject study".<br>- "Quotes": Two bullet points: "Complete 5+ years of past public subject exam papers to identify recurring test points" and "Political Theory allocates 30% of total score to current affairs from the past calendar year".<br>- "Key Terms": Box with lightbulb and "Core Knowledge".<br><br>**Step 3 - Professional Subject Sprint Revision**:<br>- "Statistics": Icon of a running person with a book and clock, text: "Focus on high-weight professional subjects in the final 2 months".<br>- "Quotes": Two bullet points: "Practice past professional subject papers from your target institution and review core major textbooks" and "60% of professional subject questions are repeated or adapted from past 3 years of papers for most institutions".<br>- "Key Terms": Trophy and gears with "Intensive Review".<br><br>Red horizontal lines separate the first three criteria from the three-step strategy, while a blue line separates Step 1 from Steps 2 and 3, visually grouping related content. All textual information is preserved exactly as presented, including spelling variations like "Oficial" (likely intended as "Official"). The infographic serves as a strategic roadmap combining official requirements, scoring details, timelines, and actionable preparation steps for candidates.</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "12-Month Market Performance: US vs. Asia" presents a structured, puzzle-piece-based visual analysis comparing the performance of US and Asian equity markets over a 12-month period. The layout is organized into three main steps, arranged in a central vertical flow with interconnected puzzle pieces, emphasizing a modular, analytical approach to market comparison. The design uses clean black-and-white line art with light blue accents for key sections, icons for visual representation, and clear typography for readability.<br><br>**Step 1** (top center) introduces the scope of the analysis. It features an illustration of four people examining charts, symbolizing data analysis. To the right, it defines the market indices being compared:<br>- **US Markets**: S&P 500, NASDAQ<br>- **Asian Markets**: Nikkei 225, Hang Seng, KOSPI, CSI 300<br><br>It also lists the types of data analyzed:<br>- Trailing Return (represented by a rising bar chart icon)<br>- Average Daily Volume (represented by a stacked bar chart icon)<br>- Top Sector Return (represented by a pie chart icon)<br><br>**Step 2** (left side, labeled "Metrics that account for 72% of short-term S&P 500 volatility") focuses on US Market Core Driving Indicators. This section contains icons representing industry (factory), finance (bank building), money (hand holding dollar sign), and labor (worker in hard hat). Below these icons, a light blue banner reads "US Market Core Driving Indicators". Specific metrics are listed with red warning triangle icons:<br>- CPI YoY: 3.2%<br>- Federal Funds Rate: 5.25–5.5%<br>- Non-farm Payrolls: +187k July 2024<br><br>**Step 3** (right side, labeled "Metrics that predict 68% of MSCI Asia Ex-Japan 3-month forward returns") focuses on Asian Market Core Leading Indicators. This section includes icons for shipping (container), manufacturing (gears), and calculation (calculator). A light blue banner below reads "Asian Market Core Leading Indicators". Specific metrics are listed:<br>- Manufacturing PMI: 51.2 (with red warning triangle)<br>- Q2 Export Growth: +6.8% YoY (with red warning triangle)<br>- Avg Policy Rate: 3.1% (with information circle icon)<br><br>At the bottom center, a large puzzle piece titled "Policy Shifts & Market Volatility Correlation" displays a line graph with two fluctuating lines:<br>- **US VIX (navy line)** — representing US market volatility<br>- **Asian Avg Volatility (green line)** — representing average Asian market volatility<br><br>Arrows connect the two lines, indicating correlation. Below the graph, key insights are provided with red warning triangles:<br>- Rate hike impact: +27% US VIX<br>- Trade policy impact: +34% Asian VIX<br>- Cross-regional sell-off correlation: 0.68<br><br>The overall structure visually represents how US and Asian market performances are driven by distinct but interrelated economic indicators, with a central focus on their volatility dynamics and policy impacts. The use of puzzle pieces metaphorically suggests that these components fit together to form a complete picture of global market trends. The infographic employs consistent iconography, color-coding (red for warnings, blue for core sections), and clear textual labeling to convey complex financial data in an accessible format.</div></details></td>
</tr>
</table>
### 图表准确性 (Chart Accuracy)
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/43b9d92d-8c06-4eb6-944f-6a822403182c" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/88db7fe6-2b8b-494f-8eb8-979d38d6edbd" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/f2528515-c414-4db5-9c63-36fd2400021f" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/269e777a-2a21-49f2-82a2-fbceedb08027" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create a vintage-styled ornithology infographic with a light beige background and dark tree silhouettes framing the left and right sides. At the top, center-aligned, place the main title 'Alter bringt Vorsicht' in a bold, serif font, followed by the subtitle 'Sekunden pro Minute für Suchen vs. Picken bei Staren verschiedener Altersgruppen'. The central visualization is a diverging horizontal bar chart featuring age groups vertically listed in the middle: 'Jungvogel', 'Einjähriger Jungvogel', 'Zweijähriger Altvogel', 'Ausgewachsener Altvogel', and 'Senior'. For each group, a Rust Red bar extending to the right represents 'Picken', while a Sepia Tone bar extending to the left represents 'Suchen'. Numerical labels are placed inside each bar near the outer edge. Below the chart, two x-axes ranging from 0 to 40 reflect the time in seconds for both directions. A legend in the bottom right corner displays square markers for 'Picken' (Rust Red) and 'Suchen' (Sepia Tone). The given data is : [{"age_group": "Jungvogel", "pecking": 20, "scanning": 10}, {"age_group": "Einjähriger Jungvogel", "pecking": 30, "scanning": 20}, {"age_group": "Zweijähriger Altvogel", "pecking": 35, "scanning": 25}, {"age_group": "Ausgewachsener Altvogel", "pecking": 28, "scanning": 30}, {"age_group": "Senior", "pecking": 20, "scanning": 35}]</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: right; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic with the title 'Ajustes de Velocidad del Sprint' and the subtitle 'Desviación de alcance del equipo backend: Trabajo planificado vs. Corrección de errores no planificados' centered at the top. The central visual is a vertical bar chart set against a light gray background, featuring a horizontal baseline at the zero-point mark. Positive values (planned work) are represented by green bars extending upwards, while negative values (unplanned bug fixes) are represented by yellow bars extending downwards. Each bar is topped or bottomed with multi-line text containing the module name and its value in 'pts'. Specifically, the bars are labeled 'Autenticación de usuario' (50 pts), 'API de pago' (35 pts), 'Sincronización de inventario' (22 pts), 'BD de analítica' (14 pts), 'Servicio de correo' (-10 pts), 'Webhooks' (-28 pts), 'Carga de multimedia' (-42 pts), and 'Pasarela heredada' (-70 pts). An illustration of mechanical gears is positioned in the upper-right area, and a cluster of small bug icons is placed in the lower-left area. The vertical axis shows tick labels ranging from +60 pts to -80 pts. The given data is : [{"category": "Autenticación de usuario", "unit": "pts", "value": 50}, {"category": "API de pago", "unit": "pts", "value": 35}, {"category": "Sincronización de inventario", "unit": "pts", "value": 22}, {"category": "BD de analítica", "unit": "pts", "value": 14}, {"category": "Servicio de correo", "unit": "pts", "value": -10}, {"category": "Webhooks", "unit": "pts", "value": -28}, {"category": "Carga de multimedia", "unit": "pts", "value": -42}, {"category": "Pasarela heredada", "unit": "pts", "value": -70}]</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/bec1bd39-72bf-4bed-abdc-df02ab2c0fec" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/5ac254e1-b1c4-4530-9070-4fb4774451b4" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/38ae1611-e9c8-4d97-a867-e2cc43ced2ad" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/7f0dab24-e67a-4ee9-b9c0-ee97e283ac9f" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that features a title and a subtitle centered at the top, reading 'Fastest Cuisines to Prepare' and 'Average Ghost Kitchen Handover Time by Item Type (Minutes)' respectively. The main visual is a horizontal grouped bar chart combining a Fast-food neon visual style with checkerboard borders along the edges, featuring a centered legend above the chart area for 'QuickEats' (cyan neon border) and 'DashNow' (orange neon border). To the bottom right of the bar chart, there is a simple illustration of two mopeds waiting for orders. The chart's vertical axis lists four categories, each preceded by a simple icon, while the horizontal axis represents handover time in minutes with numerical labels at 0, 5, 10, 15, and 20, supplemented by dotted vertical gridlines. Each category features a pair of black bars representing the two platforms, with exact values displayed directly inside the right end of each bar. For 'Classic Tacos', QuickEats takes 10.0 minutes while DashNow takes 11.5 minutes. 'Supreme Burritos' require the longest preparation, with 17.5 minutes for QuickEats and 19.0 minutes for DashNow. 'Spicy Nachos' take 9.5 minutes on QuickEats and 10.0 minutes on DashNow. Finally, 'Mini Quesadillas' are the fastest, taking 8.0 minutes for QuickEats and 8.5 minutes for DashNow. The given data is : [{"category": "Classic Tacos", "platform": "QuickEats", "unit": "Minutes", "value": 10.0}, {"category": "Classic Tacos", "platform": "DashNow", "unit": "Minutes", "value": 11.5}, {"category": "Supreme Burritos", "platform": "QuickEats", "unit": "Minutes", "value": 17.5}, {"category": "Supreme Burritos", "platform": "DashNow", "unit": "Minutes", "value": 19.0}, {"category": "Spicy Nachos", "platform": "QuickEats", "unit": "Minutes", "value": 9.5}, {"category": "Spicy Nachos", "platform": "DashNow", "unit": "Minutes", "value": 10.0}, {"category": "Mini Quesadillas", "platform": "QuickEats", "unit": "Minutes", "value": 8.0}, {"category": "Mini Quesadillas", "platform": "DashNow", "unit": "Minutes", "value": 8.5}]</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that presents a centered title at the top, stating "Übertaktet vs. Standard-Takt", with the subtitle "Temperaturanstieg bei langen Gaming-Sessions" directly below it. The main visual is a line chart spanning the width of the infographic on a dark background, embodying a Gamer Aesthetic with vibrant RGB neon accents. This chart has a vertical axis on the left labeled with numerical values in increments of 10 from 30°C to 100°C, and a horizontal axis at the bottom with time labels: '0m', '15m', '30m', '45m', '60m', '75m', '90m', '105m', and '120m'. Horizontal grid lines mark each 10°C increment. A horizontal legend is positioned under the subtitle, containing a cyan circular marker and line for "Standard-Takt" and a magenta circular marker and line for "Übertaktet (+150MHz)". Two data series are plotted as glowing neon lines with hollow circular markers at each data point, accompanied by gradient shading below each line. The cyan "Standard-Takt" line shows a steep rise from 38°C at 0m to 68°C at 15m, followed by a flat plateau reaching 73.5°C at 120m. The magenta "Übertaktet" line displays a similar initial spike from 42°C to 75°C, but continues with a gradual linear creep up to 93°C at 120m. Spike annotations (callout boxes) point to the final data points on the right, highlighting the peak temperatures: a magenta box reads "Peak: 93°C" and a cyan box reads "Peak: 73.5°C". A stylized thermometer line-art icon is subtly placed in the center of the chart's background. The given data is : [{"profile": "Standard-Takt", "temperature": 38, "time": "0m"}, {"profile": "Übertaktet", "temperature": 42, "time": "0m"}, {"profile": "Standard-Takt", "temperature": 68, "time": "15m"}, {"profile": "Übertaktet", "temperature": 75, "time": "15m"}, {"profile": "Standard-Takt", "temperature": 71, "time": "30m"}, {"profile": "Übertaktet", "temperature": 79, "time": "30m"}, {"profile": "Standard-Takt", "temperature": 72, "time": "45m"}, {"profile": "Übertaktet", "temperature": 82, "time": "45m"}, {"profile": "Standard-Takt", "temperature": 72.5, "time": "60m"}, {"profile": "Übertaktet", "temperature": 85, "time": "60m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "75m"}, {"profile": "Übertaktet", "temperature": 87, "time": "75m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "90m"}, {"profile": "Übertaktet", "temperature": 89, "time": "90m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "105m"}, {"profile": "Übertaktet", "temperature": 91, "time": "105m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "120m"}, {"profile": "Übertaktet", "temperature": 93, "time": "120m"}]</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/2387f67f-2a65-4001-b579-9c1264f4796f" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/5d8c92b4-4066-4f5d-af77-0f71fd86e51a" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1b0ea5e1-b241-4902-9bc0-485229f60592" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/d50977de-d5ab-42fd-bcfe-a2001b18fbcb" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that displays data in a vertical diverging bar chart format. At the top left of the visualization, there is a title: 'Anomalie de l'Atlantique Sud : Dérive magnétique', and a subtitle: 'Vecteurs de dérive vers l'est et l'ouest en kilomètres par rapport à la ligne de base historique'. In the upper left area below the text, an icon of a compass rose is placed within a magnetic field line curve. The main chart features a horizontal zero-axis line, labeled with a '0' on the far left, representing the historical coordinate baseline. The x-axis at the bottom displays the decades '1980', '1990', '2000', '2010', and '2020', each marked with a small vertical tick. For each decade, a vertical bar extends from the zero-axis, with its corresponding data label positioned directly at the end of the bar. The data shows westward drift represented by blue bars extending below the axis for '1980' with a value of '-15 km' and '1990' with a value of '-32 km'. Eastward drift is represented by red bars extending above the axis for '2000' with a value of '+10 km', '2010' with a value of '+45 km', and '2020' with a value of '+68 km'. The overall visual style mimics a geophysical science journal, utilizing compass red and blue color tones. The given data is : [{"decade": "1980", "drift_km": -15}, {"decade": "1990", "drift_km": -32}, {"decade": "2000", "drift_km": 10}, {"decade": "2010", "drift_km": 45}, {"decade": "2020", "drift_km": 68}]</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic in a corporate report minimalism style with muted corporate grays and blues, featuring a large title, 'Seasonal Fluctuations in 15-Year Mortgages', at the top. Directly below it is a subtitle, 'Historical prepayment velocities showing seasonal housing market trends'. Underneath the subtitle, a horizontal legend identifies two categories with small square icons: 'Spring/Summer Originations' in lighter gray-blue and 'Fall/Winter Originations' in darker gray-blue. The main visual is a multi-line chart in a wide landscape orientation. The vertical axis has numeric labels at 0.0, 5.0, 10.0, 15.0, and 20.0, with horizontal grid lines extending across the plot. The horizontal axis features labels: 'Jan 2018', 'Apr', 'Jul', 'Oct', 'Jan 2019', 'Apr', and 'Jul'. An icon depicting a sleek house silhouette is positioned in the upper left corner of the chart's plotting area. Two distinct lines represent the categories, characterized by cyclical seasonal bumps in the summer months. Both lines have square markers at each data point, with numerical values displayed near them. The lighter line for 'Spring/Summer Originations' plots a value of 8.0 in Jan 2018, rising to 12.5 in Apr, peaking at 16.0 in Jul, dipping to 11.0 in Oct, dropping further to 7.5 in Jan 2019, climbing to 13.0 in Apr, and reaching 17.5 in Jul. The darker line for 'Fall/Winter Originations' mirrors this pattern, starting at 6.5 in Jan 2018, increasing to 9.0 in Apr, hitting 14.5 in Jul, falling to 10.0 in Oct, bottoming out at 6.0 in Jan 2019, rising to 10.5 in Apr, and ending at 15.0 in Jul. The given data is : [{"category": "Spring/Summer Originations", "date": "2018-01", "value": 8.0}, {"category": "Fall/Winter Originations", "date": "2018-01", "value": 6.5}, {"category": "Spring/Summer Originations", "date": "2018-04", "value": 12.5}, {"category": "Fall/Winter Originations", "date": "2018-04", "value": 9.0}, {"category": "Spring/Summer Originations", "date": "2018-07", "value": 16.0}, {"category": "Fall/Winter Originations", "date": "2018-07", "value": 14.5}, {"category": "Spring/Summer Originations", "date": "2018-10", "value": 11.0}, {"category": "Fall/Winter Originations", "date": "2018-10", "value": 10.0}, {"category": "Spring/Summer Originations", "date": "2019-01", "value": 7.5}, {"category": "Fall/Winter Originations", "date": "2019-01", "value": 6.0}, {"category": "Spring/Summer Originations", "date": "2019-04", "value": 13.0}, {"category": "Fall/Winter Originations", "date": "2019-04", "value": 10.5}, {"category": "Spring/Summer Originations", "date": "2019-07", "value": 17.5}, {"category": "Fall/Winter Originations", "date": "2019-07", "value": 15.0}]</div></details></td>
</tr>
</table>
### 文字渲染准确性及大小合适性 (Text Rendering Accuracy)
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/e293eea7-6ec3-44a0-80a8-8ab2760bf6a7" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1ce99b25-eb09-4f4c-b8a5-76f27571e1a3" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/24a21e2e-4974-4b7a-98f7-e3ae49f750fa" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ebe69047-e668-4e14-83b3-df6261a92ca7" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以手绘笔记本风格呈现,标题为“吉伊卡哇带你游:加泰罗尼亚国家艺术博物馆(MNAC)三天两夜不绕路攻略”,副标题为“行程路线与时间安排(中文清晰版)”。整体采用暖黄色调背景,搭配棕色边框和螺旋装订线设计,营造出温馨可爱的旅行手册氛围。内容分为三个主要垂直区块,分别对应DAY 1、DAY 2、DAY 3,每个区块顶部有圆形时钟图标和“DAY X”标签,结构清晰。<br><br>每个日期区块内均以时间轴形式列出具体行程,使用圆点连接时间点与活动描述,右侧配有吉伊卡哇系列的可爱卡通形象插画(如白熊、蓝猫、兔子等),增强趣味性。所有文字均为简体中文,字体清晰易读,视觉层次分明。<br><br>---<br><br>**DAY 1:抵达与初探**<br>- **10:00** 抵达巴塞罗那,酒店办理入住 (Poble Sec区) —— 配有白熊拖着行李箱的插画。<br>- **12:00** 午餐:西班牙Tapas —— 插画未显示。<br>- **14:00** 前往西班牙广场 (Plaza de España),远眺MNAC全景 —— 配有西班牙广场建筑插画及地图箭头。<br>- **16:00** 参观MNAC外部建筑与周围花园 —— 配有蓝猫在花丛中跳跃的插画。<br>- **19:00** 欣赏魔幻喷泉表演 (Magic Fountain) —— 配有带闪光效果的白熊插画。<br>- **20:30** 晚餐:附近餐厅 —— 插画未显示。<br><br>---<br><br>**DAY 2:MNAC深度艺术之旅**<br>- **09:30** 早餐,步行至MNAC入口 —— 配有白熊吃面包的插画。<br>- **10:00** 进入MNAC (建议提前购票),参观罗马式艺术馆藏 —— 配有古典油画插画。<br>- **12:30** 馆内简餐或附近午休 —— 插画未显示。<br>- **14:00** 参观哥特式、文艺复兴及巴洛克艺术馆藏 —— 配有蒙娜丽莎风格肖像画插画及蓝猫形象。<br>- **16:30** 探索现代艺术馆藏 (加泰罗尼亚现代主义) —— 配有抽象艺术风格插画。<br>- **18:30** 前往MNAC屋顶观景台,俯瞰城市日落 —— 配有兔子举手机拍照的插画。<br>- **20:00** 晚餐:Arenas商场附近 —— 插画未显示。<br><br>---<br><br>**DAY 3:蒙特惠奇山周边与返程**<br>- **09:00** 早餐,退房寄存行李 —— 插画未显示。<br>- **10:00** 乘坐缆车前往蒙特惠奇城堡 (Montjuïc Castle) —— 配有缆车插画,内含三只卡通动物。<br>- **12:00** 参观米罗基金会 (Joan Miró Foundation) —— 配有米罗风格抽象雕塑插画。<br>- **13:30** 午餐:奥林匹克港附近海鲜饭 —— 插画未显示。<br>- **15:00** 漫步奥林匹克公园 —— 插画未显示。<br>- **16:30** 提取行李,前往机场/车站返程 —— 配有开心挥手的白熊插画。<br><br>---<br><br>**底部交通贴士栏**<br>配有公交车、地铁、步行鞋图标,文字为:“交通贴士:善用T-casual交通卡,步行探索更佳!”<br><br>---<br><br>整体图表类型为时间序列流程图,通过垂直分栏与水平时间轴结合的方式组织信息。数据编码方式包括时间点(精确到分钟)、地点名称、活动描述及配套插画,所有信息均按逻辑顺序排列,便于用户快速理解并执行三天行程计划。视觉元素丰富,兼具实用性和趣味性,适合旅游攻略类内容传播。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents a comprehensive architectural and structural analysis of the Temple of Kom Ombo, an ancient Egyptian temple located on the west bank of the Nile River. The title "TEMPLE OF KOM OMBO" is prominently displayed in a hand-drawn, white-bordered box in the lower-right corner of the image, set against a brown background that mimics sandstone or earth tones. The overall layout is divided into multiple sections: a central photographic image of the temple ruins under a clear blue sky, surrounded by illustrative technical diagrams, annotated floor plans, and textual data blocks, all rendered in white line art and text for high contrast.<br><br>The central photograph shows the main hypostyle hall and surrounding structures of the temple, with visitors walking among the columns and courtyards, providing a sense of scale. In the background, the Nile River and palm trees are visible, situating the temple in its natural environment. The ruins are constructed from light-colored sandstone blocks, consistent with the material noted in the text.<br><br>In the upper-left quadrant, a 3D axonometric diagram illustrates the overall dimensions of the temple complex: approximately 62 meters by 51 meters, labeled along the axes. Adjacent to this, a list of key structural facts is presented in bullet points:<br>- TEMPLE AXIS: DOUBLE SANCTUARY FOR SOBEK & HORUS<br>- OVERALL DIMENSIONS (APPROX. 62M x 51M)<br>- CONSTRUCTION MATERIAL: SANDSTONE BLOCKS<br>- COLUMN HEIGHTS: UP TO 12 METERS<br><br>Above the central photo, two schematic diagrams illustrate architectural details:<br>- A top-down view of the hypostyle hall showing 30 columns arranged in a grid, labeled “HYPOSTYLE HALL (30 COLUMNS)” and pointing to “TWO SANCTUARIES.”<br>- A cross-section labeled “PYLON AND HYPOSTYLE SECTION,” which includes a detailed vertical cutaway showing the roofing system supported by columns, with arrows indicating load paths down to foundations.<br><br>To the right of the central image, text notes “TWO ENTRANCES SYMBOLIZING DUALITY,” emphasizing the temple’s unique dual dedication. This concept is reinforced in the lower section of the infographic, where a detailed floor plan is overlaid on the brown ground area.<br><br>The floor plan, drawn in white lines, is annotated with various features:<br>- INNER TEMPLE (FOR SOBEK) — marked with a rectangular inner sanctum.<br>- INNER TEMPLE (FOR HAROERIS) — another distinct inner sanctum, indicating the dual religious function.<br>- NILOMETER — a structure used to measure the Nile’s water level.<br>- BIRTH HOUSE (MAMMISHI) — a smaller chamber associated with fertility rituals.<br>- MUMMIFIED CROCODILE MUSEUM SITE — indicating a location within the temple complex for sacred crocodile mummies.<br>- TWO ENTRANCES SYMBOLIZING DUALITY — shown as two separate entryways on the plan.<br><br>Surrounding the floor plan are inset images of relief carvings, each labeled:<br>- MEDICAL INSTRUMENT RELIEFS — depicting figures with tools.<br>- TWO ENTRANCES RELIEFS — showing doorways flanked by deities.<br>- CALENDAR RELIEFS — illustrating scenes related to timekeeping or agricultural cycles.<br><br>Additional annotations point to structural aspects:<br>- “STRUCTURAL LOAD PATHS FROM COLUMNS TO FOUNDATIONS” — illustrated with curved arrows tracing the force transfer from columns through the walls to the ground.<br>- The pylon and hypostyle section diagram also labels “ROOFING SYSTEM” and shows how the roof beams rest on column capitals.<br><br>All textual content is in English, using a clean, sans-serif font that enhances readability. The visual style blends real photography with technical illustrations and hand-drawn elements, creating an educational and engaging format suitable for tourists, students, or archaeologists. The infographic effectively communicates both the physical characteristics and symbolic significance of the Temple of Kom Ombo, highlighting its duality, engineering, and cultural importance.</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/7fef7b78-f696-40cc-bdb6-3f712103f516" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/a005796d-34be-45ba-85ad-b591e29bb664" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/f7762af7-6989-4362-8d37-1c5e5e84d061" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ee00a7df-3158-4331-a14c-bf53a1bc8b85" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“面试成功技巧图解”,采用左右分栏对比结构,左侧为“成功”部分(白色背景,红色主题色),右侧为“失败”部分(灰色背景,黑色主题色),通过鲜明的视觉对比突出两种截然不同的面试表现。整体布局清晰,分为三个主要横向模块,每个模块对应一个核心维度,并配有图标、标题和要点列表。<br><br>标题“面试成功技巧图解”位于顶部中央,使用加粗黑体字,字号较大,居中对齐,作为整个图表的总纲。<br><br>左侧“成功”栏目上方有红色标题“成功”,下方副标题为“积极、准备、专业”,用黑色字体书写,强调成功者的特质。右侧“失败”栏目上方有黑色标题“失败”,下方副标题为“被动、随意、消极”,同样用黑色字体,形成对立。<br><br>第一模块:准备<br>- 成功侧:<br> - 图标:红色文件夹叠加放大镜,象征研究与准备。<br> - 标题:“充分准备”<br> - 要点列表:<br> • 深入研究公司<br> • 模拟问答练习<br> • 携带作品集<br>- 失败侧:<br> - 图标:黑色文件夹内含问号,象征无知或未准备。<br> - 标题:“准备不足”<br> - 要点列表:<br> • 对公司一无所知<br> • 回答模糊不清<br> • 两手空空赴约<br><br>第二模块:沟通<br>- 成功侧:<br> - 图标:红色对话气泡(带声波)连接两人,中间有眼睛符号,象征倾听与交流。<br> - 标题:“有效沟通”<br> - 要点列表:<br> • 清晰表达观点<br> • 积极倾听反馈<br> • 自信肢体语言<br>- 失败侧:<br> - 图标:黑色对话气泡内含红色叉号,下方两人表情沮丧,象征沟通中断。<br> - 标题:“沟通障碍”<br> - 要点列表:<br> • 表达混乱无序<br> • 打断对方说话<br> • 眼神闪躲回避<br><br>第三模块:态度<br>- 成功侧:<br> - 图标:红色笑脸配向上箭头,象征积极向上的态度。<br> - 标题:“积极态度”<br> - 要点列表:<br> • 展现热情兴趣<br> • 提出深度问题<br> • 表达真诚感谢<br>- 失败侧:<br> - 图标:黑色哭脸配向下箭头,象征消极情绪。<br> - 标题:“消极态度”<br> - 要点列表:<br> • 缺乏热情活力<br> • 只关注薪资待遇<br> • 抱怨前雇主<br><br>所有模块之间由细线分隔,成功侧各模块间以红色横线分隔,失败侧以黑色横线分隔。左右两栏由一条粗红黑双色竖线分隔,强化对比效果。整体设计简洁明了,通过颜色、图标和文字三重编码,直观传达面试成功与失败的关键差异。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以“相对论时间膨胀图解”为标题,采用几何抽象风格设计,整体布局清晰,色彩鲜明,主要使用红、黄、蓝、黑、白五种颜色构成蒙德里安风格的几何图形背景。图中通过视觉化元素和文字说明,系统解释了狭义相对论中的时间膨胀效应。<br><br>**标题部分**<br>位于图像顶部中央,黑色粗体字“相对论时间膨胀图解”,字体醒目,占据显著位置,明确点出主题。<br><br>**主体图示部分**<br>- **左侧圆形区域代表“地球”**:蓝色圆圈内嵌红色方块,方块中有一个黄色边框的白色时钟,指针指向约12:05(或12:06),时钟下方标注“速度慢,时间快”。此区域象征静止或低速参考系。<br>- **右侧三角形区域代表“太空飞船”**:黄色三角形内嵌蓝色圆圈,圆圈中有一个蓝色边框的白色时钟,指针指向约12:10,时钟下方标注“速度快,时间慢”。此区域象征高速运动参考系。<br>- **连接箭头与标签**:一条粗大的黄色箭头从地球指向太空飞船,表示运动方向;箭头中部标有“时间膨胀效应”,并配有一个向上弯曲的黄色弧形箭头,强调时间流逝差异。<br>- **背景装饰**:由多种几何形状(矩形、三角形、半圆、四分之一圆)拼接而成,形成类似蒙德里安艺术风格的抽象图案,增强视觉吸引力,但不干扰核心信息传递。<br><br>**底部三栏文本说明部分**<br>位于图像下半部分,分为三个垂直栏目,分别用不同颜色标题区分:<br><br>1. **基本概念**(红色标题):<br> - 文本内容:“运动物体的速度越快,其自身的时间流逝相对于静止观察者就越慢。这是狭义相对论的推论。”<br> - 说明时间膨胀的基本定义及其理论来源。<br><br>2. **核心原理**(黄色标题):<br> - 文本内容:“光速在真空中对于所有观察者都是不变的,导致空间和时间必须改变以维持这一恒定。”<br> - 解释时间膨胀的根本原因——光速不变原理。<br><br>3. **应用实例**(蓝色标题):<br> - 文本内容:“GPS卫星因高速运动和地球引力,需进行时间校正以确保定位准确。粒子加速器中的高能粒子寿命延长。”<br> - 列举两个实际应用场景,体现理论的现实意义。<br><br>**整体结构与设计特点**<br>- 图表类型:概念性示意图 + 文本说明组合。<br>- 数据编码方式:通过时钟指针位置的视觉差异(地球时钟较慢,飞船时钟较快)直观展示时间流逝速率的不同,配合文字标签强化理解。<br>- 视觉元素:大量使用几何图形、对比色块、箭头和图标,使复杂物理概念易于理解。<br>- 信息层级:从宏观图示到微观文字,由形象到抽象,逻辑递进,层次分明。<br><br>该信息图成功将抽象的物理学概念转化为直观、易懂的视觉语言,兼具科学性和艺术性。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/bd227342-bed9-4c03-85c9-499b2a1dc97e" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/7dbed837-ac18-4d34-9db3-30e69ce37431" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/08be1f31-3328-4883-88c3-804a3a47ae93" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/84f176e5-4fa5-4514-9f74-da123d04ed73" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以黑板风格设计,标题为“地方特色&活动微信公众号推广全指南”,整体采用手绘粉笔字效果,配以彩色图标和箭头,视觉上模拟真实黑板书写场景。内容结构清晰,分为三个主要部分,通过灰色弧形箭头连接,形成逻辑递进关系:从推广内容核心方向 → 高转化活动推广玩法 → 微信公众号生态适配推广技巧。<br><br>第一部分:“推广内容核心方向:深挖本地特色记忆点”,强调通过三类高流量本地内容吸引用户共鸣并吸引外地游客打卡:<br>- **本土美食**(黄色椭圆标签):包含老字号小吃、季节性特色食俗、社区隐藏小店探店内容,配有热汤碗与筷子图标。<br>- **人文风物**(棕色椭圆标签):涵盖非遗技艺传承故事、老街老巷历史、本地名人旧居探访内容,配有传统建筑与布鞋图标。<br>- **便民福利**(粉色椭圆标签):包括本地专属消费券、景区免票政策、节庆活动预告等内容,配有优惠券与礼盒图标。<br><br>第二部分:“高转化活动推广3种实用玩法”,旨在拉满参与转化率:<br>- **节庆市集玩法**(橙色椭圆标签):公众号预热发早鸟票+留言抽免费参与名额+现场打卡返现,配有灯笼与摊位图标。<br>- **非遗体验玩法**(绿色椭圆标签):开放公众号专属报名通道+提前发布体验官预告内容+活动后用户投稿返现,配有陶艺与织布机图标。<br>- **消费促进玩法**(紫色椭圆标签):联合本地商家推出公众号专属消费券包+到店核销送定制周边,配有购物袋与银行卡图标。<br><br>第三部分:“微信公众号生态适配推广技巧”,聚焦降低推广成本:<br>- **内容呈现技巧**(蓝色椭圆标签):封面图用本地标志性建筑/美食做视觉符号,首图放置活动倒计时海报,文末加一键报名跳转链接,配有手机图标。<br>- **渠道联动技巧**(黄色椭圆标签):视频号发布活动花絮挂载公众号链接,朋友圈广告定向推送给本地18-60岁人群,本地社群转发带专属抽奖码,配有三人社交网络图标。<br>- **私域留存技巧**(绿色椭圆标签):活动参与者引导添加企业微信,拉入本地福利群后续持续推送活动信息,配有微信对话气泡图标。<br><br>整个信息图布局呈垂直流线型,各模块之间以曲线箭头连接,右侧点缀有简笔小人和感叹号等装饰元素,增强趣味性和可读性。文字排版层次分明,主标题白色粗体,副标题与核心概念使用黄色或彩色突出,细节说明则为白色常规字体。所有文本均为中文,无英文或其他语言内容。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为《儿童营养补充全指南:科学建议+产品选购要点》,采用漫画风格设计,色彩鲜明,以红、黄、蓝为主色调,布局清晰分为左右两大板块,每个板块又细分为多个模块,图文并茂地呈现了儿童营养补充的科学指导与实用建议。<br><br>整体结构分为“科学参考指引”和“实操应用指南”两大核心部分,通过卡通插图、图标、爆炸式对话框、标签等视觉元素增强可读性与吸引力。<br><br>---<br><br>**第一部分:科学参考指引**<br><br>1. **分龄营养补充重点清单**<br> - 标题:“分龄营养补充重点清单”,副标题:“分龄补营养,精准更高效;对应年龄段按需补充,避免过度摄入”<br> - 内容按年龄分三个阶段:<br> - **0-6月龄**:每日常规补充维生素D 400IU,纯母乳喂养宝宝需额外补充维生素K。配图:婴儿头像、Vit D注射器、Vit K胶囊。<br> - **7月龄-3岁**:重点补充铁(Fe)、锌(Zn)、DHA,每日维生素D补充量维持在400-600IU。配图:幼儿头像、放大镜观察胶囊、Fe和Zn符号。<br> - **4-12岁**:重点补充钙(Ca)、维生素A、B族维生素(B_B),保证每日蛋白质摄入量达标。配图:男孩头像、Ca气泡、B_B气泡、鸡蛋、牛奶瓶、眼睛图标。<br><br>2. **营养补充原则&常见避坑指南**<br> - 标题:“营养补充原则&常见避坑指南”,副标题:“科学补营养,这些坑要避开”<br> - 包含两个核心原则:<br> - **优先膳食摄入**(绿色对勾):核心原则1:日常均衡饮食是营养摄入的首要来源,不可用补充剂代替正常三餐。配图:孩子用餐场景,盘中有蔬菜、水果、肉类。<br> - **按需适量补充**(红色STOP标志):核心原则2:营养素补充并非越多越好,过量摄入维生素A、钙等可能引发中毒或代谢负担。配图:多瓶补剂被红色叉号覆盖。<br> - **避坑指南**(黄色标签):<br> - ① 不做体检评估盲目跟风补 ❌<br> - ② 把网红补剂当零食给孩子吃 ❌<br> - ③ 用成人补充剂减量给儿童服用 ❌<br> - 配图:红色“避坑”爆炸框,带有闪电效果。<br><br>---<br><br>**第二部分:实操应用指南**<br><br>1. **儿童营养补充产品3步选购法**<br> - 标题:“儿童营养补充产品3步选购法”,副标题:“儿童补剂选购3步判断法”<br> - 三步法分别由放大镜图标引导:<br> - **看合规标识**:优先选择带蓝帽标识的保健食品,或有婴幼儿/儿童专用备案标识的正规产品,拒绝三无产品。配图:放大镜聚焦“蓝帽”标志。<br> - **看配料成分**:优先选择无额外添加蔗糖、香精、人工色素、防腐剂的产品,致敏原标注清晰明确。配图:文件上贴有“无添加”印章,绿色对勾。<br> - **看适配年龄**:选择标注对应适用年龄段的儿童专用产品,不要自行将成人补充剂减量给孩子服用。配图:药瓶标签上“年龄”被红圈突出。<br><br>2. **常见儿童补剂适用场景对照表**<br> - 标题:“常见儿童补剂适用场景对照表”<br> - 表格形式,两列:左侧“补剂类型”,右侧“适用场景”,背景色交替为红、蓝。<br> - 具体内容:<br> - **维生素D滴剂** → 全年龄段儿童日常常规补充,预防佝偻病、促进钙吸收。配图:滴管瓶、骨头图标。<br> - **铁剂** → 体检确诊缺铁性贫血,或日常红肉、动物肝脏摄入不足的儿童。配图:滴管瓶、儿童头像。<br> - **DHA藻油** → 日常深海鱼摄入不足的儿童,辅助促进视网膜和大脑发育。配图:鱼形胶囊、大脑与眼睛图标。<br> - **钙剂** → 日常奶量不足、身高增长偏缓,经体检确认缺钙的儿童。配图:白色药片、儿童测量身高图。<br><br>---<br><br>**视觉与排版特征:**<br>- 整体采用网格化布局,四个主要模块分布在2x2的象限中。<br>- 使用大量漫画元素:如爆炸框、对话气泡、箭头、感叹号、禁止符号等。<br>- 图标系统丰富:Vit D、Fe、Zn、Ca、B_B、蓝帽、无添加、年龄、STOP等均有专属图形标识。<br>- 字体加粗、阴影、边框强调关键信息,如标题、数字、警示语。<br>- 色彩编码明确:黄色用于提示重点,蓝色用于说明步骤,红色用于警示或禁止。<br><br>该信息图内容全面,逻辑清晰,兼具科学性和实用性,适合家长快速掌握儿童营养补充的核心知识与选购技巧。</div></details></td>
</tr>
</table>
### 论文渲染质量 (Paper Rendering Quality)
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/cf47774c-9724-4140-98d6-32c1eee1b6b7" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/bbe55788-6e19-46f6-8f3b-b0098e89d210" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/90c231fb-fef5-4559-b0a7-b994bcc65386" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/b95f8a67-d7ef-40a1-be13-51890e761c4c" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is laid out with two tables at the top, followed by a two-column text layout. The tables span the full width of the text area. The text includes a section heading.<br><br>[paragraphs]<br><br>the TOPIC MODELER, the GENDER SEGMENTER, and an OTHER module (transcript length and duration). We test for a linear relationship between each pair of variables: $H_O : r = 0$, $H_A : r \neq 0$, where $H_O$ is the origi-nal hypothesis, $H_A$ is the alternate hypothesis, and $r$ is the Pearson’s correlation coefficient. We follow Reddy et al. (2021) and Yang et al. (2019) and apply a Bonferroni cor-rection to our $\alpha$ value of $0.05$, setting $\alpha = 0.05/z$, where $z = \binom{124}{2} = 7,626$ for LDA, representing the number of feature relationships we consider. Hence, we reject $H_O$ in favor of $H_A$ if $p \leq \alpha$. Given the largeness of $z$, our $\alpha$ value becomes small, making our criteria for significance strict and thus suitable for investigating our research ques-tions. Furthermore, we filter our correlations $r$, such that $\Vert r\Vert > 0.1$ for our LDA experiments, and $\Vert r\Vert > 0.05$ for our BERTopic experiments (due to the smaller sample size of 10,000 podcasts, and fewer samples may have higher vari-ance). Our results focus on a selection of these significant correlations; the full results are available on the project web-site: https://www.gendered-discourse.net/extended-results.<br><br>### RQ0: How Are Women and Men’s Discourse Different?<br><br>Using GDCF, our Gendered Discourse Correlation Frame-work shown in Figure 2, we then analyze significant corre-lations between between the gender features from the GEN-DER SEGMENTER module (Doukhan et al. 2018a), and the topic features from the TOPIC MODELER module (Blei, Ng, and Jordan 2003). We use the discourse topics to automati-cally form gendered discourse word lists via their significant correlations.<br><br>Starting with the first row of Table 1, we see that Topic 3’s word list returned by LDA with Non-Contextual Embed-dings (Bag-Of-Words) (via the TOPIC MODELER module) contains the words women, woman, men, baby, pregnant, girls, men, doctor, health, birth (in descending weighted or-der). Based on this word list, we manually interpret this topic as being a content topic, specifically about pregnancy, as noted in the column “Topic N Categories.” Then, we look to the gender correlations in the columns “Gender” and “$r$,” and see that $r(\text{Topic 3, Women}) = +0.15$ and $r(\text{Topic 3, Men}) = -0.14$. This indicates that the topic of pregnancy positively correlates with women (identified via the GENDER SEGMENTER module), and negatively corre-lates with men. Therefore, we associate Topic 3 (Content - Pregnancy) with Women, as noted in the “Topic N Gender” column. Similarly, we make these associations in the “Topic N Gender” column for Topics 10, 49, and 71.<br><br>Next, we focus on the Topic 54 row. This topic is inter-preted using the word list get, like, know, right, people, go-ing, podcast, make, want, one. This word list does not refer to any content, hence, we manually interpret this topic as being a discourse topic. Moving to the gender correlations, we see that $r(\text{Topic 54, Women}) = \emptyset$ and $r(\text{Topic 3, Men}) = +0.12$. The reason for $r(\text{Topic 54, Women}) = \emptyset$ is because the correlation between the features Topic 54 and Women did not come back as significant. However, due to the positive correlation of $0.12$ for Topic 3 and Men, we manually as-sociate Topic 3 with Men in the “Topic N Gender” column.<br><br>[tables]<br><br>Table 1: LDA with Non-Contextual Embeddings (Bag-Of-Words): The complete set of significant correlations between gender features and topic features – both content topics and discourse topics. Based on $r$, the Topic N Gender forms the gendered (discourse) word lists via Topics 54 and 60 (the masculine word lists) and Topic 62 (the feminine word list).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 3 | Women <br> Men | 0.15 <br> -0.14 | women, woman, men, baby, pregnant, girls, men, doctor, health, birth | Content - Pregnancy | Women |<br>| Topic 10 | Women <br> Men | 0.10 <br> -0.12 | energy, body, feel, mind, space, yoga, love, beautiful, feeling, meditation | Content - Yoga | Women |<br>| Topic 49 | Women <br> Men | -0.21 <br> 0.17 | game, know, think, team, going, mean, play, year, one, good | Content - Sports | Men |<br>| Topic 71 | Women <br> Men | 0.14 <br> -0.14 | christmas, sex, girl, hair, love, get, date, girls, let, wear | Content - Dating | Women |<br>| Topic 54 | Women <br> Men | – <br> 0.12 | get, like, know, right, people, going, podcast, make, want, one | Discourse | Men |<br>| Topic 60 | Women <br> Men | -0.27 <br> 0.20 | going, know, think, get, got, one, really, good, well, yeah | Discourse | Men |<br>| Topic 62 | Women <br> Men | 0.33 <br> -0.28 | like, know, really, going, people, want, think, get, things, life | Discourse | Women |<br><br>Table 2: BERTopic with Contextual Embeddings (BERT, ChatGPT, Llama): The complete set of significant correlations between gender features and topic features for discourse topics only (content topics are omitted).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 0 | Women <br> Men | -0.08 <br> 0.10 | like, yeah, know, oh, right, podcast, got, going, think, really | Discourse | Men |<br>| Topic 2 | Women <br> Men | 0.08 <br> -0.08 | life, know, things, really, people, feel, like, want, love, going | Discourse | Women |<br>| Topic 5 | Women <br> Men | 0.08 <br> – | like, know, think, yeah, episode, really, going, anchor, kind, right | Discourse | Women |</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout with a single column. The text is justified and divided into sections and subsections, indicated by numbered headings. Important terms at the beginning of some paragraphs are bolded. A horizontal rule separates the header from the main content, and another rule separates the main content from the footnote at the bottom.<br><br>[paragraphs]<br><br>Preprint Version.<br><br>**Figure–Table Integration.** In addition to textual refinement, we extend the refinement process to include multimodal elements, to further enhance readability. For each section, the model first generates visualization requirements, such as tables with structured comparisons or figures with explanatory diagrams, together with natural language descriptions. Based on these descriptions, candidate figures and tables are synthesized. The compiled outputs are then fed back to an LLM for quality assessment, enabling automatic detection of issues such as oversized layouts or unreadable text. The LLM provides corrective suggestions, which are applied to improve the final visualizations. Finally, the text is refined again to ensure that all generated figures and tables are properly referenced within the survey.<br><br># 4 EXPERIMENTS<br><br>## 4.1 EXPERIMENTAL SETTINGS<br><br>**Implementation Details.** Following Wang et al. (2024b), we adopt **GPT-4o-mini** as our genera-tion model for its balance of responsiveness and cost. Our retrieval database contains 680K computer science papers from arXiv, with PDFs converted into structured Markdown using MinerU (Wang et al., 2024a) for consistent formatting. The details of the retrieval process are provided in App. A.1. In outline generation, the system consults 1000–1200 papers, with a maximum of 8 sections. For section drafting, each subsection retrieves up to 60 additional relevant papers, combined with those linked during outline generation. Finally, we apply two iterations of the review-and-refine loop to enhance coherence across sections and improve overall readability. Illustrative outputs compared with AutoSurvey are provided in App. A.8.<br><br>**Baselines.** We compare IterSurvey with a set of baselines, ranging from simple retrieval-augmented generation (Naive RAG), which directly drafts from retrieved documents, to more ad-vanced state-of-the-art systems. Specifically, we evaluate against AutoSurvey (Wang et al., 2024b), the first systematic framework for this task; SurveyForge (Yan et al., 2025), which combines heuris-tic outline generation based on the logical structures of human-written surveys with a memory-driven scholar navigation agent for high-quality retrieval; and SurveyGo (Wang et al., 2025), which em-ploys the LLM×MapReduce-V2 algorithm to address the long-context challenge. We also compare with SurveyX (Liang et al., 2025), which introduces an Attribute Tree-based outlining mechanism; however, due to access restrictions, we include SurveyX only in arena experiments. All methods are evaluated on the same retrieval database with generation hyperparameters aligned to their original settings for fairness.<br><br>## 4.2 AUTOMATIC EVALUATION RESULTS<br><br>**Evaluation Setup.** We employ multiple complementary protocols to evaluate the quality of gen-erated surveys. On the 20-topic suite from Wang et al. (2024b), we adopt multi-dimensional scoring with LLM-as-a-judge. Content quality is assessed along three dimensions: coverage, structure, and relevance followed from Wang et al. (2024b). Besides, citation quality is evaluated using the NLI-based protocol of Gao et al. (2023), reporting both recall and precision: _Citation Recall_ measures whether all statements in the generated text are fully supported by the cited passages, while _Citation Precision_ identifies irrelevant citations to ensure that references are pertinent and directly support the claims. To improve scoring stability and reliability, prompts are standardized and judges must pro-vide a rationale before assigning scores. For additional robustness, we aggregate outputs from three judge models: GPT-4o, Claude-3.5-Haiku, and GLM-4.5V.1 Full prompts are provided in App. A.7.<br><br>**Results.** The results on the 20 topics from Wang et al. (2024b) are reported in Tab. 1. Statistical significance was confirmed via paired t-tests, indicating that IterSurvey consistently outperforms baseline models ($p < 0.05$). We summarize the main observations below.<br><br>- **Overall superiority.** IterSurvey consistently outperforms all baselines across both content and citation quality, achieving the highest overall average score (4.75). This demonstrates that the proposed framework is effective and robust across multiple evaluation dimensions.<br><br>[page_number]<br><br>6<br><br>[footnotes]<br><br>1Specifically, we use `chatgpt-4o-latest`, `claude-3-5-haiku-20241022`, and `glm-4.5v`.</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/da32110a-c025-42cc-9483-3416b406d23e" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/cd54149f-bf28-432c-83c4-0ea461aadab8" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/9de005fa-fb69-40c2-a1b3-6569e2e6a5c9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/acb85d5a-1a20-4c71-a2fc-07760bab0c27" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>This is a single-column page containing mostly text, structured with section headings and bold inline subheadings. URLs are formatted in a monospaced font and hyperlinked.<br><br>[paragraphs]<br><br># A Image generation models<br><br>This section details the two diffusion image generation models used in this work, namely Stable Diffusion 1.4 and 1.5.<br><br>**Stable Diffusion 1.4** The Stable Diffusion model is a text-conditioned image generator model that combines an autoencoder with a diffusion model to create a latent diffusion model. The autoencoder encodes images into latent representations with a reduced dimensionality when compared to the input image, reducing the computational needs during the training phase. Text prompts, on the other hand, are encoded using a text encoder and are then cross-attended by the UNet backbone of the latent diffusion model. Finally, the loss is computed using a reconstruction objective between the noise added to the latent representation and the prediction made by the UNet.<br>Stable Diffusion 1.4 (https://huggingface.co/CompVis/stable-diffusion-v1-4) had several rounds of training on the LAION dataset (https://laion.ai/), with each round changing the input image dimension, aesthetic score, and the probability of dropping the text-conditioning to improve classifier-free guidance.<br><br>**Stable Diffusion 1.5** SD 1.5, in turn, has the same architecture and even the same starting point as 1.4, with the difference being how long the model was fine-tuned on top of SD 1.2. The 1.4 version is fine-tuned for 225 thousand steps at resolution 512x512 on “laion-aesthetics v2 5+” with a 10% probability of dropping the text-conditioning, and version 1.5 for 595 thousand steps.<br>As demonstrated in Section D Stable Diffusion 1.4 has better performance than 1.5 in our approach, therefore, we will adopt SD 1.4 for most of the experiments in this paper.<br><br># B Large language models<br><br>Here we give additional details on the large language models that we used in our experiments.<br><br>**Gemma** (Mesnard et al., 2024), trained on a diverse 6 Trillion token dataset comprising web documents, code and mathematical texts. We resorted to the 7 Billion parameter instruction-tuned decoder-only model, named _gemma-7b-it_ (https://huggingface.co/google/gemma-7b-it). This model uses a chat template, which we employ during inference.<br><br>**Llama 2** (Touvron et al., 2023), of which we used the 7 Billion parameter, pre-trained-only model, _Llama-2-7b_ (https://huggingface.co/meta-llama/Llama-2-7b-hf). This model was trained with a mix of publicly available data totalling 2 Trillion tokens. While its chat versions employ supervised fine-tuning and reinforcement learning with human feedback for alignment with human preferences in helpfulness and safety, the pre-trained-only model does not. This results in a less constrained model, but it may also cause it to disperse from the task at hand. Since this model is a pre-trained-only no chat template is needed.<br><br>**Mistral** (Jiang et al., 2023) fine-tuned on various HuggingFace instruction datasets. We resorted to the 7 Billion _Mistral-7B-Instruct-v0.2_ model (https://huggingface.co/mistralai/ Mistral-7B-Instruct-v0.2) and used the respective chat template during inference.<br><br>**Phi-2** (Gunasekar et al., 2023) is a compact 2.7 Billion model (https://huggingface.co/microsoft/ phi-2). Despite its size, it offers a competitive performance with respect to models several times its size. It was trained on 250 Billion tokens, obtained through a combination of NLP synthetic data created by GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by GPT-4. This model was not fine-tuned through reinforcement learning from human feedback and does not have guardrails.<br><br>**Model ranking**<br>A ranking of these models in terms of their performance can be found in the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) which assesses several LLMs that are trained under the same criteria and tested on the same benchmarks, including reasoning</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout, likely from a preprint server like arXiv. It features a title, author list with affiliations, an abstract, and the beginning of the "Introduction" section. A preprint notification ("Preprint. Under review.") is present at the bottom left. The text on the left margin ("arXiv:2502.01522v2 [cs.CV] 30 May 2025") is a vertical stamp typical of arXiv submissions.<br><br>[paragraphs]<br><br>arXiv:2502.01522v2 [cs.CV] 30 May 2025<br><br># Unpaired Deblurring via Decoupled Diffusion Model<br><br>**Junhao Cheng**$^1$, **Wei-Ting Chen**$^2$, **Xi Lu**$^1$, **Ming-Hsuan Yang**$^3$<br>$^1$Sun Yat-sen University $^2$ Microsoft $^3$ University of California, Merced<br>https://github.com/donahowe/UID-Diff<br><br>**Abstract**<br><br>Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. In favor of their ability to supplement missing details and generate aesthetically pleasing contents, recent works have applied them to image deblurring via training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose UID-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as structural features and blur patterns extractors separately. The features extracted by them will be used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task by leveraging unpaired blurred images from the target domain simultaneously. We further introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization capabilities of UID-Diff when encountering unknown blur patterns. Experiments on real-world datasets demonstrate that UID-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation in various challenging scenarios.<br><br># 1 Introduction<br><br>Dynamic blur occurs when the camera and subject move relative to each other during the exposure time, resulting in a smeared and blurred image. Deblurring, the process of removing the blur pattern while preserving the underlying structure of degraded images, is essential for restoring high-quality images for human perception and low-level computer vision applications.<br><br>With the rapid advancement of photographic technology, a wide range of imaging devices are now employed to capture images in real-world scenarios. Due to their diverse lenses and structural designs, these devices may produce distinct blur patterns [1, 2, 3]. This diversity makes it challenging to develop an all-in-one method for deblurring images from arbitrary and varied sources. Consequently, focusing on deblurring algorithms tailored to specific domains has become increasingly significant.<br><br>As deep learning has advanced in recent years, existing deblurring models predominantly build on data-driven approaches that employ neural networks trained via supervised learning on synthetic paired data. Existing works have made efforts to develop deblurring models upon CNN [4, 5], Transformer [6, 7], and GAN [8, 9]. Recently, a new wave of research [10, 11, 12] has begun to investigate the integration of pre-trained generative diffusion models [13], such as Stable Diffusion (SD) [14], with an adapter designed to provide structural guidance for deblurring. These approaches aim to harness the generative capabilities of diffusion models to supplement missing details and generate aesthetically pleasing outputs. However, since paired blurry-sharp training data is limited in<br><br>[footnotes]<br><br>Preprint. Under review.</div></details></td>
</tr>
</table>
### 综合布局及内容理解 (Overall Aesthetics)
<table width="100%" style="table-layout: fixed; text-align: center;">
<tr>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
<th width="25%" align="center">U1-8B-MoT</th>
<th width="25%" align="center">8B-MoT-Infographic</th>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/6b7aaab9-0d28-487f-a7fc-a7969e311bb9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/8324436d-40e4-4ef1-8a87-9d0502305f84" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/cd510868-b859-4d7c-964e-bda7a9b1d24e" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/aeb08dff-0938-4b88-b895-14dc2fa7dad1" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以“曲尼司特”为标题,整体采用浅蓝白色调,布局清晰,分为多个模块化区域,围绕中央的透明胶囊图像展开。右上角展示曲尼司特的化学结构式及其分子式 C₁₈H₁₇NO₄。<br><br>**1. 活性成分数据(左上)**<br>- 以环形图形式展示成分比例:<br> - 曲尼司特 >98%<br> - 辅料 <2%<br>- 图下方标注:“纯度高,临床级标准”<br><br>**2. 适应症(右上)**<br>- 通过三个图标及文字说明:<br> - 鼻部图标:过敏性疾病<br> - 皮肤纹理图标:纤维化<br> - 疤痕图标:瘢痕疙瘩<br><br>**3. 剂量矩阵(中左)**<br>- 表格形式,包含两列:“口服”和“频率”<br> - 成人:100mg / 次;频率:1-3 次 / 天<br> - 儿童:咨询医生;频率:遵医嘱<br><br>**4. 药代动力学时间轴(中右)**<br>- 折线图,横轴为时间(0h 至 24h),纵轴为浓度(无刻度)<br> - 0h:吸收开始(水滴图标)<br> - 1-2h:峰值浓度(山峰图标)<br> - 4-6h:分布/代谢(循环箭头图标)<br> - 24h:排泄(垃圾桶图标)<br>- 图中标注半衰期 ≈ 5-8h<br><br>**5. 警告网格(左下)**<br>- 分为四个象限,每个配有图标和文字:<br> - 相互作用:CYP酶抑制剂/诱导剂(齿轮图标)<br> - 副作用:胃肠道不适,皮疹(胃图标)<br> - 肝功能:定期监测(肝脏图标)<br> - 肾功能:慎用(肾脏图标)<br><br>**6. 患者适用性(中下)**<br>- 两个图标组合:<br> - 成人:人物图标 + 对勾,标注“成人 适用”<br> - 儿童:人物图标 + 问号 + 医生图标,标注“儿童 咨询医生”<br><br>**7. 储存指南(右下)**<br>- 三个图标并列:<br> - 温度计图标:2-25℃ 室温<br> - 密封瓶图标:密闭<br> - 遮光图标(太阳加斜线):避光<br><br>整体设计风格现代、专业,使用大量图标辅助理解,数据可视化清晰,适合医疗或药品宣传场景。所有文本均为中文,语言准确,无冗余描述。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents an augmented reality (AR) shopping experience overlaid on a real-world retail environment. The scene is set in a brightly lit cosmetics aisle of a store, with shelves stocked with beauty products visible in the background. In the foreground, a pair of hands holds a black rectangular compact labeled "ANASTASIA BEVERLY HILLS BROW POWDER DUO" with "EBONY" and "NET WT. 2.5 OZ." printed below. A gold ring is visible on the left hand’s ring finger, and a black wristband is partially seen on the left wrist.<br><br>Overlaid on the image are several semi-transparent, rounded-corner UI elements resembling AR pop-ups or digital cards, providing contextual information about the product and the user’s shopping list.<br><br>On the left side, a vertical panel titled "SHOPPING LIST" lists four items:<br>1. Face Wash — marked with an “X” (completed)<br>2. Shampoo — marked with an “X” (completed)<br>2. Eye Cream — marked with an empty checkbox (not completed; duplicated item number)<br>3. Eye Cream — marked with an empty checkbox (not completed)<br><br>This suggests a possible error or duplication in the list, with two entries for "Eye Cream".<br><br>In the center-right, a speech-bubble-shaped label displays the price: "$23.00".<br><br>To the right of the product, a larger panel titled "PRODUCT DETAILS:" provides information about the "ABH Brow Powder Duo". It features two color swatches:<br>- Left swatch: labeled "DEEP BROWN"<br>- Right swatch: labeled "BLACK"<br><br>Below the swatches, a star rating system shows four and a half filled stars, accompanied by the text "4.5 out of 5 stars".<br><br>Underneath the rating, a section titled "COMMON USES:" states: "DEFINES & FILLS BROWS".<br><br>Further down, a smaller rectangular box labeled "KEY INGREDIENTS" lists:<br>- Vitamin E<br>- Finely Milled Pigments<br><br>At the bottom right, another box titled "APPLICATION TIPS" includes a video icon (a rectangle with a play triangle) and the word "Video", indicating a multimedia tutorial is available.<br><br>The overall layout mimics an immersive AR interface, likely from a smart glasses or smartphone application, designed to enhance in-store shopping by providing instant, interactive product data directly within the user’s field of view. The visual style uses dark gray, translucent backgrounds with white text for high contrast and readability against the busy store backdrop. The design emphasizes usability, with clear categorization of information into distinct panels and intuitive icons. All textual content is in English, and no other languages are present.</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/8745f320-e406-47ab-aeb0-55e65c0322d8" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/1b3765bb-bd50-42d7-921b-3819be625cd8" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/c4a703ea-56c0-4fa5-a9c4-012180795539" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/53b829cd-3df4-44a4-a518-52cfef916eee" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以深蓝色科技感背景为主,配以紫色和青色的电路板图案边框,营造出未来数字设备的视觉氛围。标题“谷歌最新血氧仪机型参数对比(社媒版)”位于顶部中央,使用发光白色字体,突出主题。整体布局为横向三栏式对比结构,左侧为参数类别标签列,中间及右侧分别为三款智能穿戴设备的参数详情。<br><br>左侧参数类别列以图标+文字形式垂直排列,包括:<br>- 芯片(图标为芯片符号)<br>- 电池(图标为电池符号)<br>- 功能(图标为心电波形符号)<br>- 重量(图标为秤盘符号)<br>- 价格(图标为价格标签符号)<br>- 发售时间(图标为日历符号)<br><br>中间三栏分别对应三款产品:<br>1. **高亮推荐机型:Google Pixel Pulse(最新推荐)** <br> - 标题上方有金色星形徽章“★ 高亮推荐机型”,并用金色边框高亮显示。<br> - 芯片:Tensor G4定制芯片<br> - 电池:7天续航,快充<br> - 功能:连续血氧监测,睡眠/压力追踪,AI健康指导<br> - 重量:28克(轻盈)<br> - 价格:¥1999<br> - 发售时间:2024年10月<br><br>2. **竞品A(例如:Apple Watch S9)** <br> - 芯片:S9 SiP芯片<br> - 电池:18小时(正常使用)<br> - 功能:按需血氧,心电图APP,摔倒检测<br> - 重量:32克<br> - 价格:¥3199<br> - 发售时间:2023年9月<br><br>3. **竞品B(例如:Garmin Venu 3)** <br> - 芯片:Elevated V5传感器<br> - 电池:14天(智能模式)<br> - 功能:全天候血氧,身体电量,GPS运动<br> - 重量:35克<br> - 价格:¥2499<br> - 发售时间:2023年8月<br><br>所有数据均采用清晰的横向分隔线组织,每项参数内容居中对齐,字体为简洁现代的无衬线体,颜色为浅蓝或白色,确保可读性。高亮推荐机型使用金色边框和更明亮的文字,形成视觉焦点。<br><br>底部有一行注释文字:“注:以上参数仅供参考,具体以官方发布为准。#科技 #健康 #谷歌新品 #血氧仪对比”,字体较小,颜色较暗,作为补充说明。<br><br>整体设计风格现代、科技感强,通过色彩对比、边框高亮和图标辅助,有效传达了各机型在关键性能指标上的差异,尤其突出了Google Pixel Pulse在续航、价格和功能集成方面的优势。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图以复古手绘风格呈现,整体布局如一本打开的泛黄书页,背景为米黄色仿旧纸张质感,边缘带有不规则撕裂效果。标题“博物馆游览扩展内容与要点”位于顶部中央,字体为深棕色艺术字,两侧饰有卷曲花纹装饰,视觉上突出主题。<br><br>全图采用六点式结构化布局,围绕中心分布六个核心模块,每个模块均配有独立插画、编号标题和说明文字,通过装饰性边框、花环、丝带等元素进行区分与美化。整体设计风格温馨、文艺,融合了音乐符号、星星、薰衣草、云朵等点缀元素,营造出轻松愉悦的文化探索氛围。<br><br>各模块内容如下:<br><br>1. **沉浸式体验** <br> - 标题:“1. 沉浸式体验” <br> - 说明文字:“参与互动展览,感受历史场景还原,身临其境。” <br> - 视觉元素:左侧描绘一位金发男孩手持放大镜观察一个微缩历史街景模型(包含房屋、摊位和人物),上方有齿轮与灯泡组成的思考气泡,象征探索与发现。右侧配有一个系着粉色蝴蝶结的礼物盒,标签写有“SURPRISE”。<br><br>2. **主题讲座与工作坊** <br> - 标题:“2. 主题讲座与工作坊” <br> - 说明文字:“聆听专家深度解读,亲手制作手工艺品,学习新知。” <br> - 视觉元素:右侧展示一张木桌,桌上摆放陶壶、陶罐、刻刀等手工工具,旁边堆叠书籍与卷轴;周围环绕橄榄枝花环,上方悬挂一串风铃(含月亮、星星与铃铛),背景点缀云朵与星光。<br><br>3. **馆藏珍品探索** <br> - 标题:“3. 馆藏珍品探索” <br> - 说明文字:“寻找镇馆之宝,了解背后的故事与文化价值,深度挖掘。” <br> - 视觉元素:左侧是一个打开的木质宝箱,内有青铜鼎状文物与发光卷轴;旁有绿色玉璧吊坠、散落铜钱,以及一支点燃的白色蜡烛,烛台装饰有薰衣草与小花束。<br><br>4. **特色导览路线** <br> - 标题:“4. 特色导览路线”(置于米色丝带横幅中) <br> - 说明文字:“跟随定制路线,发现隐秘角落与独特视角,别样精彩。” <br> - 视觉元素:下方是一张展开的复古地图,标有拱门、凉亭、佛像、雕塑等景点,以红色虚线连接,并配有指南针图标,体现路径规划概念。<br><br>5. **数字化互动** <br> - 标题:“5. 数字化互动”(置于圆形波点边框内) <br> - 说明文字:“利用AR/VR技术,打破时空限制,体验虚拟现实。” <br> - 视觉元素:右侧描绘一位戴VR眼镜的人正在触控空中悬浮的陶罐图像,周围有Wi-Fi信号、数据图表、声波图等科技元素,体现数字交互场景。<br><br>6. **文化衍生品** <br> - 标题:“6. 文化衍生品” <br> - 说明文字:“选购独特纪念品,将博物馆记忆带回家,延续美好。” <br> - 视觉元素:左下角陈列多种文创商品,包括印有博物馆建筑图案的帆布袋(标有“MUSEUM”)、笔记本、明信片、徽章;右下角则是一盘精致三明治(面包上烙有五角星图案),配蓝莓与卷饼,旁有一只戴派对帽、系蓝色蝴蝶结的白鹅,口中喷出音符,充满童趣。<br><br>整张信息图通过图文结合的方式,系统介绍了博物馆参观的六大延伸活动,既传达实用信息,又兼具美学感染力,适合用于宣传册、教育海报或线上推广材料。所有文本均为中文,无英文或数字编码,语言风格亲切自然,符合大众传播需求。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/32f54d2b-c0cb-4cd0-9513-b17c10f30dab" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/3bdbdb5f-538c-4e2f-971d-7118f9409e96" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/0282904c-fe15-4eef-9bd1-5362beae14f4" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/ef0e8194-3ab6-4d12-9b11-ae59068cc0a7" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为《国际货运代理业务流程与注意事项全指南》,旨在系统梳理国际货运代理从接单到交付的全流程操作规范,并重点提示各环节的核心注意事项与风险规避要点,帮助货运代理新人及外贸商家快速掌握业务逻辑,减少操作失误与纠纷。<br><br>整体设计采用“冰山模型”视觉隐喻:上方露出水面部分展示显性业务流程(共6个步骤),下方水下部分揭示隐性但关键的风险管理与合规要求(共3大注意事项)。背景为米白色纸张质感,边缘有撕纸效果,中央以蓝色渐变冰山贯穿上下,象征业务表层与深层的关联。顶部标题使用深棕色粗体字,配以浅橙色衬底条,副标题为黑色小号字体,简洁明了。<br><br>---<br><br>### 一、显性业务流程(冰山水上部分)——共6个步骤<br><br>每个步骤以圆角矩形卡片呈现,配有图标和箭头连接,形成线性流程图。卡片背景为浅米色,文字为深棕色,标题加粗。<br><br>**Step 1: 接单揽货**<br>- 标题:确认需求 签订委托<br>- 内容:<br> - 对接客户确认货物明细、运输时效、目的地<br> - 核对禁运/限运品,确认报价<br> - 签订正式委托合同,明确权责<br>- 图标:握手图标<br><br>**Step 2: 订舱配载**<br>- 标题:预定舱位 确认出运计划<br>- 内容:<br> - 选择承运商提交订舱申请<br> - 确认舱位后发放入仓通知<br> - 易碎品/危险品提前告知,申请特殊配载<br>- 图标:轮船+飞机图标<br><br>**Step 3: 报关报检**<br>- 标题:完成通关申报 确保合法出境<br>- 内容:<br> - 整理报关资料提交申报<br> - 法检货物提前报检获取通关单<br> - 反复核对信息避免错报漏报<br>- 图标:文件+印章图标<br><br>**Step 4: 货物运输**<br>- 标题:跟踪动态 同步异常<br>- 内容:<br> - 获取运输凭证发给客户<br> - 全程跟踪轨迹,异常情况第一时间告知<br> - 提醒客户购买运输保险<br>- 图标:卡车+地球图标<br><br>**Step 5: 到港清关**<br>- 标题:办理入境申报 完成货物放行<br>- 内容:<br> - 通知收货人准备清关资料提交申报<br> - 核算税费通知缴纳办理放行<br> - 核实特定品类特殊准入要求<br>- 图标:起重机+文件图标<br><br>**Step 6: 派送签收**<br>- 标题:完成交付 确认闭环<br>- 内容:<br> - 安排本地配送车辆送达指定<br> - 提醒收货人当场核验签收<br> - 货损少货第一时间留存证据启动理赔<br>- 图标:快递车图标<br><br>---<br><br>### 二、隐性注意事项(冰山水下部分)——共3大项<br><br>水下部分分为三层,由浅蓝到深蓝渐变,每层对应一个注意事项,左侧为总述,右侧为具体细则,通过虚线箭头连接。<br><br>**注意事项1:单证管理要求**<br>- 总述:所有单证需真实准确、留存备查<br>- 细则:<br> - 单单一致 单证一致(信息完全匹配)<br> - 提单信息反复核对(确认无误后出具)<br> - 单证留存≥2年(备查避免纠纷)<br><br>**注意事项2:风险防控要点**<br>- 总述:提前识别风险,减少业务损失<br>- 细则:<br> - 核实客户与货物资质(避免涉嫌违规)<br> - 提前做好特殊货物防护(温控、防震等)<br> - 异常情况同步留证(明确权责)<br><br>**注意事项3:合规操作要求**<br>- 总述:严守法规红线,避免合规风险<br>- 细则:<br> - 禁运品严禁承接(需有资质审批)<br> - 如实申报不得逃税(严禁低报虚报)<br> - 遵守制裁合规要求(不接违规业务)<br><br>---<br><br>### 三、视觉元素与数据编码<br><br>- **图表类型**:流程图 + 分层结构图(冰山模型)<br>- **数据编码方式**<br> - 步骤顺序:通过箭头连接表示时间或逻辑顺序。<br> - 注意事项层级:通过垂直分层与虚线箭头体现“主-次”关系。<br> - 信息密度:每个步骤/注意事项均包含标题、核心动作和子项列表,信息高度结构化。<br>- **图标系统**:每个步骤配有代表性的扁平化图标(如握手、轮船、文件、卡车等),增强可读性。<br>- **色彩方案**:主色调为米白、浅棕、蓝色系,营造专业、冷静、可靠的感觉。<br><br>---<br><br>该信息图结构清晰、内容详尽,既展示了国际货运代理业务的完整操作链路,又深入挖掘了隐藏在流程背后的管理、风控与合规维度,是极具实用价值的行业指导工具。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“支持向量机(SVM)扩展内容与要点”,整体采用柔和的紫色渐变背景,搭配手绘卡通风格的白色圆角矩形模块,视觉清新且富有教育性。图中通过四个主要模块和一个底部总结栏,系统介绍了SVM的核心扩展内容,各模块之间以虚线箭头连接,形成逻辑关联。<br><br>标题位于顶部中央,使用加粗黑体字并带有白色描边,突出显示主题。<br><br>**1. 基础概念模块** <br>位于左上角,标题为“基础概念”。图示展示蓝色圆形点与红色三角形点在二维空间中被一条实线分隔,两侧有虚线表示间隔区域。右侧有一个手持铅笔的卡通机器人,其对话框内写着“找到最佳分隔线!”。下方文字说明:“最大化间隔,分类数据。” 该部分强调SVM的基本目标是寻找能最大化分类间隔的最优超平面。<br><br>**2. 核技巧(Kernel Trick)模块** <br>位于右上角,标题为“核技巧(Kernel Trick)”。左侧为二维平面上无法线性分离的蓝点与红三角点;中间箭头指向右侧三维曲面图,其中数据点被映射到高维空间后可被平面清晰分割。右侧配有一支魔法棒和星星,标注“核函数”,象征核技巧的“魔法”作用。下方文字说明:“低维不可分 → 高维可分!”,直观解释了核技巧的核心思想。<br><br>**3. 软间隔(Soft Margin)模块** <br>位于左下角,标题为“软间隔(Soft Margin)”。图示与基础概念类似,但部分数据点越过了间隔边界,旁边放置一个橡皮擦图标,标注字母“C”,代表正则化参数。下方文字说明:“允许少量错误,提高泛化能力。” 该模块说明SVM在现实数据中引入松弛变量,以平衡分类准确率与模型泛化性能。<br><br>**4. 多分类(Multi-class)模块** <br>位于右下角,标题为“多分类(Multi-class)”。左侧展示两个类别的数据点(蓝色圆点与红色三角),机器人对话框标“1 vs Rest”,表示一对多策略;右侧展示两个类别分别独立处理,机器人对话框标“1 vs 1”,表示一对一策略。图示通过箭头连接至两个圆形区域,分别包含各自类别的数据点,最终合并为分类结果。下方文字说明:“策略:一对多 或 一对一。” 该部分阐明SVM如何扩展到多类别分类问题。<br><br>**5. 应用与优势栏** <br>位于底部横向长条,标题为“应用与优势”,左侧配有灯泡图标。右侧依次列出三个应用场景及优势,每个前均有对应图标:<br>- 📚 文本分类<br>- 🖼️ 图像识别<br>- 👍 小样本表现好<br><br>整体布局结构清晰,从基础到进阶,从二分类到多分类,从线性到非线性,层层递进。图中所有文字均为中文,使用简洁明了的术语,配合生动的插画,使复杂的机器学习概念易于理解。颜色编码用于区分不同类别(蓝色圆点、红色三角、绿色方块),图形元素如机器人、魔法棒、橡皮擦等增强了趣味性和记忆点。</div></details></td>
</tr>
<tr>
<td valign="top"><img src="https://github.com/user-attachments/assets/6cdaaf0c-93cd-42ff-af03-f85c91d369c7" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/6b209397-6f90-40b1-acb8-19053c2999ba" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/d271df25-6fb4-4e33-a6f8-12f65287d1f9" width="100%"></td>
<td valign="top"><img src="https://github.com/user-attachments/assets/cb134523-acda-4155-b847-c1da10f0b8f9" width="100%"></td>
</tr>
<tr>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">该信息图题为“钢琴&古筝学习参考指南”,采用类似地铁线路图的视觉设计风格,以清晰的分支结构组织内容,通过不同颜色的线条(红色、蓝色、绿色)区分主要主题模块,并使用圆形节点表示关键信息点,整体布局简洁明了,背景点缀有灰色城市轮廓和浅蓝色水波纹装饰元素,增强现代感与层次感。<br><br>图表顶部中央为标题“钢琴&古筝学习参考指南”,字体加粗醒目。从标题节点向下延伸出一条红色主干线路,代表“钢琴vs古筝核心演奏技巧差异”这一核心对比主题。该红色线路在中间分叉为两条平行支线:左侧为“钢琴(西洋键盘乐器)”,右侧为“古筝(中国民族弹拨乐器)”。<br><br>在“钢琴(西洋键盘乐器)”支线下,依次列出四个关键特征节点:<br>- 手指+手腕+手臂协同发力<br>- 标准化指法逻辑<br>- 触键力度均匀、落点准确<br>- 出厂固定音高<br><br>在“古筝(中国民族弹拨乐器)”支线下,依次列出四个对应特征节点:<br>- 佩戴义甲演奏<br>- 按、颤、滑、摇多类技法<br>- 触弦角度30-45度<br>- 左手按弦控制音高<br><br>两条支线在中间交汇处用一个带有双向箭头符号的圆形节点连接,强调两者对比关系。<br><br>从红色主干线路下方引出一条蓝色线路,标记为“两类乐器通用高效练习法”。该蓝色线路包含四个节点:<br>- 热身练习:手指拉伸、音阶爬音<br>- 分节练习:4-8小节片段拆分<br>- 慢练优先:原速50%保证清晰度<br>- 录音复盘:针对性调整问题点<br><br>再从蓝色线路下方引出一条绿色线路,标记为“演奏常见误区避坑指南”。该绿色线路呈发散状,连接四个警告/提示节点,分别用禁止符号(⛔)或警示符号(⚠️)标识:<br>- ⛔ 不要塌腕:手腕与琴键/弦面持平<br>- ⚠️ 不要追求速度优先:音准、节奏、情感是核心<br>- ⛔ 不要忽视乐理学习:帮助理解曲目逻辑<br>- ⚠️ 不要长时间连续练习:每45分钟休息10分钟<br><br>所有文字内容均为中文,字体清晰易读,层级分明,通过颜色编码(红=对比、蓝=方法、绿=避坑)和图标辅助(箭头、禁止、警示),使信息传递高效直观。此图旨在为学习者提供系统性指导,涵盖乐器特性、科学练习方法及常见错误规避,兼具实用性和教育意义。</div></details></td>
<td colspan="2" valign="top"><details><summary><b>Prompt</b></summary><div style="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "Personal Insight & Casual Life Reflection Essay Collections: A Complete Guide" is presented in a tactile, clay-like 3D style with rounded edges and textured surfaces, giving it a handcrafted, approachable aesthetic. The background is a neutral light gray, which makes the colorful clay elements stand out. The title is displayed at the top in two lines: the first line "Personal Insight & Casual Life Reflection" is in a warm terracotta brown, while the second line "Essay Collections: A Complete Guide" uses a bright sky blue for "A Complete Guide," creating visual emphasis.<br><br>The infographic is organized as a grid with three main vertical sections (rows) and seven horizontal columns. The leftmost column contains the main categories, each labeled with a large blue clay tile with yellow text. These are:<br><br>1. **Core Traits of the Genre** – with a smaller brown clay subtitle: "Key traits of the personal insight & life reflection genre"<br>2. **Value for Readers & Writers** – with a smaller brown clay subtitle: "Why These Collections Matter"<br>3. **Actionable Engagement Steps** – with a smaller brown clay subtitle: "Simple steps for new readers and writers"<br><br>The remaining six columns form the header row, each represented by a colored clay tile with white text:<br><br>- **Infographic Multi-factor** (blue)<br>- **Statistics** (teal)<br>- **Quotes** (orange)<br>- **Key Terms** (green)<br>- **Style Preferences** (purple)<br>- **Layout Preferences** (light blue)<br>- **Other Requirements** (dark purple)<br><br>Each of the three main rows then has corresponding content under these headers. For the first two data points under each category (Statistics, Quotes, Key Terms), there is a white clay tile with a brown checkmark (√), indicating inclusion or relevance. For the last three headers (Style Preferences, Layout Preferences, Other Requirements), there is a white tile with a number "3" in the respective color of that column’s header.<br><br>Below each checkmark or number, there is a larger white clay tile with black text describing the specific point. The full structure is as follows:<br><br>---<br><br>### Row 1: Core Traits of the Genre<br><br>- **Statistics**: √ → *Rooted in real lived experience*<br>- **Quotes**: √ → *Inspired by small everyday moments*<br>- **Key Terms**: √ → *No formal structure rules*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 2: Value for Readers & Writers<br><br>- **Statistics**: √ → *Readers: Comfort & relatable validation*<br>- **Quotes**: √ → *Writers: Low-stakes creative outlet*<br>- **Key Terms**: √ → *All audiences: Record of personal growth*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 3: Actionable Engagement Steps<br><br>- **Statistics**: √ → *Step 1: Start with short essays to read*<br>- **Quotes**: √ → *Step 2: Write 100-word daily reflections*<br>- **Key Terms**: √ → *Step 3: Group by theme to build your collection*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (green) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>The design uses consistent visual encoding: checkmarks indicate presence or applicability, while the number "3" likely denotes a level of preference or importance (though no legend is provided). The colors are used both for column headers and for the numbers, creating a visual link between the header and its corresponding data point. All text is rendered in a clean, sans-serif font, and the overall layout is structured and easy to follow, with clear separation between rows and columns via subtle clay dividers. The infographic communicates a comprehensive yet accessible guide to engaging with personal insight and life reflection essays through a visually engaging and tactile design.</div></details></td>
</tr>
</table>
## 案例展示
> ✨ **想了解模型的实际效果?欢迎前往 👉 [ 🖼️ 信息图案例展示 ](./u1_infographic_showcases.md) 👈 浏览 100 个生成样例!**
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment