This document records VRAM usage and performance benchmarks for the SenseNova-U1-8B-MoT model across different inference tasks. All tests are run with the `--profile` flag on a single NVIDIA H100 80G GPU.
---
## Text-to-Image
Standard text-to-image inference without chain-of-thought.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt"Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus."\
Enables chain-of-thought reasoning (`--think`), where the model outputs its reasoning process before generating the image. Generation time and VRAM usage increase slightly.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus." \
Image editing requires both an input image and an editing instruction. Processing the additional input image results in higher peak VRAM usage compared to plain text-to-image.
Interleaved generation produces multiple images and corresponding text in a single inference call. Per-image token count is lower, but overall VRAM usage and wall time are substantially higher.
```bash
python examples/interleave/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT/ \
--prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." \
> VRAM columns are formatted as `allocated / reserved`. CPU RSS is the peak RSS during the generation phase.
## Low-VRAM Inference (Text-to-Image as Example)
### VRAM Budget Cap (`--max_memory`)
The `--max_memory` parameter caps the GPU VRAM budget to simulate consumer-grade GPUs with varying VRAM capacities, covering 32 GB (e.g. RTX 5090), 24 GB (e.g. RTX 4090), 16 GB (e.g. RTX 4080), 12 GB (e.g. RTX 4070), and 8 GB (e.g. RTX 4060). Model layers exceeding the VRAM budget are automatically offloaded to CPU memory, so CPU RSS rises significantly as the GPU budget decreases.
> It is recommended to set `max_memory` slightly below the GPU's physical VRAM (e.g. use `26GiB`–`28GiB` for a 32 GB card) to leave enough headroom and avoid OOM errors during inference.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "..." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--device_map auto \
--max_memory "0=<N>GiB,cpu=80GiB" \
--profile
```
| GPU Budget | Target GPU | Load Peak VRAM (GiB) | Gen Peak VRAM (GiB) | Load CPU RSS (GiB) | Gen CPU RSS (GiB) | Avg Time (s) | Throughput (tok/s) |
> VRAM columns are formatted as `allocated / reserved`. As the GPU budget decreases, model layers are progressively offloaded to CPU, causing CPU RSS to rise and inference throughput to drop.
### VRAM Optimization Mode (`--vram_mode`)
The `--vram_mode` parameter selects the VRAM optimization strategy, trading off inference speed against VRAM footprint.
```bash
python examples/t2i/inference.py \
--model_path checkpoints/SenseNova-U1-8B-MoT \
--prompt "..." \
--output_dir outputs/ \
--cfg_scale 4.0 \
--cfg_norm none \
--timestep_shift 3.0 \
--num_steps 50 \
--vram_mode <full|balanced|low>\
--profile
```
| `--vram_mode` | Strategy | Load Peak VRAM (GiB) | Gen Peak VRAM (GiB) | Load CPU RSS (GiB) | Gen CPU RSS (GiB) | Avg Time (s) | Throughput (tok/s) |
> VRAM columns are formatted as `allocated / reserved`. `balanced` and `low` modes use lazy loading — no GPU VRAM is allocated during model load (shown as —); weights are swapped in on demand, causing CPU RSS to rise significantly during generation.
--prompt"Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus."\
--prompt "Close portrait of an elderly woman by a farmhouse window, textured skin, gentle smile, warm natural light, emotional documentary look. The portrait should feel polished and natural, with sharp eyes, realistic skin texture, accurate facial anatomy, and premium lighting that keeps the face as the main focus." \
This document describes the inference infrastructure behind **SenseNova-U1**, built on top of **[LightLLM](https://github.com/ModelTC/lightllm)** and **[LightX2V](https://github.com/ModelTC/lightx2v)**.
## Overview
SenseNova-U1 is exposed as one unified multimodal model, but the understanding and generation paths exhibit different execution shapes in production. They tend to prefer different scheduling policies, parallelization strategies, and resource ratios, rather than a single shared serving configuration. When both are coupled inside one monolithic runtime, these choices become unnecessarily tied together, which can leave both paths operating away from their respective optimal points.
To avoid this coupling, SenseNova-U1 adopts a **disaggregated** architecture:
-**LightLLM** for understanding, text streaming, and control flow
-**LightX2V** for image generation
These two engines exchange generation state through pinned shared memory and high-performance transfer kernels. The handoff is lightweight, while each side can still run with its own optimal execution policy.
This design provides practical benefits in production:
- Independent parallelism (for example, understanding with `TP=2` (Tensor Parallel=2), generation
with `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2)).
- Independent resource allocation (different GPU counts and memory budgets).
- Independent scaling for text-heavy vs. image-heavy traffic.
- Better operational isolation and simpler performance tuning.
The same architecture can be deployed in two modes, depending on your hardware budget and traffic pattern:
-**Separate**: LightLLM and LightX2V run on different GPU groups.
-**Colocate**: LightLLM and LightX2V run as separate processes on the same GPU.
In most production setups, `Separate` is the default choice because it gives clearer bottleneck control and independent scaling. `Colocate` is useful for quick validation, generation-heave scenes, or smaller GPU setups.
### Attention for Multimodal Prefill of NEO-Unify
NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
To preserve the causal-triangle speedup whenever possible, the kernel makes the decision per M-block. It OR-reduces the image_token_tag values inside the current block: if the block contains no image token, it keeps the standard causal K-range; if the block contains image tokens, it extends the K-range to cover the required image span. As a result, pure-text blocks still follow the normal causal path, while only the relevant blocks pay the extra work needed by the hybrid mask.
The overhead therefore does not depend on a fixed ratio, but on how image tokens are distributed across the sequence and across M-block boundaries. When image rows are concentrated in only part of the sequence, the extra work is correspondingly localized. For text-only requests, image_token_tag is empty, and the kernel falls back to vanilla FA3 with no additional overhead.
The benchmark below compares two implementations for Neo-style multimodal prefill:
-**Triton implementation**: easier to migrate into existing codebases, with lower
integration cost and faster iteration.
-**FA3 implementation**: higher absolute performance on supported hardware.
In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
### Cross-Model Speed Comparison
The table below compares the latency of a single diffusion step for
**2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
all measurements are taken on **H100**; the `TP2+CFG2` result uses
`2x H100`.
Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
This guide covers setting up the Python environment for running SenseNova-U1 locally with the `transformers` backend.
> **Software versions:** Python 3.11, torch 2.8, CUDA 12.8 (cu128). Update `pyproject.toml` index URLs if your driver requires a different CUDA version.
We recommend [**uv**](https://docs.astral.sh/uv/) to manage the Python environment.
The `sensenova_u1` package is installed in editable mode, so the canonical [NEO-Unify model](../src/sensenova_u1/models/neo_unify/) is automatically registered with `transformers.Auto*` at import time.
> **Older NVIDIA drivers:** the default index is CUDA 12.8. If your driver
> does not support cu128, change `[tool.uv.sources]` / `[[tool.uv.index]]`
> in `pyproject.toml` to e.g. `https://download.pytorch.org/whl/cu126` (and
> adjust the pinned torch / torchvision versions accordingly) before
> running `uv sync`.
## Optional: flash-attn
`flash-attn` is declared as an optional extra;
without it the model transparently falls back to torch SDPA;
once flash-attn is importable the runtime picks it automatically (`--attn_backend auto`).
```bash
# (a) Build from source via PyPI
uv sync--extra flash
# (b) Install a prebuilt CUDA wheel matching your torch + Python
First, create a `.env` file and populate it with the four required parameters. Then just add `--enhance` to your `examples/t2i/inference.py` command line.
Add `--print_enhance` to echo the original + enhanced prompt for
debugging.
To use **SenseNova 6.7 Flash-Lite** as the enhancer, get an API key from
| **Gemini 3.1 Pro** (Default) | `chat_completions` | `https://generativelanguage.googleapis.com/v1beta/openai/chat/completions` | Best overall infographic quality in our internal bench. Excellent at structured / hierarchical content. |
| SenseNova 6.7 Flash-Lite | `chat_completions` | `https://token.sensenova.cn/v1/chat/completions` | Near Gemini 3.1 Pro quality on Chinese content at lower per-token cost, preferred for production. |
| Anthropic Claude (Sonnet/Opus) | `anthropic` | `https://api.anthropic.com/v1/messages` | Strong typography discipline, slightly less "information-dense" out of the box. |
| Kimi 2.5 | `chat_completions` | `https://api.moonshot.cn/v1/chat/completions` | Good Chinese enhancements, weaker for English-dense infographics in our runs. |
| Gemini 3.1 Flash-Lite (Third-party service) | `chat_completions` | `https://aigateway.edgecloudapp.com/v1/f194fd69361cd590f1fa136c9c90eca1/senseai` | The overall quality of the information chart is high and its generation speed is fast. |
| Kimi 2.5/Qwen3.6-Plus (Third-party service) | `chat_completions` | `https://coding.dashscope.aliyuncs.com/v1/chat/completions` | Good Chinese enhancements. Different models can be flexibly selected. |
## 4. Qualitative comparison
> The table below will be populated with side-by-side samples from the same
> handful of base prompts, rendered at `2048×2048` with identical sampler
> knobs. PRs with new backends welcome.
| Base prompt | No enhance | Gemini 3.1 Pro | SenseNova | Qwen3.6-Plus | Kimi 2.5 |
| Popularizing the importance of three meals a day | <imgsrc="assets/showcases/prompt_enhancement/case3.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_gemini_enhanced.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_sensenova_enhanced.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_qwen_enhanced.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_kimi_enhanced.webp"width="150"> |
| Popularizing the importance of three meals a day | <imgsrc="assets/showcases/prompt_enhancement/case3.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_gemini_enhanced.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_sensenova_enhanced.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_qwen_enhanced.webp"width="150"> | <imgsrc="assets/showcases/prompt_enhancement/case3_kimi_enhanced.webp"width="150"> |
| [<img width="300" alt="t2i general dense face hd 07" src="./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
| [<img width="260" alt="t2i general face hd 13" src="./assets/showcases/t2i_general/1_1_face_hd_13.webp">](./assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./assets/showcases/t2i_general/1_1_face_hd_17.webp">](./assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
| [<img width="260" alt="t2i general landscape 06" src="./assets/showcases/t2i_general/1_1_landscape_06.webp">](./assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./assets/showcases/t2i_general/1_1_landscape_07.webp">](./assets/showcases/t2i_general/1_1_landscape_07.webp) |
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./assets/showcases/t2i_general/9_16_human_pose_11.webp">](./assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./assets/showcases/t2i_general/9_16_artistic_07.webp">](./assets/showcases/t2i_general/9_16_artistic_07.webp) |
| [<img width="200" alt="t2i general portrait text rendering 02" src="./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
<tdstyle="vertical-align: top;">The playful craft that embodies Russian cultural charm</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is the matryoshka, identified as a Russian craft. Essential modifiers include playful and Russian cultural charm. The relation is that the craft embodies the charm. 2. <b>Reasoning Process:</b> The prompt identifies the matryoshka, the iconic Russian wooden doll set. 3. <b>Establish the frame:</b> The frame captures the matryoshka set in the foreground. The composition focuses on the Russian craft to show the playful nature of the doll set. 4. <b>Set the lighting and color:</b> Lighting illuminates the matryoshka to reveal the Russian cultural charm. The color palette supports the playful craft aesthetic. 5. <b>Lock the style:</b> The style emphasizes the wooden nature of the Russian craft. The finish reflects the charm of the matryoshka. 6. <b>Explicit Prompt:</b> A set of three colorful, hand-painted wooden matryoshka dolls arranged by size on a rustic wooden table, bright floral patterns, soft natural light.</div></td>
<tdstyle="vertical-align: top;">A typical dish from the country where Naples is located</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
<tdstyle="vertical-align: top;">A gigantic bubble in the immediate foreground with a small town barely visible inside</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The prompt requires a gigantic bubble positioned in the immediate foreground. A small town must be visible inside the bubble. 2. <b>Reasoning Process:</b> The bubble acts as a transparent lens, potentially distorting the town's appearance due to its curvature. 3. <b>Establish the frame:</b> The shot scale is an extreme close-up, placing the bubble so close it fills the majority of the frame. The focus plane is set on the bubble's surface, rendering the background outside the town soft and out of focus. 4. <b>Set the lighting and color:</b> Natural daylight creates bright specular highlights on the upper surface of the bubble. The color palette is dominated by the cool transparency of the bubble and the muted tones of the town within. 5. <b>Lock the style:</b> The overall image style is photorealistic with a high level of detail on the bubble's texture. The rendering captures the glossy finish and physical properties of the bubble accurately. 6. <b>Explicit Prompt:</b> An extreme close-up of a massive, transparent soap bubble floating in the immediate foreground, with a tiny, detailed miniature town visible inside the curvature, refracted and magnified by the surface tension, soft natural daylight, dreamlike, high detail, photorealistic texture.</div></td>
<tdstyle="vertical-align: top;">A chocolate bar left in direct sunlight, highlighting the state of the chocolate</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a chocolate bar positioned in direct sunlight. The focus is on the state of the chocolate, specifically how the heat affects it. 2. <b>Reasoning Process:</b> Heat causes chocolate to melt, losing its structured shape and becoming a viscous, glossy liquid. 3. <b>Establish the frame:</b> The composition is a close-up shot that fills the frame with the chocolate bar to emphasize detail. The angle is slightly elevated to show the top surface and the pooling liquid clearly. 4. <b>Build the environment:</b> The chocolate bar rests on a generic surface that supports the object without distracting from the main subject. The background is blurred to keep attention on the foreground elements and the chocolate. 5. <b>Set the lighting and color:</b> Direct sunlight creates bright highlights on the melting chocolate, emphasizing its glossy texture. The lighting is warm and intense, casting distinct shadows and illuminating the rich brown colors of the liquid. 6. <b>Explicit Prompt:</b> A close-up of a melting chocolate bar on a surface, with the edges losing their defined shape and pooling into a glossy, viscous puddle under the heat of the sun.</div></td>
<tdstyle="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
| <divalign="center"><ahref="../examples/editing/data/images/1.webp"><imgwidth="180"alt="editing input 1"src="../examples/editing/data/images/1.webp"></a><ahref="../docs/assets/showcases/editing/1_out.webp"><imgwidth="180"alt="editing output 1"src="../docs/assets/showcases/editing/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/3.webp"><imgwidth="180"alt="editing input 3"src="../examples/editing/data/images/3.webp"></a><ahref="../docs/assets/showcases/editing/3_out.webp"><imgwidth="180"alt="editing output 3"src="../docs/assets/showcases/editing/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <divalign="center"><ahref="../examples/editing/data/images/2.webp"><imgwidth="180"alt="editing input 2"src="../examples/editing/data/images/2.webp"></a><ahref="../docs/assets/showcases/editing/2_out.webp"><imgwidth="180"alt="editing output 2"src="../docs/assets/showcases/editing/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/4.webp"><imgwidth="180"alt="editing input 4"src="../examples/editing/data/images/4.webp"></a><ahref="../docs/assets/showcases/editing/4_out.webp"><imgwidth="180"alt="editing output 4"src="../docs/assets/showcases/editing/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <divalign="center"><ahref="../examples/editing/data/images/5.webp"><imgwidth="180"alt="editing input 5"src="../examples/editing/data/images/5.webp"></a><ahref="../docs/assets/showcases/editing/5_out.webp"><imgwidth="180"alt="editing output 5"src="../docs/assets/showcases/editing/5_out.webp"></a><br><sub>Turn the image into an American comic style.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/8.webp"><imgwidth="180"alt="editing input 8"src="../examples/editing/data/images/8.webp"></a><ahref="../docs/assets/showcases/editing/8_out.webp"><imgwidth="180"alt="editing output 8"src="../docs/assets/showcases/editing/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> |
| <divalign="center"><ahref="../examples/editing/data/images/6.webp"><imgwidth="180"alt="editing input 6"src="../examples/editing/data/images/6.webp"></a><ahref="../docs/assets/showcases/editing/6_out.webp"><imgwidth="180"alt="editing output 6"src="../docs/assets/showcases/editing/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/7.webp"><imgwidth="180"alt="editing input 7"src="../examples/editing/data/images/7.webp"></a><ahref="../docs/assets/showcases/editing/7_out.webp"><imgwidth="180"alt="editing output 7"src="../docs/assets/showcases/editing/7_out.webp"></a><br><sub>Remove the person on the far right wearing a green skirt and a green top.</sub></div> |
1.<b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag’s shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a 3D perspective view of a traffic cone. 2. <b>Instruction Understanding:</b> The instruction asks for a side view. 3. <b>Reasoning Process:</b> A side view of a standard traffic cone results in a triangular silhouette with a flat rectangular base. 4. <b>Expected Visual Changes:</b> The perspective is flattened into this 2D-like geometric profile. 5. <b>Elements to Preserve:</b> The cone's height and color should remain consistent with the original object. 6. <b>Explicit Edit Prompt:</b> Edit the perspective view into a flat side-profile silhouette of a triangle with a rectangular base, keeping the red color and proportions unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a kettle pouring water onto a mug, but the stream is misaligned and missing the cup. 2. <b>Instruction Understanding:</b> The instruction asks to fix the physical inconsistency. 3. <b>Reasoning Process:</b> The water stream must be redirected to connect the spout to the mug, maintaining the trajectory of gravity. 4. <b>Expected Visual Changes:</b> The water stream will be redirected to connect the spout to the mug. 5. <b>Elements to Preserve:</b> The kettle, mug, and background must remain unchanged while the water path is corrected. 6. <b>Explicit Edit Prompt:</b> Draw a continuous water stream connecting the kettle spout to the mug, keeping the kettle, mug, and background unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a 2x2 matrix with values 1, 2, 3, and 4. 2. <b>Instruction Understanding:</b> The instruction asks to convert this to an upper triangular matrix. 3. <b>Reasoning Process:</b> By definition, an upper triangular matrix has zeros below the main diagonal, so the entry '3' must be changed to '0' while keeping '1', '2', and '4' as they are, and this modification satisfies the mathematical property requested. 4. <b>Expected Visual Changes:</b> The entry '3' in the lower-left position will be changed to '0'. 5. <b>Elements to Preserve:</b> The grid lines, the matrix structure, and the other entries must remain unchanged. 6. <b>Explicit Edit Prompt:</b> Change the '3' in the lower-left position to '0', while keeping the matrix structure and other entries unchanged.</div></td>
| [<img width="300" alt="t2i general dense face hd 07" src="./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
| [<img width="260" alt="t2i general face hd 13" src="./assets/showcases/t2i_general/1_1_face_hd_13.webp">](./assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./assets/showcases/t2i_general/1_1_face_hd_17.webp">](./assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
| [<img width="260" alt="t2i general landscape 06" src="./assets/showcases/t2i_general/1_1_landscape_06.webp">](./assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./assets/showcases/t2i_general/1_1_landscape_07.webp">](./assets/showcases/t2i_general/1_1_landscape_07.webp) |
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./assets/showcases/t2i_general/9_16_human_pose_11.webp">](./assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./assets/showcases/t2i_general/9_16_artistic_07.webp">](./assets/showcases/t2i_general/9_16_artistic_07.webp) |
| [<img width="200" alt="t2i general portrait text rendering 02" src="./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
<tdstyle="vertical-align: top;">The playful craft that embodies Russian cultural charm</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is the matryoshka, identified as a Russian craft. Essential modifiers include playful and Russian cultural charm. The relation is that the craft embodies the charm. 2. <b>Reasoning Process:</b> The prompt identifies the matryoshka, the iconic Russian wooden doll set. 3. <b>Establish the frame:</b> The frame captures the matryoshka set in the foreground. The composition focuses on the Russian craft to show the playful nature of the doll set. 4. <b>Set the lighting and color:</b> Lighting illuminates the matryoshka to reveal the Russian cultural charm. The color palette supports the playful craft aesthetic. 5. <b>Lock the style:</b> The style emphasizes the wooden nature of the Russian craft. The finish reflects the charm of the matryoshka. 6. <b>Explicit Prompt:</b> A set of three colorful, hand-painted wooden matryoshka dolls arranged by size on a rustic wooden table, bright floral patterns, soft natural light.</div></td>
<tdstyle="vertical-align: top;">A typical dish from the country where Naples is located</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
<tdstyle="vertical-align: top;">A gigantic bubble in the immediate foreground with a small town barely visible inside</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The prompt requires a gigantic bubble positioned in the immediate foreground. A small town must be visible inside the bubble. 2. <b>Reasoning Process:</b> The bubble acts as a transparent lens, potentially distorting the town's appearance due to its curvature. 3. <b>Establish the frame:</b> The shot scale is an extreme close-up, placing the bubble so close it fills the majority of the frame. The focus plane is set on the bubble's surface, rendering the background outside the town soft and out of focus. 4. <b>Set the lighting and color:</b> Natural daylight creates bright specular highlights on the upper surface of the bubble. The color palette is dominated by the cool transparency of the bubble and the muted tones of the town within. 5. <b>Lock the style:</b> The overall image style is photorealistic with a high level of detail on the bubble's texture. The rendering captures the glossy finish and physical properties of the bubble accurately. 6. <b>Explicit Prompt:</b> An extreme close-up of a massive, transparent soap bubble floating in the immediate foreground, with a tiny, detailed miniature town visible inside the curvature, refracted and magnified by the surface tension, soft natural daylight, dreamlike, high detail, photorealistic texture.</div></td>
<tdstyle="vertical-align: top;">A chocolate bar left in direct sunlight, highlighting the state of the chocolate</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a chocolate bar positioned in direct sunlight. The focus is on the state of the chocolate, specifically how the heat affects it. 2. <b>Reasoning Process:</b> Heat causes chocolate to melt, losing its structured shape and becoming a viscous, glossy liquid. 3. <b>Establish the frame:</b> The composition is a close-up shot that fills the frame with the chocolate bar to emphasize detail. The angle is slightly elevated to show the top surface and the pooling liquid clearly. 4. <b>Build the environment:</b> The chocolate bar rests on a generic surface that supports the object without distracting from the main subject. The background is blurred to keep attention on the foreground elements and the chocolate. 5. <b>Set the lighting and color:</b> Direct sunlight creates bright highlights on the melting chocolate, emphasizing its glossy texture. The lighting is warm and intense, casting distinct shadows and illuminating the rich brown colors of the liquid. 6. <b>Explicit Prompt:</b> A close-up of a melting chocolate bar on a surface, with the edges losing their defined shape and pooling into a glossy, viscous puddle under the heat of the sun.</div></td>
<tdstyle="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><divstyle="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
| <divalign="center"><ahref="../examples/editing/data/images/1.webp"><imgwidth="180"alt="editing input 1"src="../examples/editing/data/images/1.webp"></a><ahref="../docs/assets/showcases/editing/1_out.webp"><imgwidth="180"alt="editing output 1"src="../docs/assets/showcases/editing/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/3.webp"><imgwidth="180"alt="editing input 3"src="../examples/editing/data/images/3.webp"></a><ahref="../docs/assets/showcases/editing/3_out.webp"><imgwidth="180"alt="editing output 3"src="../docs/assets/showcases/editing/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <divalign="center"><ahref="../examples/editing/data/images/2.webp"><imgwidth="180"alt="editing input 2"src="../examples/editing/data/images/2.webp"></a><ahref="../docs/assets/showcases/editing/2_out.webp"><imgwidth="180"alt="editing output 2"src="../docs/assets/showcases/editing/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/4.webp"><imgwidth="180"alt="editing input 4"src="../examples/editing/data/images/4.webp"></a><ahref="../docs/assets/showcases/editing/4_out.webp"><imgwidth="180"alt="editing output 4"src="../docs/assets/showcases/editing/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <divalign="center"><ahref="../examples/editing/data/images/5.webp"><imgwidth="180"alt="editing input 5"src="../examples/editing/data/images/5.webp"></a><ahref="../docs/assets/showcases/editing/5_out.webp"><imgwidth="180"alt="editing output 5"src="../docs/assets/showcases/editing/5_out.webp"></a><br><sub>Turn the image into an American comic style.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/8.webp"><imgwidth="180"alt="editing input 8"src="../examples/editing/data/images/8.webp"></a><ahref="../docs/assets/showcases/editing/8_out.webp"><imgwidth="180"alt="editing output 8"src="../docs/assets/showcases/editing/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> |
| <divalign="center"><ahref="../examples/editing/data/images/6.webp"><imgwidth="180"alt="editing input 6"src="../examples/editing/data/images/6.webp"></a><ahref="../docs/assets/showcases/editing/6_out.webp"><imgwidth="180"alt="editing output 6"src="../docs/assets/showcases/editing/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> | <divalign="center"><ahref="../examples/editing/data/images/7.webp"><imgwidth="180"alt="editing input 7"src="../examples/editing/data/images/7.webp"></a><ahref="../docs/assets/showcases/editing/7_out.webp"><imgwidth="180"alt="editing output 7"src="../docs/assets/showcases/editing/7_out.webp"></a><br><sub>Remove the person on the far right wearing a green skirt and a green top.</sub></div> |
1.<b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag's shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a 3D perspective view of a traffic cone. 2. <b>Instruction Understanding:</b> The instruction asks for a side view. 3. <b>Reasoning Process:</b> A side view of a standard traffic cone results in a triangular silhouette with a flat rectangular base. 4. <b>Expected Visual Changes:</b> The perspective is flattened into this 2D-like geometric profile. 5. <b>Elements to Preserve:</b> The cone's height and color should remain consistent with the original object. 6. <b>Explicit Edit Prompt:</b> Edit the perspective view into a flat side-profile silhouette of a triangle with a rectangular base, keeping the red color and proportions unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a kettle pouring water onto a mug, but the stream is misaligned and missing the cup. 2. <b>Instruction Understanding:</b> The instruction asks to fix the physical inconsistency. 3. <b>Reasoning Process:</b> The water stream must be redirected to connect the spout to the mug, maintaining the trajectory of gravity. 4. <b>Expected Visual Changes:</b> The water stream will be redirected to connect the spout to the mug. 5. <b>Elements to Preserve:</b> The kettle, mug, and background must remain unchanged while the water path is corrected. 6. <b>Explicit Edit Prompt:</b> Draw a continuous water stream connecting the kettle spout to the mug, keeping the kettle, mug, and background unchanged.</div></td>
1.<b>Source Image Analysis:</b> The source image shows a 2x2 matrix with values 1, 2, 3, and 4. 2. <b>Instruction Understanding:</b> The instruction asks to convert this to an upper triangular matrix. 3. <b>Reasoning Process:</b> By definition, an upper triangular matrix has zeros below the main diagonal, so the entry '3' must be changed to '0' while keeping '1', '2', and '4' as they are, and this modification satisfies the mathematical property requested. 4. <b>Expected Visual Changes:</b> The entry '3' in the lower-left position will be changed to '0'. 5. <b>Elements to Preserve:</b> The grid lines, the matrix structure, and the other entries must remain unchanged. 6. <b>Explicit Edit Prompt:</b> Change the '3' in the lower-left position to '0', while keeping the matrix structure and other entries unchanged.</div></td>
**SenseNova-U1-8B-MoT-Infographic** is built upon the base U1-8B-MoT model with an extended MT training phase. During both the MT and SFT stages, we optimized the data distribution between visual understanding and generation tasks. In the RL stage, we further refined the reward recipe to reduce the occurrence of unintended black backgrounds in generated infographics.
-**Benchmark Performance:** Compared with the base **SenseNova-U1-8B-MoT** model, BizGenEval hard/easy increased from **39.8 / 61.1** to **46.6 / 65.4** (**+6.8 / +4.3 points**), and IGenBench Q-ACC/I-ACC increased from **51.3 / 4.2** to **69.5 / 17.0** (**+18.2 / +12.8 points**), while maintaining robust visual understanding capabilities without substantial degradation.
-**Generation Quality:** The model produces complex infographics across 100+ styles and layouts, with improved visual aesthetics and text rendering — including dense small text such as arXiv-style pages.
<sub>IGenBench scores are reported as percentages. Models are ordered by the arithmetic mean of BizGenEval hard, BizGenEval easy, IGenBench Q-ACC, and IGenBench I-ACC within the commercial and open-source groups separately. OneIG is included as a general generation reference. Full per-category results are intended for the Hugging Face model card.</sub>
## Qualitative Comparison
We present a qualitative comparison between the base **SenseNova-U1-8B-MoT** and the fine-tuned **SenseNova-U1-8B-MoT-Infographic** model across five key dimensions: background stability, chart accuracy, text rendering accuracy, arXiv paper rendering quality, and overall visual aesthetics.
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "College Entrance Pathway Reforce Comparison" presents a structured comparison of key aspects for prospective students in Guangdong, China, aiming to enter college through a specialized entrance examination. The layout is organized as a multi-column table with four main columns: "Content Item / Evaluation Criteria", "Statistics", "Quotes", and "Key Terms". Each row corresponds to a distinct evaluation criterion or step in the preparation process, with visual icons, text, and data points enhancing clarity.<br><br>The infographic uses a clean, minimalist design with black line art icons on a light beige background. Text is primarily in bold sans-serif font, with headings emphasized for readability. Data is encoded using icons (e.g., graduation cap, calendar, books, target, rocket) to visually represent concepts, while numerical values are explicitly labeled for precision.<br><br>The first row addresses **Eligibility Criteria**:<br>- In the "Statistics" column, it features an icon of a person checking a map of Guangdong with the text: "Official Eligibility Requirements Confirm if you qualify to register".<br>- The "Quotes" column lists three eligible groups with corresponding icons: "Final-Year Guangdong Junior College Student", "Guangdong Resident <2YearsPostGraduation",and"EligibleRetiredMilitaryPersonnel".<br>- The "Key Terms" column shows a magnifying glass over a document with the label: "Eligibility Verification".<br><br>The second row covers **Exam Structure & Scoring Breakdown**:<br>- "Statistics" displays icons representing different test types and scores: 100 pts (graduation cap), 200 pts (person at desk), 1000 pts (document with pen), 150 pts (document with pen). Below: "Total 500 points across 4 test papers".<br>- "Quotes" lists four subject components in document-shaped boxes: "Political Theory (100 pts)", "Major-Aligned Public Subject (100 pts)", "Professional Subject 1 (150 pts)", "Professional Subject 2 (150 pts)".<br>- "Key Terms" includes a balance scale icon with "Score Distribution".<br><br>The third row details the **Official Annual Exam Timeline**:<br>- "Statistics" contains a horizontal timeline with icons of a calendar and clock, labeled "Annual Key Timeline".<br>- "Quotes" provides a detailed timeline: Jan: Registration Open → Jan: Admission Open → Mid-Mar: Exam Date → Mid-Apr: Score Release → May-Jun: Admission Offers.<br>- "Key Terms" shows a calendar and clock with "Critical Dates".<br><br>The next three rows outline a three-step preparation strategy:<br><br>**Step 1 - Confirm Target Major & Institution**:<br>- "Statistics": Icon of a person holding a map with a target, text: "Confirm your target 6 months in advance".<br>- "Quotes": Two bullet points: "Download official exam syllabi and past professional subject papers from the target institution’s admission portal" and "Cross-verify that your junior college major meets the target major’s prerequisite requirements".<br>- "Key Terms": Clock and books with "Target Selection".<br><br>**Step 2 - Public Subject Foundation Building**:<br>- "Statistics": Icon of a person studying with books and a coffee cup, text: "Complete 3 months of structured public subject study".<br>- "Quotes": Two bullet points: "Complete 5+ years of past public subject exam papers to identify recurring test points" and "Political Theory allocates 30% of total score to current affairs from the past calendar year".<br>- "Key Terms": Box with lightbulb and "Core Knowledge".<br><br>**Step 3 - Professional Subject Sprint Revision**:<br>- "Statistics": Icon of a running person with a book and clock, text: "Focus on high-weight professional subjects in the final 2 months".<br>- "Quotes": Two bullet points: "Practice past professional subject papers from your target institution and review core major textbooks" and "60% of professional subject questions are repeated or adapted from past 3 years of papers for most institutions".<br>- "Key Terms": Trophy and gears with "Intensive Review".<br><br>Red horizontal lines separate the first three criteria from the three-step strategy, while a blue line separates Step 1 from Steps 2 and 3, visually grouping related content. All textual information is preserved exactly as presented, including spelling variations like "Oficial" (likely intended as "Official"). The infographic serves as a strategic roadmap combining official requirements, scoring details, timelines, and actionable preparation steps for candidates.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "12-Month Market Performance: US vs. Asia" presents a structured, puzzle-piece-based visual analysis comparing the performance of US and Asian equity markets over a 12-month period. The layout is organized into three main steps, arranged in a central vertical flow with interconnected puzzle pieces, emphasizing a modular, analytical approach to market comparison. The design uses clean black-and-white line art with light blue accents for key sections, icons for visual representation, and clear typography for readability.<br><br>**Step 1** (top center) introduces the scope of the analysis. It features an illustration of four people examining charts, symbolizing data analysis. To the right, it defines the market indices being compared:<br>- **US Markets**: S&P 500, NASDAQ<br>- **Asian Markets**: Nikkei 225, Hang Seng, KOSPI, CSI 300<br><br>It also lists the types of data analyzed:<br>- Trailing Return (represented by a rising bar chart icon)<br>- Average Daily Volume (represented by a stacked bar chart icon)<br>- Top Sector Return (represented by a pie chart icon)<br><br>**Step 2** (left side, labeled "Metrics that account for 72% of short-term S&P 500 volatility") focuses on US Market Core Driving Indicators. This section contains icons representing industry (factory), finance (bank building), money (hand holding dollar sign), and labor (worker in hard hat). Below these icons, a light blue banner reads "US Market Core Driving Indicators". Specific metrics are listed with red warning triangle icons:<br>- CPI YoY: 3.2%<br>- Federal Funds Rate: 5.25–5.5%<br>- Non-farm Payrolls: +187k July 2024<br><br>**Step 3** (right side, labeled "Metrics that predict 68% of MSCI Asia Ex-Japan 3-month forward returns") focuses on Asian Market Core Leading Indicators. This section includes icons for shipping (container), manufacturing (gears), and calculation (calculator). A light blue banner below reads "Asian Market Core Leading Indicators". Specific metrics are listed:<br>- Manufacturing PMI: 51.2 (with red warning triangle)<br>- Q2 Export Growth: +6.8% YoY (with red warning triangle)<br>- Avg Policy Rate: 3.1% (with information circle icon)<br><br>At the bottom center, a large puzzle piece titled "Policy Shifts & Market Volatility Correlation" displays a line graph with two fluctuating lines:<br>- **US VIX (navy line)** — representing US market volatility<br>- **Asian Avg Volatility (green line)** — representing average Asian market volatility<br><br>Arrows connect the two lines, indicating correlation. Below the graph, key insights are provided with red warning triangles:<br>- Rate hike impact: +27% US VIX<br>- Trade policy impact: +34% Asian VIX<br>- Cross-regional sell-off correlation: 0.68<br><br>The overall structure visually represents how US and Asian market performances are driven by distinct but interrelated economic indicators, with a central focus on their volatility dynamics and policy impacts. The use of puzzle pieces metaphorically suggests that these components fit together to form a complete picture of global market trends. The infographic employs consistent iconography, color-coding (red for warnings, blue for core sections), and clear textual labeling to convey complex financial data in an accessible format.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create a vintage-styled ornithology infographic with a light beige background and dark tree silhouettes framing the left and right sides. At the top, center-aligned, place the main title 'Alter bringt Vorsicht' in a bold, serif font, followed by the subtitle 'Sekunden pro Minute für Suchen vs. Picken bei Staren verschiedener Altersgruppen'. The central visualization is a diverging horizontal bar chart featuring age groups vertically listed in the middle: 'Jungvogel', 'Einjähriger Jungvogel', 'Zweijähriger Altvogel', 'Ausgewachsener Altvogel', and 'Senior'. For each group, a Rust Red bar extending to the right represents 'Picken', while a Sepia Tone bar extending to the left represents 'Suchen'. Numerical labels are placed inside each bar near the outer edge. Below the chart, two x-axes ranging from 0 to 40 reflect the time in seconds for both directions. A legend in the bottom right corner displays square markers for 'Picken' (Rust Red) and 'Suchen' (Sepia Tone). The given data is : [{"age_group": "Jungvogel", "pecking": 20, "scanning": 10}, {"age_group": "Einjähriger Jungvogel", "pecking": 30, "scanning": 20}, {"age_group": "Zweijähriger Altvogel", "pecking": 35, "scanning": 25}, {"age_group": "Ausgewachsener Altvogel", "pecking": 28, "scanning": 30}, {"age_group": "Senior", "pecking": 20, "scanning": 35}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: right; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic with the title 'Ajustes de Velocidad del Sprint' and the subtitle 'Desviación de alcance del equipo backend: Trabajo planificado vs. Corrección de errores no planificados' centered at the top. The central visual is a vertical bar chart set against a light gray background, featuring a horizontal baseline at the zero-point mark. Positive values (planned work) are represented by green bars extending upwards, while negative values (unplanned bug fixes) are represented by yellow bars extending downwards. Each bar is topped or bottomed with multi-line text containing the module name and its value in 'pts'. Specifically, the bars are labeled 'Autenticación de usuario' (50 pts), 'API de pago' (35 pts), 'Sincronización de inventario' (22 pts), 'BD de analítica' (14 pts), 'Servicio de correo' (-10 pts), 'Webhooks' (-28 pts), 'Carga de multimedia' (-42 pts), and 'Pasarela heredada' (-70 pts). An illustration of mechanical gears is positioned in the upper-right area, and a cluster of small bug icons is placed in the lower-left area. The vertical axis shows tick labels ranging from +60 pts to -80 pts. The given data is : [{"category": "Autenticación de usuario", "unit": "pts", "value": 50}, {"category": "API de pago", "unit": "pts", "value": 35}, {"category": "Sincronización de inventario", "unit": "pts", "value": 22}, {"category": "BD de analítica", "unit": "pts", "value": 14}, {"category": "Servicio de correo", "unit": "pts", "value": -10}, {"category": "Webhooks", "unit": "pts", "value": -28}, {"category": "Carga de multimedia", "unit": "pts", "value": -42}, {"category": "Pasarela heredada", "unit": "pts", "value": -70}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that features a title and a subtitle centered at the top, reading 'Fastest Cuisines to Prepare' and 'Average Ghost Kitchen Handover Time by Item Type (Minutes)' respectively. The main visual is a horizontal grouped bar chart combining a Fast-food neon visual style with checkerboard borders along the edges, featuring a centered legend above the chart area for 'QuickEats' (cyan neon border) and 'DashNow' (orange neon border). To the bottom right of the bar chart, there is a simple illustration of two mopeds waiting for orders. The chart's vertical axis lists four categories, each preceded by a simple icon, while the horizontal axis represents handover time in minutes with numerical labels at 0, 5, 10, 15, and 20, supplemented by dotted vertical gridlines. Each category features a pair of black bars representing the two platforms, with exact values displayed directly inside the right end of each bar. For 'Classic Tacos', QuickEats takes 10.0 minutes while DashNow takes 11.5 minutes. 'Supreme Burritos' require the longest preparation, with 17.5 minutes for QuickEats and 19.0 minutes for DashNow. 'Spicy Nachos' take 9.5 minutes on QuickEats and 10.0 minutes on DashNow. Finally, 'Mini Quesadillas' are the fastest, taking 8.0 minutes for QuickEats and 8.5 minutes for DashNow. The given data is : [{"category": "Classic Tacos", "platform": "QuickEats", "unit": "Minutes", "value": 10.0}, {"category": "Classic Tacos", "platform": "DashNow", "unit": "Minutes", "value": 11.5}, {"category": "Supreme Burritos", "platform": "QuickEats", "unit": "Minutes", "value": 17.5}, {"category": "Supreme Burritos", "platform": "DashNow", "unit": "Minutes", "value": 19.0}, {"category": "Spicy Nachos", "platform": "QuickEats", "unit": "Minutes", "value": 9.5}, {"category": "Spicy Nachos", "platform": "DashNow", "unit": "Minutes", "value": 10.0}, {"category": "Mini Quesadillas", "platform": "QuickEats", "unit": "Minutes", "value": 8.0}, {"category": "Mini Quesadillas", "platform": "DashNow", "unit": "Minutes", "value": 8.5}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that presents a centered title at the top, stating "Übertaktet vs. Standard-Takt", with the subtitle "Temperaturanstieg bei langen Gaming-Sessions" directly below it. The main visual is a line chart spanning the width of the infographic on a dark background, embodying a Gamer Aesthetic with vibrant RGB neon accents. This chart has a vertical axis on the left labeled with numerical values in increments of 10 from 30°C to 100°C, and a horizontal axis at the bottom with time labels: '0m', '15m', '30m', '45m', '60m', '75m', '90m', '105m', and '120m'. Horizontal grid lines mark each 10°C increment. A horizontal legend is positioned under the subtitle, containing a cyan circular marker and line for "Standard-Takt" and a magenta circular marker and line for "Übertaktet (+150MHz)". Two data series are plotted as glowing neon lines with hollow circular markers at each data point, accompanied by gradient shading below each line. The cyan "Standard-Takt" line shows a steep rise from 38°C at 0m to 68°C at 15m, followed by a flat plateau reaching 73.5°C at 120m. The magenta "Übertaktet" line displays a similar initial spike from 42°C to 75°C, but continues with a gradual linear creep up to 93°C at 120m. Spike annotations (callout boxes) point to the final data points on the right, highlighting the peak temperatures: a magenta box reads "Peak: 93°C" and a cyan box reads "Peak: 73.5°C". A stylized thermometer line-art icon is subtly placed in the center of the chart's background. The given data is : [{"profile": "Standard-Takt", "temperature": 38, "time": "0m"}, {"profile": "Übertaktet", "temperature": 42, "time": "0m"}, {"profile": "Standard-Takt", "temperature": 68, "time": "15m"}, {"profile": "Übertaktet", "temperature": 75, "time": "15m"}, {"profile": "Standard-Takt", "temperature": 71, "time": "30m"}, {"profile": "Übertaktet", "temperature": 79, "time": "30m"}, {"profile": "Standard-Takt", "temperature": 72, "time": "45m"}, {"profile": "Übertaktet", "temperature": 82, "time": "45m"}, {"profile": "Standard-Takt", "temperature": 72.5, "time": "60m"}, {"profile": "Übertaktet", "temperature": 85, "time": "60m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "75m"}, {"profile": "Übertaktet", "temperature": 87, "time": "75m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "90m"}, {"profile": "Übertaktet", "temperature": 89, "time": "90m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "105m"}, {"profile": "Übertaktet", "temperature": 91, "time": "105m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "120m"}, {"profile": "Übertaktet", "temperature": 93, "time": "120m"}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that displays data in a vertical diverging bar chart format. At the top left of the visualization, there is a title: 'Anomalie de l'Atlantique Sud : Dérive magnétique', and a subtitle: 'Vecteurs de dérive vers l'est et l'ouest en kilomètres par rapport à la ligne de base historique'. In the upper left area below the text, an icon of a compass rose is placed within a magnetic field line curve. The main chart features a horizontal zero-axis line, labeled with a '0' on the far left, representing the historical coordinate baseline. The x-axis at the bottom displays the decades '1980', '1990', '2000', '2010', and '2020', each marked with a small vertical tick. For each decade, a vertical bar extends from the zero-axis, with its corresponding data label positioned directly at the end of the bar. The data shows westward drift represented by blue bars extending below the axis for '1980' with a value of '-15 km' and '1990' with a value of '-32 km'. Eastward drift is represented by red bars extending above the axis for '2000' with a value of '+10 km', '2010' with a value of '+45 km', and '2020' with a value of '+68 km'. The overall visual style mimics a geophysical science journal, utilizing compass red and blue color tones. The given data is : [{"decade": "1980", "drift_km": -15}, {"decade": "1990", "drift_km": -32}, {"decade": "2000", "drift_km": 10}, {"decade": "2010", "drift_km": 45}, {"decade": "2020", "drift_km": 68}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic in a corporate report minimalism style with muted corporate grays and blues, featuring a large title, 'Seasonal Fluctuations in 15-Year Mortgages', at the top. Directly below it is a subtitle, 'Historical prepayment velocities showing seasonal housing market trends'. Underneath the subtitle, a horizontal legend identifies two categories with small square icons: 'Spring/Summer Originations' in lighter gray-blue and 'Fall/Winter Originations' in darker gray-blue. The main visual is a multi-line chart in a wide landscape orientation. The vertical axis has numeric labels at 0.0, 5.0, 10.0, 15.0, and 20.0, with horizontal grid lines extending across the plot. The horizontal axis features labels: 'Jan 2018', 'Apr', 'Jul', 'Oct', 'Jan 2019', 'Apr', and 'Jul'. An icon depicting a sleek house silhouette is positioned in the upper left corner of the chart's plotting area. Two distinct lines represent the categories, characterized by cyclical seasonal bumps in the summer months. Both lines have square markers at each data point, with numerical values displayed near them. The lighter line for 'Spring/Summer Originations' plots a value of 8.0 in Jan 2018, rising to 12.5 in Apr, peaking at 16.0 in Jul, dipping to 11.0 in Oct, dropping further to 7.5 in Jan 2019, climbing to 13.0 in Apr, and reaching 17.5 in Jul. The darker line for 'Fall/Winter Originations' mirrors this pattern, starting at 6.5 in Jan 2018, increasing to 9.0 in Apr, hitting 14.5 in Jul, falling to 10.0 in Oct, bottoming out at 6.0 in Jan 2019, rising to 10.5 in Apr, and ending at 15.0 in Jul. The given data is : [{"category": "Spring/Summer Originations", "date": "2018-01", "value": 8.0}, {"category": "Fall/Winter Originations", "date": "2018-01", "value": 6.5}, {"category": "Spring/Summer Originations", "date": "2018-04", "value": 12.5}, {"category": "Fall/Winter Originations", "date": "2018-04", "value": 9.0}, {"category": "Spring/Summer Originations", "date": "2018-07", "value": 16.0}, {"category": "Fall/Winter Originations", "date": "2018-07", "value": 14.5}, {"category": "Spring/Summer Originations", "date": "2018-10", "value": 11.0}, {"category": "Fall/Winter Originations", "date": "2018-10", "value": 10.0}, {"category": "Spring/Summer Originations", "date": "2019-01", "value": 7.5}, {"category": "Fall/Winter Originations", "date": "2019-01", "value": 6.0}, {"category": "Spring/Summer Originations", "date": "2019-04", "value": 13.0}, {"category": "Fall/Winter Originations", "date": "2019-04", "value": 10.5}, {"category": "Spring/Summer Originations", "date": "2019-07", "value": 17.5}, {"category": "Fall/Winter Originations", "date": "2019-07", "value": 15.0}]</div></details></td>
</tr>
</table>
### Text Rendering Accuracy and Size Appropriateness
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents a comprehensive architectural and structural analysis of the Temple of Kom Ombo, an ancient Egyptian temple located on the west bank of the Nile River. The title "TEMPLE OF KOM OMBO" is prominently displayed in a hand-drawn, white-bordered box in the lower-right corner of the image, set against a brown background that mimics sandstone or earth tones. The overall layout is divided into multiple sections: a central photographic image of the temple ruins under a clear blue sky, surrounded by illustrative technical diagrams, annotated floor plans, and textual data blocks, all rendered in white line art and text for high contrast.<br><br>The central photograph shows the main hypostyle hall and surrounding structures of the temple, with visitors walking among the columns and courtyards, providing a sense of scale. In the background, the Nile River and palm trees are visible, situating the temple in its natural environment. The ruins are constructed from light-colored sandstone blocks, consistent with the material noted in the text.<br><br>In the upper-left quadrant, a 3D axonometric diagram illustrates the overall dimensions of the temple complex: approximately 62 meters by 51 meters, labeled along the axes. Adjacent to this, a list of key structural facts is presented in bullet points:<br>- TEMPLE AXIS: DOUBLE SANCTUARY FOR SOBEK & HORUS<br>- OVERALL DIMENSIONS (APPROX. 62M x 51M)<br>- CONSTRUCTION MATERIAL: SANDSTONE BLOCKS<br>- COLUMN HEIGHTS: UP TO 12 METERS<br><br>Above the central photo, two schematic diagrams illustrate architectural details:<br>- A top-down view of the hypostyle hall showing 30 columns arranged in a grid, labeled “HYPOSTYLE HALL (30 COLUMNS)” and pointing to “TWO SANCTUARIES.”<br>- A cross-section labeled “PYLON AND HYPOSTYLE SECTION,” which includes a detailed vertical cutaway showing the roofing system supported by columns, with arrows indicating load paths down to foundations.<br><br>To the right of the central image, text notes “TWO ENTRANCES SYMBOLIZING DUALITY,” emphasizing the temple’s unique dual dedication. This concept is reinforced in the lower section of the infographic, where a detailed floor plan is overlaid on the brown ground area.<br><br>The floor plan, drawn in white lines, is annotated with various features:<br>- INNER TEMPLE (FOR SOBEK) — marked with a rectangular inner sanctum.<br>- INNER TEMPLE (FOR HAROERIS) — another distinct inner sanctum, indicating the dual religious function.<br>- NILOMETER — a structure used to measure the Nile’s water level.<br>- BIRTH HOUSE (MAMMISHI) — a smaller chamber associated with fertility rituals.<br>- MUMMIFIED CROCODILE MUSEUM SITE — indicating a location within the temple complex for sacred crocodile mummies.<br>- TWO ENTRANCES SYMBOLIZING DUALITY — shown as two separate entryways on the plan.<br><br>Surrounding the floor plan are inset images of relief carvings, each labeled:<br>- MEDICAL INSTRUMENT RELIEFS — depicting figures with tools.<br>- TWO ENTRANCES RELIEFS — showing doorways flanked by deities.<br>- CALENDAR RELIEFS — illustrating scenes related to timekeeping or agricultural cycles.<br><br>Additional annotations point to structural aspects:<br>- “STRUCTURAL LOAD PATHS FROM COLUMNS TO FOUNDATIONS” — illustrated with curved arrows tracing the force transfer from columns through the walls to the ground.<br>- The pylon and hypostyle section diagram also labels “ROOFING SYSTEM” and shows how the roof beams rest on column capitals.<br><br>All textual content is in English, using a clean, sans-serif font that enhances readability. The visual style blends real photography with technical illustrations and hand-drawn elements, creating an educational and engaging format suitable for tourists, students, or archaeologists. The infographic effectively communicates both the physical characteristics and symbolic significance of the Temple of Kom Ombo, highlighting its duality, engineering, and cultural importance.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is laid out with two tables at the top, followed by a two-column text layout. The tables span the full width of the text area. The text includes a section heading.<br><br>[paragraphs]<br><br>the TOPIC MODELER, the GENDER SEGMENTER, and an OTHER module (transcript length and duration). We test for a linear relationship between each pair of variables: $H_O : r = 0$, $H_A : r \neq 0$, where $H_O$ is the origi-nal hypothesis, $H_A$ is the alternate hypothesis, and $r$ is the Pearson’s correlation coefficient. We follow Reddy et al. (2021) and Yang et al. (2019) and apply a Bonferroni cor-rection to our $\alpha$ value of $0.05$, setting $\alpha = 0.05/z$, where $z = \binom{124}{2} = 7,626$ for LDA, representing the number of feature relationships we consider. Hence, we reject $H_O$ in favor of $H_A$ if $p \leq \alpha$. Given the largeness of $z$, our $\alpha$ value becomes small, making our criteria for significance strict and thus suitable for investigating our research ques-tions. Furthermore, we filter our correlations $r$, such that $\Vert r\Vert > 0.1$ for our LDA experiments, and $\Vert r\Vert > 0.05$ for our BERTopic experiments (due to the smaller sample size of 10,000 podcasts, and fewer samples may have higher vari-ance). Our results focus on a selection of these significant correlations; the full results are available on the project web-site: https://www.gendered-discourse.net/extended-results.<br><br>### RQ0: How Are Women and Men’s Discourse Different?<br><br>Using GDCF, our Gendered Discourse Correlation Frame-work shown in Figure 2, we then analyze significant corre-lations between between the gender features from the GEN-DER SEGMENTER module (Doukhan et al. 2018a), and the topic features from the TOPIC MODELER module (Blei, Ng, and Jordan 2003). We use the discourse topics to automati-cally form gendered discourse word lists via their significant correlations.<br><br>Starting with the first row of Table 1, we see that Topic 3’s word list returned by LDA with Non-Contextual Embed-dings (Bag-Of-Words) (via the TOPIC MODELER module) contains the words women, woman, men, baby, pregnant, girls, men, doctor, health, birth (in descending weighted or-der). Based on this word list, we manually interpret this topic as being a content topic, specifically about pregnancy, as noted in the column “Topic N Categories.” Then, we look to the gender correlations in the columns “Gender” and “$r$,” and see that $r(\text{Topic 3, Women}) = +0.15$ and $r(\text{Topic 3, Men}) = -0.14$. This indicates that the topic of pregnancy positively correlates with women (identified via the GENDER SEGMENTER module), and negatively corre-lates with men. Therefore, we associate Topic 3 (Content - Pregnancy) with Women, as noted in the “Topic N Gender” column. Similarly, we make these associations in the “Topic N Gender” column for Topics 10, 49, and 71.<br><br>Next, we focus on the Topic 54 row. This topic is inter-preted using the word list get, like, know, right, people, go-ing, podcast, make, want, one. This word list does not refer to any content, hence, we manually interpret this topic as being a discourse topic. Moving to the gender correlations, we see that $r(\text{Topic 54, Women}) = \emptyset$ and $r(\text{Topic 3, Men}) = +0.12$. The reason for $r(\text{Topic 54, Women}) = \emptyset$ is because the correlation between the features Topic 54 and Women did not come back as significant. However, due to the positive correlation of $0.12$ for Topic 3 and Men, we manually as-sociate Topic 3 with Men in the “Topic N Gender” column.<br><br>[tables]<br><br>Table 1: LDA with Non-Contextual Embeddings (Bag-Of-Words): The complete set of significant correlations between gender features and topic features – both content topics and discourse topics. Based on $r$, the Topic N Gender forms the gendered (discourse) word lists via Topics 54 and 60 (the masculine word lists) and Topic 62 (the feminine word list).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 3 | Women <br> Men | 0.15 <br> -0.14 | women, woman, men, baby, pregnant, girls, men, doctor, health, birth | Content - Pregnancy | Women |<br>| Topic 10 | Women <br> Men | 0.10 <br> -0.12 | energy, body, feel, mind, space, yoga, love, beautiful, feeling, meditation | Content - Yoga | Women |<br>| Topic 49 | Women <br> Men | -0.21 <br> 0.17 | game, know, think, team, going, mean, play, year, one, good | Content - Sports | Men |<br>| Topic 71 | Women <br> Men | 0.14 <br> -0.14 | christmas, sex, girl, hair, love, get, date, girls, let, wear | Content - Dating | Women |<br>| Topic 54 | Women <br> Men | – <br> 0.12 | get, like, know, right, people, going, podcast, make, want, one | Discourse | Men |<br>| Topic 60 | Women <br> Men | -0.27 <br> 0.20 | going, know, think, get, got, one, really, good, well, yeah | Discourse | Men |<br>| Topic 62 | Women <br> Men | 0.33 <br> -0.28 | like, know, really, going, people, want, think, get, things, life | Discourse | Women |<br><br>Table 2: BERTopic with Contextual Embeddings (BERT, ChatGPT, Llama): The complete set of significant correlations between gender features and topic features for discourse topics only (content topics are omitted).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 0 | Women <br> Men | -0.08 <br> 0.10 | like, yeah, know, oh, right, podcast, got, going, think, really | Discourse | Men |<br>| Topic 2 | Women <br> Men | 0.08 <br> -0.08 | life, know, things, really, people, feel, like, want, love, going | Discourse | Women |<br>| Topic 5 | Women <br> Men | 0.08 <br> – | like, know, think, yeah, episode, really, going, anchor, kind, right | Discourse | Women |</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout with a single column. The text is justified and divided into sections and subsections, indicated by numbered headings. Important terms at the beginning of some paragraphs are bolded. A horizontal rule separates the header from the main content, and another rule separates the main content from the footnote at the bottom.<br><br>[paragraphs]<br><br>Preprint Version.<br><br>**Figure–Table Integration.** In addition to textual refinement, we extend the refinement process to include multimodal elements, to further enhance readability. For each section, the model first generates visualization requirements, such as tables with structured comparisons or figures with explanatory diagrams, together with natural language descriptions. Based on these descriptions, candidate figures and tables are synthesized. The compiled outputs are then fed back to an LLM for quality assessment, enabling automatic detection of issues such as oversized layouts or unreadable text. The LLM provides corrective suggestions, which are applied to improve the final visualizations. Finally, the text is refined again to ensure that all generated figures and tables are properly referenced within the survey.<br><br># 4 EXPERIMENTS<br><br>## 4.1 EXPERIMENTAL SETTINGS<br><br>**Implementation Details.** Following Wang et al. (2024b), we adopt **GPT-4o-mini** as our genera-tion model for its balance of responsiveness and cost. Our retrieval database contains 680K computer science papers from arXiv, with PDFs converted into structured Markdown using MinerU (Wang et al., 2024a) for consistent formatting. The details of the retrieval process are provided in App. A.1. In outline generation, the system consults 1000–1200 papers, with a maximum of 8 sections. For section drafting, each subsection retrieves up to 60 additional relevant papers, combined with those linked during outline generation. Finally, we apply two iterations of the review-and-refine loop to enhance coherence across sections and improve overall readability. Illustrative outputs compared with AutoSurvey are provided in App. A.8.<br><br>**Baselines.** We compare IterSurvey with a set of baselines, ranging from simple retrieval-augmented generation (Naive RAG), which directly drafts from retrieved documents, to more ad-vanced state-of-the-art systems. Specifically, we evaluate against AutoSurvey (Wang et al., 2024b), the first systematic framework for this task; SurveyForge (Yan et al., 2025), which combines heuris-tic outline generation based on the logical structures of human-written surveys with a memory-driven scholar navigation agent for high-quality retrieval; and SurveyGo (Wang et al., 2025), which em-ploys the LLM×MapReduce-V2 algorithm to address the long-context challenge. We also compare with SurveyX (Liang et al., 2025), which introduces an Attribute Tree-based outlining mechanism; however, due to access restrictions, we include SurveyX only in arena experiments. All methods are evaluated on the same retrieval database with generation hyperparameters aligned to their original settings for fairness.<br><br>## 4.2 AUTOMATIC EVALUATION RESULTS<br><br>**Evaluation Setup.** We employ multiple complementary protocols to evaluate the quality of gen-erated surveys. On the 20-topic suite from Wang et al. (2024b), we adopt multi-dimensional scoring with LLM-as-a-judge. Content quality is assessed along three dimensions: coverage, structure, and relevance followed from Wang et al. (2024b). Besides, citation quality is evaluated using the NLI-based protocol of Gao et al. (2023), reporting both recall and precision: _Citation Recall_ measures whether all statements in the generated text are fully supported by the cited passages, while _Citation Precision_ identifies irrelevant citations to ensure that references are pertinent and directly support the claims. To improve scoring stability and reliability, prompts are standardized and judges must pro-vide a rationale before assigning scores. For additional robustness, we aggregate outputs from three judge models: GPT-4o, Claude-3.5-Haiku, and GLM-4.5V.1 Full prompts are provided in App. A.7.<br><br>**Results.** The results on the 20 topics from Wang et al. (2024b) are reported in Tab. 1. Statistical significance was confirmed via paired t-tests, indicating that IterSurvey consistently outperforms baseline models ($p <0.05$).Wesummarizethemainobservationsbelow.<br><br>- **Overall superiority.** IterSurvey consistently outperforms all baselines across both content and citation quality, achieving the highest overall average score (4.75). This demonstrates that the proposed framework is effective and robust across multiple evaluation dimensions.<br><br>[page_number]<br><br>6<br><br>[footnotes]<br><br>1Specifically, we use `chatgpt-4o-latest`, `claude-3-5-haiku-20241022`, and `glm-4.5v`.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>This is a single-column page containing mostly text, structured with section headings and bold inline subheadings. URLs are formatted in a monospaced font and hyperlinked.<br><br>[paragraphs]<br><br># A Image generation models<br><br>This section details the two diffusion image generation models used in this work, namely Stable Diffusion 1.4 and 1.5.<br><br>**Stable Diffusion 1.4** The Stable Diffusion model is a text-conditioned image generator model that combines an autoencoder with a diffusion model to create a latent diffusion model. The autoencoder encodes images into latent representations with a reduced dimensionality when compared to the input image, reducing the computational needs during the training phase. Text prompts, on the other hand, are encoded using a text encoder and are then cross-attended by the UNet backbone of the latent diffusion model. Finally, the loss is computed using a reconstruction objective between the noise added to the latent representation and the prediction made by the UNet.<br>Stable Diffusion 1.4 (https://huggingface.co/CompVis/stable-diffusion-v1-4) had several rounds of training on the LAION dataset (https://laion.ai/), with each round changing the input image dimension, aesthetic score, and the probability of dropping the text-conditioning to improve classifier-free guidance.<br><br>**Stable Diffusion 1.5** SD 1.5, in turn, has the same architecture and even the same starting point as 1.4, with the difference being how long the model was fine-tuned on top of SD 1.2. The 1.4 version is fine-tuned for 225 thousand steps at resolution 512x512 on “laion-aesthetics v2 5+” with a 10% probability of dropping the text-conditioning, and version 1.5 for 595 thousand steps.<br>As demonstrated in Section D Stable Diffusion 1.4 has better performance than 1.5 in our approach, therefore, we will adopt SD 1.4 for most of the experiments in this paper.<br><br># B Large language models<br><br>Here we give additional details on the large language models that we used in our experiments.<br><br>**Gemma** (Mesnard et al., 2024), trained on a diverse 6 Trillion token dataset comprising web documents, code and mathematical texts. We resorted to the 7 Billion parameter instruction-tuned decoder-only model, named _gemma-7b-it_ (https://huggingface.co/google/gemma-7b-it). This model uses a chat template, which we employ during inference.<br><br>**Llama 2** (Touvron et al., 2023), of which we used the 7 Billion parameter, pre-trained-only model, _Llama-2-7b_ (https://huggingface.co/meta-llama/Llama-2-7b-hf). This model was trained with a mix of publicly available data totalling 2 Trillion tokens. While its chat versions employ supervised fine-tuning and reinforcement learning with human feedback for alignment with human preferences in helpfulness and safety, the pre-trained-only model does not. This results in a less constrained model, but it may also cause it to disperse from the task at hand. Since this model is a pre-trained-only no chat template is needed.<br><br>**Mistral** (Jiang et al., 2023) fine-tuned on various HuggingFace instruction datasets. We resorted to the 7 Billion _Mistral-7B-Instruct-v0.2_ model (https://huggingface.co/mistralai/ Mistral-7B-Instruct-v0.2) and used the respective chat template during inference.<br><br>**Phi-2** (Gunasekar et al., 2023) is a compact 2.7 Billion model (https://huggingface.co/microsoft/ phi-2). Despite its size, it offers a competitive performance with respect to models several times its size. It was trained on 250 Billion tokens, obtained through a combination of NLP synthetic data created by GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by GPT-4. This model was not fine-tuned through reinforcement learning from human feedback and does not have guardrails.<br><br>**Model ranking**<br>A ranking of these models in terms of their performance can be found in the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) which assesses several LLMs that are trained under the same criteria and tested on the same benchmarks, including reasoning</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout, likely from a preprint server like arXiv. It features a title, author list with affiliations, an abstract, and the beginning of the "Introduction" section. A preprint notification ("Preprint. Under review.") is present at the bottom left. The text on the left margin ("arXiv:2502.01522v2 [cs.CV] 30 May 2025") is a vertical stamp typical of arXiv submissions.<br><br>[paragraphs]<br><br>arXiv:2502.01522v2 [cs.CV] 30 May 2025<br><br># Unpaired Deblurring via Decoupled Diffusion Model<br><br>**Junhao Cheng**$^1$, **Wei-Ting Chen**$^2$, **Xi Lu**$^1$, **Ming-Hsuan Yang**$^3$<br>$^1$Sun Yat-sen University $^2$ Microsoft $^3$ University of California, Merced<br>https://github.com/donahowe/UID-Diff<br><br>**Abstract**<br><br>Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. In favor of their ability to supplement missing details and generate aesthetically pleasing contents, recent works have applied them to image deblurring via training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose UID-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as structural features and blur patterns extractors separately. The features extracted by them will be used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task by leveraging unpaired blurred images from the target domain simultaneously. We further introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization capabilities of UID-Diff when encountering unknown blur patterns. Experiments on real-world datasets demonstrate that UID-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation in various challenging scenarios.<br><br># 1 Introduction<br><br>Dynamic blur occurs when the camera and subject move relative to each other during the exposure time, resulting in a smeared and blurred image. Deblurring, the process of removing the blur pattern while preserving the underlying structure of degraded images, is essential for restoring high-quality images for human perception and low-level computer vision applications.<br><br>With the rapid advancement of photographic technology, a wide range of imaging devices are now employed to capture images in real-world scenarios. Due to their diverse lenses and structural designs, these devices may produce distinct blur patterns [1, 2, 3]. This diversity makes it challenging to develop an all-in-one method for deblurring images from arbitrary and varied sources. Consequently, focusing on deblurring algorithms tailored to specific domains has become increasingly significant.<br><br>As deep learning has advanced in recent years, existing deblurring models predominantly build on data-driven approaches that employ neural networks trained via supervised learning on synthetic paired data. Existing works have made efforts to develop deblurring models upon CNN [4, 5], Transformer [6, 7], and GAN [8, 9]. Recently, a new wave of research [10, 11, 12] has begun to investigate the integration of pre-trained generative diffusion models [13], such as Stable Diffusion (SD) [14], with an adapter designed to provide structural guidance for deblurring. These approaches aim to harness the generative capabilities of diffusion models to supplement missing details and generate aesthetically pleasing outputs. However, since paired blurry-sharp training data is limited in<br><br>[footnotes]<br><br>Preprint. Under review.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents an augmented reality (AR) shopping experience overlaid on a real-world retail environment. The scene is set in a brightly lit cosmetics aisle of a store, with shelves stocked with beauty products visible in the background. In the foreground, a pair of hands holds a black rectangular compact labeled "ANASTASIA BEVERLY HILLS BROW POWDER DUO" with "EBONY" and "NET WT. 2.5 OZ." printed below. A gold ring is visible on the left hand’s ring finger, and a black wristband is partially seen on the left wrist.<br><br>Overlaid on the image are several semi-transparent, rounded-corner UI elements resembling AR pop-ups or digital cards, providing contextual information about the product and the user’s shopping list.<br><br>On the left side, a vertical panel titled "SHOPPING LIST" lists four items:<br>1. Face Wash — marked with an “X” (completed)<br>2. Shampoo — marked with an “X” (completed)<br>2. Eye Cream — marked with an empty checkbox (not completed; duplicated item number)<br>3. Eye Cream — marked with an empty checkbox (not completed)<br><br>This suggests a possible error or duplication in the list, with two entries for "Eye Cream".<br><br>In the center-right, a speech-bubble-shaped label displays the price: "$23.00".<br><br>To the right of the product, a larger panel titled "PRODUCT DETAILS:" provides information about the "ABH Brow Powder Duo". It features two color swatches:<br>- Left swatch: labeled "DEEP BROWN"<br>- Right swatch: labeled "BLACK"<br><br>Below the swatches, a star rating system shows four and a half filled stars, accompanied by the text "4.5 out of 5 stars".<br><br>Underneath the rating, a section titled "COMMON USES:" states: "DEFINES & FILLS BROWS".<br><br>Further down, a smaller rectangular box labeled "KEY INGREDIENTS" lists:<br>- Vitamin E<br>- Finely Milled Pigments<br><br>At the bottom right, another box titled "APPLICATION TIPS" includes a video icon (a rectangle with a play triangle) and the word "Video", indicating a multimedia tutorial is available.<br><br>The overall layout mimics an immersive AR interface, likely from a smart glasses or smartphone application, designed to enhance in-store shopping by providing instant, interactive product data directly within the user’s field of view. The visual style uses dark gray, translucent backgrounds with white text for high contrast and readability against the busy store backdrop. The design emphasizes usability, with clear categorization of information into distinct panels and intuitive icons. All textual content is in English, and no other languages are present.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "Personal Insight & Casual Life Reflection Essay Collections: A Complete Guide" is presented in a tactile, clay-like 3D style with rounded edges and textured surfaces, giving it a handcrafted, approachable aesthetic. The background is a neutral light gray, which makes the colorful clay elements stand out. The title is displayed at the top in two lines: the first line "Personal Insight & Casual Life Reflection" is in a warm terracotta brown, while the second line "Essay Collections: A Complete Guide" uses a bright sky blue for "A Complete Guide," creating visual emphasis.<br><br>The infographic is organized as a grid with three main vertical sections (rows) and seven horizontal columns. The leftmost column contains the main categories, each labeled with a large blue clay tile with yellow text. These are:<br><br>1. **Core Traits of the Genre** – with a smaller brown clay subtitle: "Key traits of the personal insight & life reflection genre"<br>2. **Value for Readers & Writers** – with a smaller brown clay subtitle: "Why These Collections Matter"<br>3. **Actionable Engagement Steps** – with a smaller brown clay subtitle: "Simple steps for new readers and writers"<br><br>The remaining six columns form the header row, each represented by a colored clay tile with white text:<br><br>- **Infographic Multi-factor** (blue)<br>- **Statistics** (teal)<br>- **Quotes** (orange)<br>- **Key Terms** (green)<br>- **Style Preferences** (purple)<br>- **Layout Preferences** (light blue)<br>- **Other Requirements** (dark purple)<br><br>Each of the three main rows then has corresponding content under these headers. For the first two data points under each category (Statistics, Quotes, Key Terms), there is a white clay tile with a brown checkmark (√), indicating inclusion or relevance. For the last three headers (Style Preferences, Layout Preferences, Other Requirements), there is a white tile with a number "3" in the respective color of that column’s header.<br><br>Below each checkmark or number, there is a larger white clay tile with black text describing the specific point. The full structure is as follows:<br><br>---<br><br>### Row 1: Core Traits of the Genre<br><br>- **Statistics**: √ → *Rooted in real lived experience*<br>- **Quotes**: √ → *Inspired by small everyday moments*<br>- **Key Terms**: √ → *No formal structure rules*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 2: Value for Readers & Writers<br><br>- **Statistics**: √ → *Readers: Comfort & relatable validation*<br>- **Quotes**: √ → *Writers: Low-stakes creative outlet*<br>- **Key Terms**: √ → *All audiences: Record of personal growth*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 3: Actionable Engagement Steps<br><br>- **Statistics**: √ → *Step 1: Start with short essays to read*<br>- **Quotes**: √ → *Step 2: Write 100-word daily reflections*<br>- **Key Terms**: √ → *Step 3: Group by theme to build your collection*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (green) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>The design uses consistent visual encoding: checkmarks indicate presence or applicability, while the number "3" likely denotes a level of preference or importance (though no legend is provided). The colors are used both for column headers and for the numbers, creating a visual link between the header and its corresponding data point. All text is rendered in a clean, sans-serif font, and the overall layout is structured and easy to follow, with clear separation between rows and columns via subtle clay dividers. The infographic communicates a comprehensive yet accessible guide to engaging with personal insight and life reflection essays through a visually engaging and tactile design.</div></details></td>
</tr>
</table>
## Showcases (Applications)
> ✨ **Curious to see the model in action? Explore 100 generated examples in our full 👉 [ 🖼️ Infographic Model Showcases ](./u1_infographic_showcases.md) 👈 gallery!**
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "College Entrance Pathway Reforce Comparison" presents a structured comparison of key aspects for prospective students in Guangdong, China, aiming to enter college through a specialized entrance examination. The layout is organized as a multi-column table with four main columns: "Content Item / Evaluation Criteria", "Statistics", "Quotes", and "Key Terms". Each row corresponds to a distinct evaluation criterion or step in the preparation process, with visual icons, text, and data points enhancing clarity.<br><br>The infographic uses a clean, minimalist design with black line art icons on a light beige background. Text is primarily in bold sans-serif font, with headings emphasized for readability. Data is encoded using icons (e.g., graduation cap, calendar, books, target, rocket) to visually represent concepts, while numerical values are explicitly labeled for precision.<br><br>The first row addresses **Eligibility Criteria**:<br>- In the "Statistics" column, it features an icon of a person checking a map of Guangdong with the text: "Official Eligibility Requirements Confirm if you qualify to register".<br>- The "Quotes" column lists three eligible groups with corresponding icons: "Final-Year Guangdong Junior College Student", "Guangdong Resident <2YearsPostGraduation",and"EligibleRetiredMilitaryPersonnel".<br>- The "Key Terms" column shows a magnifying glass over a document with the label: "Eligibility Verification".<br><br>The second row covers **Exam Structure & Scoring Breakdown**:<br>- "Statistics" displays icons representing different test types and scores: 100 pts (graduation cap), 200 pts (person at desk), 1000 pts (document with pen), 150 pts (document with pen). Below: "Total 500 points across 4 test papers".<br>- "Quotes" lists four subject components in document-shaped boxes: "Political Theory (100 pts)", "Major-Aligned Public Subject (100 pts)", "Professional Subject 1 (150 pts)", "Professional Subject 2 (150 pts)".<br>- "Key Terms" includes a balance scale icon with "Score Distribution".<br><br>The third row details the **Official Annual Exam Timeline**:<br>- "Statistics" contains a horizontal timeline with icons of a calendar and clock, labeled "Annual Key Timeline".<br>- "Quotes" provides a detailed timeline: Jan: Registration Open → Jan: Admission Open → Mid-Mar: Exam Date → Mid-Apr: Score Release → May-Jun: Admission Offers.<br>- "Key Terms" shows a calendar and clock with "Critical Dates".<br><br>The next three rows outline a three-step preparation strategy:<br><br>**Step 1 - Confirm Target Major & Institution**:<br>- "Statistics": Icon of a person holding a map with a target, text: "Confirm your target 6 months in advance".<br>- "Quotes": Two bullet points: "Download official exam syllabi and past professional subject papers from the target institution’s admission portal" and "Cross-verify that your junior college major meets the target major’s prerequisite requirements".<br>- "Key Terms": Clock and books with "Target Selection".<br><br>**Step 2 - Public Subject Foundation Building**:<br>- "Statistics": Icon of a person studying with books and a coffee cup, text: "Complete 3 months of structured public subject study".<br>- "Quotes": Two bullet points: "Complete 5+ years of past public subject exam papers to identify recurring test points" and "Political Theory allocates 30% of total score to current affairs from the past calendar year".<br>- "Key Terms": Box with lightbulb and "Core Knowledge".<br><br>**Step 3 - Professional Subject Sprint Revision**:<br>- "Statistics": Icon of a running person with a book and clock, text: "Focus on high-weight professional subjects in the final 2 months".<br>- "Quotes": Two bullet points: "Practice past professional subject papers from your target institution and review core major textbooks" and "60% of professional subject questions are repeated or adapted from past 3 years of papers for most institutions".<br>- "Key Terms": Trophy and gears with "Intensive Review".<br><br>Red horizontal lines separate the first three criteria from the three-step strategy, while a blue line separates Step 1 from Steps 2 and 3, visually grouping related content. All textual information is preserved exactly as presented, including spelling variations like "Oficial" (likely intended as "Official"). The infographic serves as a strategic roadmap combining official requirements, scoring details, timelines, and actionable preparation steps for candidates.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "12-Month Market Performance: US vs. Asia" presents a structured, puzzle-piece-based visual analysis comparing the performance of US and Asian equity markets over a 12-month period. The layout is organized into three main steps, arranged in a central vertical flow with interconnected puzzle pieces, emphasizing a modular, analytical approach to market comparison. The design uses clean black-and-white line art with light blue accents for key sections, icons for visual representation, and clear typography for readability.<br><br>**Step 1** (top center) introduces the scope of the analysis. It features an illustration of four people examining charts, symbolizing data analysis. To the right, it defines the market indices being compared:<br>- **US Markets**: S&P 500, NASDAQ<br>- **Asian Markets**: Nikkei 225, Hang Seng, KOSPI, CSI 300<br><br>It also lists the types of data analyzed:<br>- Trailing Return (represented by a rising bar chart icon)<br>- Average Daily Volume (represented by a stacked bar chart icon)<br>- Top Sector Return (represented by a pie chart icon)<br><br>**Step 2** (left side, labeled "Metrics that account for 72% of short-term S&P 500 volatility") focuses on US Market Core Driving Indicators. This section contains icons representing industry (factory), finance (bank building), money (hand holding dollar sign), and labor (worker in hard hat). Below these icons, a light blue banner reads "US Market Core Driving Indicators". Specific metrics are listed with red warning triangle icons:<br>- CPI YoY: 3.2%<br>- Federal Funds Rate: 5.25–5.5%<br>- Non-farm Payrolls: +187k July 2024<br><br>**Step 3** (right side, labeled "Metrics that predict 68% of MSCI Asia Ex-Japan 3-month forward returns") focuses on Asian Market Core Leading Indicators. This section includes icons for shipping (container), manufacturing (gears), and calculation (calculator). A light blue banner below reads "Asian Market Core Leading Indicators". Specific metrics are listed:<br>- Manufacturing PMI: 51.2 (with red warning triangle)<br>- Q2 Export Growth: +6.8% YoY (with red warning triangle)<br>- Avg Policy Rate: 3.1% (with information circle icon)<br><br>At the bottom center, a large puzzle piece titled "Policy Shifts & Market Volatility Correlation" displays a line graph with two fluctuating lines:<br>- **US VIX (navy line)** — representing US market volatility<br>- **Asian Avg Volatility (green line)** — representing average Asian market volatility<br><br>Arrows connect the two lines, indicating correlation. Below the graph, key insights are provided with red warning triangles:<br>- Rate hike impact: +27% US VIX<br>- Trade policy impact: +34% Asian VIX<br>- Cross-regional sell-off correlation: 0.68<br><br>The overall structure visually represents how US and Asian market performances are driven by distinct but interrelated economic indicators, with a central focus on their volatility dynamics and policy impacts. The use of puzzle pieces metaphorically suggests that these components fit together to form a complete picture of global market trends. The infographic employs consistent iconography, color-coding (red for warnings, blue for core sections), and clear textual labeling to convey complex financial data in an accessible format.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create a vintage-styled ornithology infographic with a light beige background and dark tree silhouettes framing the left and right sides. At the top, center-aligned, place the main title 'Alter bringt Vorsicht' in a bold, serif font, followed by the subtitle 'Sekunden pro Minute für Suchen vs. Picken bei Staren verschiedener Altersgruppen'. The central visualization is a diverging horizontal bar chart featuring age groups vertically listed in the middle: 'Jungvogel', 'Einjähriger Jungvogel', 'Zweijähriger Altvogel', 'Ausgewachsener Altvogel', and 'Senior'. For each group, a Rust Red bar extending to the right represents 'Picken', while a Sepia Tone bar extending to the left represents 'Suchen'. Numerical labels are placed inside each bar near the outer edge. Below the chart, two x-axes ranging from 0 to 40 reflect the time in seconds for both directions. A legend in the bottom right corner displays square markers for 'Picken' (Rust Red) and 'Suchen' (Sepia Tone). The given data is : [{"age_group": "Jungvogel", "pecking": 20, "scanning": 10}, {"age_group": "Einjähriger Jungvogel", "pecking": 30, "scanning": 20}, {"age_group": "Zweijähriger Altvogel", "pecking": 35, "scanning": 25}, {"age_group": "Ausgewachsener Altvogel", "pecking": 28, "scanning": 30}, {"age_group": "Senior", "pecking": 20, "scanning": 35}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: right; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic with the title 'Ajustes de Velocidad del Sprint' and the subtitle 'Desviación de alcance del equipo backend: Trabajo planificado vs. Corrección de errores no planificados' centered at the top. The central visual is a vertical bar chart set against a light gray background, featuring a horizontal baseline at the zero-point mark. Positive values (planned work) are represented by green bars extending upwards, while negative values (unplanned bug fixes) are represented by yellow bars extending downwards. Each bar is topped or bottomed with multi-line text containing the module name and its value in 'pts'. Specifically, the bars are labeled 'Autenticación de usuario' (50 pts), 'API de pago' (35 pts), 'Sincronización de inventario' (22 pts), 'BD de analítica' (14 pts), 'Servicio de correo' (-10 pts), 'Webhooks' (-28 pts), 'Carga de multimedia' (-42 pts), and 'Pasarela heredada' (-70 pts). An illustration of mechanical gears is positioned in the upper-right area, and a cluster of small bug icons is placed in the lower-left area. The vertical axis shows tick labels ranging from +60 pts to -80 pts. The given data is : [{"category": "Autenticación de usuario", "unit": "pts", "value": 50}, {"category": "API de pago", "unit": "pts", "value": 35}, {"category": "Sincronización de inventario", "unit": "pts", "value": 22}, {"category": "BD de analítica", "unit": "pts", "value": 14}, {"category": "Servicio de correo", "unit": "pts", "value": -10}, {"category": "Webhooks", "unit": "pts", "value": -28}, {"category": "Carga de multimedia", "unit": "pts", "value": -42}, {"category": "Pasarela heredada", "unit": "pts", "value": -70}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that features a title and a subtitle centered at the top, reading 'Fastest Cuisines to Prepare' and 'Average Ghost Kitchen Handover Time by Item Type (Minutes)' respectively. The main visual is a horizontal grouped bar chart combining a Fast-food neon visual style with checkerboard borders along the edges, featuring a centered legend above the chart area for 'QuickEats' (cyan neon border) and 'DashNow' (orange neon border). To the bottom right of the bar chart, there is a simple illustration of two mopeds waiting for orders. The chart's vertical axis lists four categories, each preceded by a simple icon, while the horizontal axis represents handover time in minutes with numerical labels at 0, 5, 10, 15, and 20, supplemented by dotted vertical gridlines. Each category features a pair of black bars representing the two platforms, with exact values displayed directly inside the right end of each bar. For 'Classic Tacos', QuickEats takes 10.0 minutes while DashNow takes 11.5 minutes. 'Supreme Burritos' require the longest preparation, with 17.5 minutes for QuickEats and 19.0 minutes for DashNow. 'Spicy Nachos' take 9.5 minutes on QuickEats and 10.0 minutes on DashNow. Finally, 'Mini Quesadillas' are the fastest, taking 8.0 minutes for QuickEats and 8.5 minutes for DashNow. The given data is : [{"category": "Classic Tacos", "platform": "QuickEats", "unit": "Minutes", "value": 10.0}, {"category": "Classic Tacos", "platform": "DashNow", "unit": "Minutes", "value": 11.5}, {"category": "Supreme Burritos", "platform": "QuickEats", "unit": "Minutes", "value": 17.5}, {"category": "Supreme Burritos", "platform": "DashNow", "unit": "Minutes", "value": 19.0}, {"category": "Spicy Nachos", "platform": "QuickEats", "unit": "Minutes", "value": 9.5}, {"category": "Spicy Nachos", "platform": "DashNow", "unit": "Minutes", "value": 10.0}, {"category": "Mini Quesadillas", "platform": "QuickEats", "unit": "Minutes", "value": 8.0}, {"category": "Mini Quesadillas", "platform": "DashNow", "unit": "Minutes", "value": 8.5}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that presents a centered title at the top, stating "Übertaktet vs. Standard-Takt", with the subtitle "Temperaturanstieg bei langen Gaming-Sessions" directly below it. The main visual is a line chart spanning the width of the infographic on a dark background, embodying a Gamer Aesthetic with vibrant RGB neon accents. This chart has a vertical axis on the left labeled with numerical values in increments of 10 from 30°C to 100°C, and a horizontal axis at the bottom with time labels: '0m', '15m', '30m', '45m', '60m', '75m', '90m', '105m', and '120m'. Horizontal grid lines mark each 10°C increment. A horizontal legend is positioned under the subtitle, containing a cyan circular marker and line for "Standard-Takt" and a magenta circular marker and line for "Übertaktet (+150MHz)". Two data series are plotted as glowing neon lines with hollow circular markers at each data point, accompanied by gradient shading below each line. The cyan "Standard-Takt" line shows a steep rise from 38°C at 0m to 68°C at 15m, followed by a flat plateau reaching 73.5°C at 120m. The magenta "Übertaktet" line displays a similar initial spike from 42°C to 75°C, but continues with a gradual linear creep up to 93°C at 120m. Spike annotations (callout boxes) point to the final data points on the right, highlighting the peak temperatures: a magenta box reads "Peak: 93°C" and a cyan box reads "Peak: 73.5°C". A stylized thermometer line-art icon is subtly placed in the center of the chart's background. The given data is : [{"profile": "Standard-Takt", "temperature": 38, "time": "0m"}, {"profile": "Übertaktet", "temperature": 42, "time": "0m"}, {"profile": "Standard-Takt", "temperature": 68, "time": "15m"}, {"profile": "Übertaktet", "temperature": 75, "time": "15m"}, {"profile": "Standard-Takt", "temperature": 71, "time": "30m"}, {"profile": "Übertaktet", "temperature": 79, "time": "30m"}, {"profile": "Standard-Takt", "temperature": 72, "time": "45m"}, {"profile": "Übertaktet", "temperature": 82, "time": "45m"}, {"profile": "Standard-Takt", "temperature": 72.5, "time": "60m"}, {"profile": "Übertaktet", "temperature": 85, "time": "60m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "75m"}, {"profile": "Übertaktet", "temperature": 87, "time": "75m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "90m"}, {"profile": "Übertaktet", "temperature": 89, "time": "90m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "105m"}, {"profile": "Übertaktet", "temperature": 91, "time": "105m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "120m"}, {"profile": "Übertaktet", "temperature": 93, "time": "120m"}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic that displays data in a vertical diverging bar chart format. At the top left of the visualization, there is a title: 'Anomalie de l'Atlantique Sud : Dérive magnétique', and a subtitle: 'Vecteurs de dérive vers l'est et l'ouest en kilomètres par rapport à la ligne de base historique'. In the upper left area below the text, an icon of a compass rose is placed within a magnetic field line curve. The main chart features a horizontal zero-axis line, labeled with a '0' on the far left, representing the historical coordinate baseline. The x-axis at the bottom displays the decades '1980', '1990', '2000', '2010', and '2020', each marked with a small vertical tick. For each decade, a vertical bar extends from the zero-axis, with its corresponding data label positioned directly at the end of the bar. The data shows westward drift represented by blue bars extending below the axis for '1980' with a value of '-15 km' and '1990' with a value of '-32 km'. Eastward drift is represented by red bars extending above the axis for '2000' with a value of '+10 km', '2010' with a value of '+45 km', and '2020' with a value of '+68 km'. The overall visual style mimics a geophysical science journal, utilizing compass red and blue color tones. The given data is : [{"decade": "1980", "drift_km": -15}, {"decade": "1990", "drift_km": -32}, {"decade": "2000", "drift_km": 10}, {"decade": "2010", "drift_km": 45}, {"decade": "2020", "drift_km": 68}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">Create an infographic in a corporate report minimalism style with muted corporate grays and blues, featuring a large title, 'Seasonal Fluctuations in 15-Year Mortgages', at the top. Directly below it is a subtitle, 'Historical prepayment velocities showing seasonal housing market trends'. Underneath the subtitle, a horizontal legend identifies two categories with small square icons: 'Spring/Summer Originations' in lighter gray-blue and 'Fall/Winter Originations' in darker gray-blue. The main visual is a multi-line chart in a wide landscape orientation. The vertical axis has numeric labels at 0.0, 5.0, 10.0, 15.0, and 20.0, with horizontal grid lines extending across the plot. The horizontal axis features labels: 'Jan 2018', 'Apr', 'Jul', 'Oct', 'Jan 2019', 'Apr', and 'Jul'. An icon depicting a sleek house silhouette is positioned in the upper left corner of the chart's plotting area. Two distinct lines represent the categories, characterized by cyclical seasonal bumps in the summer months. Both lines have square markers at each data point, with numerical values displayed near them. The lighter line for 'Spring/Summer Originations' plots a value of 8.0 in Jan 2018, rising to 12.5 in Apr, peaking at 16.0 in Jul, dipping to 11.0 in Oct, dropping further to 7.5 in Jan 2019, climbing to 13.0 in Apr, and reaching 17.5 in Jul. The darker line for 'Fall/Winter Originations' mirrors this pattern, starting at 6.5 in Jan 2018, increasing to 9.0 in Apr, hitting 14.5 in Jul, falling to 10.0 in Oct, bottoming out at 6.0 in Jan 2019, rising to 10.5 in Apr, and ending at 15.0 in Jul. The given data is : [{"category": "Spring/Summer Originations", "date": "2018-01", "value": 8.0}, {"category": "Fall/Winter Originations", "date": "2018-01", "value": 6.5}, {"category": "Spring/Summer Originations", "date": "2018-04", "value": 12.5}, {"category": "Fall/Winter Originations", "date": "2018-04", "value": 9.0}, {"category": "Spring/Summer Originations", "date": "2018-07", "value": 16.0}, {"category": "Fall/Winter Originations", "date": "2018-07", "value": 14.5}, {"category": "Spring/Summer Originations", "date": "2018-10", "value": 11.0}, {"category": "Fall/Winter Originations", "date": "2018-10", "value": 10.0}, {"category": "Spring/Summer Originations", "date": "2019-01", "value": 7.5}, {"category": "Fall/Winter Originations", "date": "2019-01", "value": 6.0}, {"category": "Spring/Summer Originations", "date": "2019-04", "value": 13.0}, {"category": "Fall/Winter Originations", "date": "2019-04", "value": 10.5}, {"category": "Spring/Summer Originations", "date": "2019-07", "value": 17.5}, {"category": "Fall/Winter Originations", "date": "2019-07", "value": 15.0}]</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents a comprehensive architectural and structural analysis of the Temple of Kom Ombo, an ancient Egyptian temple located on the west bank of the Nile River. The title "TEMPLE OF KOM OMBO" is prominently displayed in a hand-drawn, white-bordered box in the lower-right corner of the image, set against a brown background that mimics sandstone or earth tones. The overall layout is divided into multiple sections: a central photographic image of the temple ruins under a clear blue sky, surrounded by illustrative technical diagrams, annotated floor plans, and textual data blocks, all rendered in white line art and text for high contrast.<br><br>The central photograph shows the main hypostyle hall and surrounding structures of the temple, with visitors walking among the columns and courtyards, providing a sense of scale. In the background, the Nile River and palm trees are visible, situating the temple in its natural environment. The ruins are constructed from light-colored sandstone blocks, consistent with the material noted in the text.<br><br>In the upper-left quadrant, a 3D axonometric diagram illustrates the overall dimensions of the temple complex: approximately 62 meters by 51 meters, labeled along the axes. Adjacent to this, a list of key structural facts is presented in bullet points:<br>- TEMPLE AXIS: DOUBLE SANCTUARY FOR SOBEK & HORUS<br>- OVERALL DIMENSIONS (APPROX. 62M x 51M)<br>- CONSTRUCTION MATERIAL: SANDSTONE BLOCKS<br>- COLUMN HEIGHTS: UP TO 12 METERS<br><br>Above the central photo, two schematic diagrams illustrate architectural details:<br>- A top-down view of the hypostyle hall showing 30 columns arranged in a grid, labeled “HYPOSTYLE HALL (30 COLUMNS)” and pointing to “TWO SANCTUARIES.”<br>- A cross-section labeled “PYLON AND HYPOSTYLE SECTION,” which includes a detailed vertical cutaway showing the roofing system supported by columns, with arrows indicating load paths down to foundations.<br><br>To the right of the central image, text notes “TWO ENTRANCES SYMBOLIZING DUALITY,” emphasizing the temple’s unique dual dedication. This concept is reinforced in the lower section of the infographic, where a detailed floor plan is overlaid on the brown ground area.<br><br>The floor plan, drawn in white lines, is annotated with various features:<br>- INNER TEMPLE (FOR SOBEK) — marked with a rectangular inner sanctum.<br>- INNER TEMPLE (FOR HAROERIS) — another distinct inner sanctum, indicating the dual religious function.<br>- NILOMETER — a structure used to measure the Nile’s water level.<br>- BIRTH HOUSE (MAMMISHI) — a smaller chamber associated with fertility rituals.<br>- MUMMIFIED CROCODILE MUSEUM SITE — indicating a location within the temple complex for sacred crocodile mummies.<br>- TWO ENTRANCES SYMBOLIZING DUALITY — shown as two separate entryways on the plan.<br><br>Surrounding the floor plan are inset images of relief carvings, each labeled:<br>- MEDICAL INSTRUMENT RELIEFS — depicting figures with tools.<br>- TWO ENTRANCES RELIEFS — showing doorways flanked by deities.<br>- CALENDAR RELIEFS — illustrating scenes related to timekeeping or agricultural cycles.<br><br>Additional annotations point to structural aspects:<br>- “STRUCTURAL LOAD PATHS FROM COLUMNS TO FOUNDATIONS” — illustrated with curved arrows tracing the force transfer from columns through the walls to the ground.<br>- The pylon and hypostyle section diagram also labels “ROOFING SYSTEM” and shows how the roof beams rest on column capitals.<br><br>All textual content is in English, using a clean, sans-serif font that enhances readability. The visual style blends real photography with technical illustrations and hand-drawn elements, creating an educational and engaging format suitable for tourists, students, or archaeologists. The infographic effectively communicates both the physical characteristics and symbolic significance of the Temple of Kom Ombo, highlighting its duality, engineering, and cultural importance.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is laid out with two tables at the top, followed by a two-column text layout. The tables span the full width of the text area. The text includes a section heading.<br><br>[paragraphs]<br><br>the TOPIC MODELER, the GENDER SEGMENTER, and an OTHER module (transcript length and duration). We test for a linear relationship between each pair of variables: $H_O : r = 0$, $H_A : r \neq 0$, where $H_O$ is the origi-nal hypothesis, $H_A$ is the alternate hypothesis, and $r$ is the Pearson’s correlation coefficient. We follow Reddy et al. (2021) and Yang et al. (2019) and apply a Bonferroni cor-rection to our $\alpha$ value of $0.05$, setting $\alpha = 0.05/z$, where $z = \binom{124}{2} = 7,626$ for LDA, representing the number of feature relationships we consider. Hence, we reject $H_O$ in favor of $H_A$ if $p \leq \alpha$. Given the largeness of $z$, our $\alpha$ value becomes small, making our criteria for significance strict and thus suitable for investigating our research ques-tions. Furthermore, we filter our correlations $r$, such that $\Vert r\Vert > 0.1$ for our LDA experiments, and $\Vert r\Vert > 0.05$ for our BERTopic experiments (due to the smaller sample size of 10,000 podcasts, and fewer samples may have higher vari-ance). Our results focus on a selection of these significant correlations; the full results are available on the project web-site: https://www.gendered-discourse.net/extended-results.<br><br>### RQ0: How Are Women and Men’s Discourse Different?<br><br>Using GDCF, our Gendered Discourse Correlation Frame-work shown in Figure 2, we then analyze significant corre-lations between between the gender features from the GEN-DER SEGMENTER module (Doukhan et al. 2018a), and the topic features from the TOPIC MODELER module (Blei, Ng, and Jordan 2003). We use the discourse topics to automati-cally form gendered discourse word lists via their significant correlations.<br><br>Starting with the first row of Table 1, we see that Topic 3’s word list returned by LDA with Non-Contextual Embed-dings (Bag-Of-Words) (via the TOPIC MODELER module) contains the words women, woman, men, baby, pregnant, girls, men, doctor, health, birth (in descending weighted or-der). Based on this word list, we manually interpret this topic as being a content topic, specifically about pregnancy, as noted in the column “Topic N Categories.” Then, we look to the gender correlations in the columns “Gender” and “$r$,” and see that $r(\text{Topic 3, Women}) = +0.15$ and $r(\text{Topic 3, Men}) = -0.14$. This indicates that the topic of pregnancy positively correlates with women (identified via the GENDER SEGMENTER module), and negatively corre-lates with men. Therefore, we associate Topic 3 (Content - Pregnancy) with Women, as noted in the “Topic N Gender” column. Similarly, we make these associations in the “Topic N Gender” column for Topics 10, 49, and 71.<br><br>Next, we focus on the Topic 54 row. This topic is inter-preted using the word list get, like, know, right, people, go-ing, podcast, make, want, one. This word list does not refer to any content, hence, we manually interpret this topic as being a discourse topic. Moving to the gender correlations, we see that $r(\text{Topic 54, Women}) = \emptyset$ and $r(\text{Topic 3, Men}) = +0.12$. The reason for $r(\text{Topic 54, Women}) = \emptyset$ is because the correlation between the features Topic 54 and Women did not come back as significant. However, due to the positive correlation of $0.12$ for Topic 3 and Men, we manually as-sociate Topic 3 with Men in the “Topic N Gender” column.<br><br>[tables]<br><br>Table 1: LDA with Non-Contextual Embeddings (Bag-Of-Words): The complete set of significant correlations between gender features and topic features – both content topics and discourse topics. Based on $r$, the Topic N Gender forms the gendered (discourse) word lists via Topics 54 and 60 (the masculine word lists) and Topic 62 (the feminine word list).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 3 | Women <br> Men | 0.15 <br> -0.14 | women, woman, men, baby, pregnant, girls, men, doctor, health, birth | Content - Pregnancy | Women |<br>| Topic 10 | Women <br> Men | 0.10 <br> -0.12 | energy, body, feel, mind, space, yoga, love, beautiful, feeling, meditation | Content - Yoga | Women |<br>| Topic 49 | Women <br> Men | -0.21 <br> 0.17 | game, know, think, team, going, mean, play, year, one, good | Content - Sports | Men |<br>| Topic 71 | Women <br> Men | 0.14 <br> -0.14 | christmas, sex, girl, hair, love, get, date, girls, let, wear | Content - Dating | Women |<br>| Topic 54 | Women <br> Men | – <br> 0.12 | get, like, know, right, people, going, podcast, make, want, one | Discourse | Men |<br>| Topic 60 | Women <br> Men | -0.27 <br> 0.20 | going, know, think, get, got, one, really, good, well, yeah | Discourse | Men |<br>| Topic 62 | Women <br> Men | 0.33 <br> -0.28 | like, know, really, going, people, want, think, get, things, life | Discourse | Women |<br><br>Table 2: BERTopic with Contextual Embeddings (BERT, ChatGPT, Llama): The complete set of significant correlations between gender features and topic features for discourse topics only (content topics are omitted).<br><br>| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |<br>|---|---|---|---|---|---|<br>| Topic 0 | Women <br> Men | -0.08 <br> 0.10 | like, yeah, know, oh, right, podcast, got, going, think, really | Discourse | Men |<br>| Topic 2 | Women <br> Men | 0.08 <br> -0.08 | life, know, things, really, people, feel, like, want, love, going | Discourse | Women |<br>| Topic 5 | Women <br> Men | 0.08 <br> – | like, know, think, yeah, episode, really, going, anchor, kind, right | Discourse | Women |</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout with a single column. The text is justified and divided into sections and subsections, indicated by numbered headings. Important terms at the beginning of some paragraphs are bolded. A horizontal rule separates the header from the main content, and another rule separates the main content from the footnote at the bottom.<br><br>[paragraphs]<br><br>Preprint Version.<br><br>**Figure–Table Integration.** In addition to textual refinement, we extend the refinement process to include multimodal elements, to further enhance readability. For each section, the model first generates visualization requirements, such as tables with structured comparisons or figures with explanatory diagrams, together with natural language descriptions. Based on these descriptions, candidate figures and tables are synthesized. The compiled outputs are then fed back to an LLM for quality assessment, enabling automatic detection of issues such as oversized layouts or unreadable text. The LLM provides corrective suggestions, which are applied to improve the final visualizations. Finally, the text is refined again to ensure that all generated figures and tables are properly referenced within the survey.<br><br># 4 EXPERIMENTS<br><br>## 4.1 EXPERIMENTAL SETTINGS<br><br>**Implementation Details.** Following Wang et al. (2024b), we adopt **GPT-4o-mini** as our genera-tion model for its balance of responsiveness and cost. Our retrieval database contains 680K computer science papers from arXiv, with PDFs converted into structured Markdown using MinerU (Wang et al., 2024a) for consistent formatting. The details of the retrieval process are provided in App. A.1. In outline generation, the system consults 1000–1200 papers, with a maximum of 8 sections. For section drafting, each subsection retrieves up to 60 additional relevant papers, combined with those linked during outline generation. Finally, we apply two iterations of the review-and-refine loop to enhance coherence across sections and improve overall readability. Illustrative outputs compared with AutoSurvey are provided in App. A.8.<br><br>**Baselines.** We compare IterSurvey with a set of baselines, ranging from simple retrieval-augmented generation (Naive RAG), which directly drafts from retrieved documents, to more ad-vanced state-of-the-art systems. Specifically, we evaluate against AutoSurvey (Wang et al., 2024b), the first systematic framework for this task; SurveyForge (Yan et al., 2025), which combines heuris-tic outline generation based on the logical structures of human-written surveys with a memory-driven scholar navigation agent for high-quality retrieval; and SurveyGo (Wang et al., 2025), which em-ploys the LLM×MapReduce-V2 algorithm to address the long-context challenge. We also compare with SurveyX (Liang et al., 2025), which introduces an Attribute Tree-based outlining mechanism; however, due to access restrictions, we include SurveyX only in arena experiments. All methods are evaluated on the same retrieval database with generation hyperparameters aligned to their original settings for fairness.<br><br>## 4.2 AUTOMATIC EVALUATION RESULTS<br><br>**Evaluation Setup.** We employ multiple complementary protocols to evaluate the quality of gen-erated surveys. On the 20-topic suite from Wang et al. (2024b), we adopt multi-dimensional scoring with LLM-as-a-judge. Content quality is assessed along three dimensions: coverage, structure, and relevance followed from Wang et al. (2024b). Besides, citation quality is evaluated using the NLI-based protocol of Gao et al. (2023), reporting both recall and precision: _Citation Recall_ measures whether all statements in the generated text are fully supported by the cited passages, while _Citation Precision_ identifies irrelevant citations to ensure that references are pertinent and directly support the claims. To improve scoring stability and reliability, prompts are standardized and judges must pro-vide a rationale before assigning scores. For additional robustness, we aggregate outputs from three judge models: GPT-4o, Claude-3.5-Haiku, and GLM-4.5V.1 Full prompts are provided in App. A.7.<br><br>**Results.** The results on the 20 topics from Wang et al. (2024b) are reported in Tab. 1. Statistical significance was confirmed via paired t-tests, indicating that IterSurvey consistently outperforms baseline models ($p <0.05$).Wesummarizethemainobservationsbelow.<br><br>- **Overall superiority.** IterSurvey consistently outperforms all baselines across both content and citation quality, achieving the highest overall average score (4.75). This demonstrates that the proposed framework is effective and robust across multiple evaluation dimensions.<br><br>[page_number]<br><br>6<br><br>[footnotes]<br><br>1Specifically, we use `chatgpt-4o-latest`, `claude-3-5-haiku-20241022`, and `glm-4.5v`.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>This is a single-column page containing mostly text, structured with section headings and bold inline subheadings. URLs are formatted in a monospaced font and hyperlinked.<br><br>[paragraphs]<br><br># A Image generation models<br><br>This section details the two diffusion image generation models used in this work, namely Stable Diffusion 1.4 and 1.5.<br><br>**Stable Diffusion 1.4** The Stable Diffusion model is a text-conditioned image generator model that combines an autoencoder with a diffusion model to create a latent diffusion model. The autoencoder encodes images into latent representations with a reduced dimensionality when compared to the input image, reducing the computational needs during the training phase. Text prompts, on the other hand, are encoded using a text encoder and are then cross-attended by the UNet backbone of the latent diffusion model. Finally, the loss is computed using a reconstruction objective between the noise added to the latent representation and the prediction made by the UNet.<br>Stable Diffusion 1.4 (https://huggingface.co/CompVis/stable-diffusion-v1-4) had several rounds of training on the LAION dataset (https://laion.ai/), with each round changing the input image dimension, aesthetic score, and the probability of dropping the text-conditioning to improve classifier-free guidance.<br><br>**Stable Diffusion 1.5** SD 1.5, in turn, has the same architecture and even the same starting point as 1.4, with the difference being how long the model was fine-tuned on top of SD 1.2. The 1.4 version is fine-tuned for 225 thousand steps at resolution 512x512 on “laion-aesthetics v2 5+” with a 10% probability of dropping the text-conditioning, and version 1.5 for 595 thousand steps.<br>As demonstrated in Section D Stable Diffusion 1.4 has better performance than 1.5 in our approach, therefore, we will adopt SD 1.4 for most of the experiments in this paper.<br><br># B Large language models<br><br>Here we give additional details on the large language models that we used in our experiments.<br><br>**Gemma** (Mesnard et al., 2024), trained on a diverse 6 Trillion token dataset comprising web documents, code and mathematical texts. We resorted to the 7 Billion parameter instruction-tuned decoder-only model, named _gemma-7b-it_ (https://huggingface.co/google/gemma-7b-it). This model uses a chat template, which we employ during inference.<br><br>**Llama 2** (Touvron et al., 2023), of which we used the 7 Billion parameter, pre-trained-only model, _Llama-2-7b_ (https://huggingface.co/meta-llama/Llama-2-7b-hf). This model was trained with a mix of publicly available data totalling 2 Trillion tokens. While its chat versions employ supervised fine-tuning and reinforcement learning with human feedback for alignment with human preferences in helpfulness and safety, the pre-trained-only model does not. This results in a less constrained model, but it may also cause it to disperse from the task at hand. Since this model is a pre-trained-only no chat template is needed.<br><br>**Mistral** (Jiang et al., 2023) fine-tuned on various HuggingFace instruction datasets. We resorted to the 7 Billion _Mistral-7B-Instruct-v0.2_ model (https://huggingface.co/mistralai/ Mistral-7B-Instruct-v0.2) and used the respective chat template during inference.<br><br>**Phi-2** (Gunasekar et al., 2023) is a compact 2.7 Billion model (https://huggingface.co/microsoft/ phi-2). Despite its size, it offers a competitive performance with respect to models several times its size. It was trained on 250 Billion tokens, obtained through a combination of NLP synthetic data created by GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by GPT-4. This model was not fine-tuned through reinforcement learning from human feedback and does not have guardrails.<br><br>**Model ranking**<br>A ranking of these models in terms of their performance can be found in the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) which assesses several LLMs that are trained under the same criteria and tested on the same benchmarks, including reasoning</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">[typesetting]<br><br>The page is a standard academic paper layout, likely from a preprint server like arXiv. It features a title, author list with affiliations, an abstract, and the beginning of the "Introduction" section. A preprint notification ("Preprint. Under review.") is present at the bottom left. The text on the left margin ("arXiv:2502.01522v2 [cs.CV] 30 May 2025") is a vertical stamp typical of arXiv submissions.<br><br>[paragraphs]<br><br>arXiv:2502.01522v2 [cs.CV] 30 May 2025<br><br># Unpaired Deblurring via Decoupled Diffusion Model<br><br>**Junhao Cheng**$^1$, **Wei-Ting Chen**$^2$, **Xi Lu**$^1$, **Ming-Hsuan Yang**$^3$<br>$^1$Sun Yat-sen University $^2$ Microsoft $^3$ University of California, Merced<br>https://github.com/donahowe/UID-Diff<br><br>**Abstract**<br><br>Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. In favor of their ability to supplement missing details and generate aesthetically pleasing contents, recent works have applied them to image deblurring via training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose UID-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as structural features and blur patterns extractors separately. The features extracted by them will be used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task by leveraging unpaired blurred images from the target domain simultaneously. We further introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization capabilities of UID-Diff when encountering unknown blur patterns. Experiments on real-world datasets demonstrate that UID-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation in various challenging scenarios.<br><br># 1 Introduction<br><br>Dynamic blur occurs when the camera and subject move relative to each other during the exposure time, resulting in a smeared and blurred image. Deblurring, the process of removing the blur pattern while preserving the underlying structure of degraded images, is essential for restoring high-quality images for human perception and low-level computer vision applications.<br><br>With the rapid advancement of photographic technology, a wide range of imaging devices are now employed to capture images in real-world scenarios. Due to their diverse lenses and structural designs, these devices may produce distinct blur patterns [1, 2, 3]. This diversity makes it challenging to develop an all-in-one method for deblurring images from arbitrary and varied sources. Consequently, focusing on deblurring algorithms tailored to specific domains has become increasingly significant.<br><br>As deep learning has advanced in recent years, existing deblurring models predominantly build on data-driven approaches that employ neural networks trained via supervised learning on synthetic paired data. Existing works have made efforts to develop deblurring models upon CNN [4, 5], Transformer [6, 7], and GAN [8, 9]. Recently, a new wave of research [10, 11, 12] has begun to investigate the integration of pre-trained generative diffusion models [13], such as Stable Diffusion (SD) [14], with an adapter designed to provide structural guidance for deblurring. These approaches aim to harness the generative capabilities of diffusion models to supplement missing details and generate aesthetically pleasing outputs. However, since paired blurry-sharp training data is limited in<br><br>[footnotes]<br><br>Preprint. Under review.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic presents an augmented reality (AR) shopping experience overlaid on a real-world retail environment. The scene is set in a brightly lit cosmetics aisle of a store, with shelves stocked with beauty products visible in the background. In the foreground, a pair of hands holds a black rectangular compact labeled "ANASTASIA BEVERLY HILLS BROW POWDER DUO" with "EBONY" and "NET WT. 2.5 OZ." printed below. A gold ring is visible on the left hand’s ring finger, and a black wristband is partially seen on the left wrist.<br><br>Overlaid on the image are several semi-transparent, rounded-corner UI elements resembling AR pop-ups or digital cards, providing contextual information about the product and the user’s shopping list.<br><br>On the left side, a vertical panel titled "SHOPPING LIST" lists four items:<br>1. Face Wash — marked with an “X” (completed)<br>2. Shampoo — marked with an “X” (completed)<br>2. Eye Cream — marked with an empty checkbox (not completed; duplicated item number)<br>3. Eye Cream — marked with an empty checkbox (not completed)<br><br>This suggests a possible error or duplication in the list, with two entries for "Eye Cream".<br><br>In the center-right, a speech-bubble-shaped label displays the price: "$23.00".<br><br>To the right of the product, a larger panel titled "PRODUCT DETAILS:" provides information about the "ABH Brow Powder Duo". It features two color swatches:<br>- Left swatch: labeled "DEEP BROWN"<br>- Right swatch: labeled "BLACK"<br><br>Below the swatches, a star rating system shows four and a half filled stars, accompanied by the text "4.5 out of 5 stars".<br><br>Underneath the rating, a section titled "COMMON USES:" states: "DEFINES & FILLS BROWS".<br><br>Further down, a smaller rectangular box labeled "KEY INGREDIENTS" lists:<br>- Vitamin E<br>- Finely Milled Pigments<br><br>At the bottom right, another box titled "APPLICATION TIPS" includes a video icon (a rectangle with a play triangle) and the word "Video", indicating a multimedia tutorial is available.<br><br>The overall layout mimics an immersive AR interface, likely from a smart glasses or smartphone application, designed to enhance in-store shopping by providing instant, interactive product data directly within the user’s field of view. The visual style uses dark gray, translucent backgrounds with white text for high contrast and readability against the busy store backdrop. The design emphasizes usability, with clear categorization of information into distinct panels and intuitive icons. All textual content is in English, and no other languages are present.</div></details></td>
<tdcolspan="2"valign="top"><details><summary><b>Prompt</b></summary><divstyle="text-align: left; max-height: 200px; overflow-y: auto; font-size: 0.85em; font-weight: normal; font-style: normal;">The infographic titled "Personal Insight & Casual Life Reflection Essay Collections: A Complete Guide" is presented in a tactile, clay-like 3D style with rounded edges and textured surfaces, giving it a handcrafted, approachable aesthetic. The background is a neutral light gray, which makes the colorful clay elements stand out. The title is displayed at the top in two lines: the first line "Personal Insight & Casual Life Reflection" is in a warm terracotta brown, while the second line "Essay Collections: A Complete Guide" uses a bright sky blue for "A Complete Guide," creating visual emphasis.<br><br>The infographic is organized as a grid with three main vertical sections (rows) and seven horizontal columns. The leftmost column contains the main categories, each labeled with a large blue clay tile with yellow text. These are:<br><br>1. **Core Traits of the Genre** – with a smaller brown clay subtitle: "Key traits of the personal insight & life reflection genre"<br>2. **Value for Readers & Writers** – with a smaller brown clay subtitle: "Why These Collections Matter"<br>3. **Actionable Engagement Steps** – with a smaller brown clay subtitle: "Simple steps for new readers and writers"<br><br>The remaining six columns form the header row, each represented by a colored clay tile with white text:<br><br>- **Infographic Multi-factor** (blue)<br>- **Statistics** (teal)<br>- **Quotes** (orange)<br>- **Key Terms** (green)<br>- **Style Preferences** (purple)<br>- **Layout Preferences** (light blue)<br>- **Other Requirements** (dark purple)<br><br>Each of the three main rows then has corresponding content under these headers. For the first two data points under each category (Statistics, Quotes, Key Terms), there is a white clay tile with a brown checkmark (√), indicating inclusion or relevance. For the last three headers (Style Preferences, Layout Preferences, Other Requirements), there is a white tile with a number "3" in the respective color of that column’s header.<br><br>Below each checkmark or number, there is a larger white clay tile with black text describing the specific point. The full structure is as follows:<br><br>---<br><br>### Row 1: Core Traits of the Genre<br><br>- **Statistics**: √ → *Rooted in real lived experience*<br>- **Quotes**: √ → *Inspired by small everyday moments*<br>- **Key Terms**: √ → *No formal structure rules*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 2: Value for Readers & Writers<br><br>- **Statistics**: √ → *Readers: Comfort & relatable validation*<br>- **Quotes**: √ → *Writers: Low-stakes creative outlet*<br>- **Key Terms**: √ → *All audiences: Record of personal growth*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (yellow) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>### Row 3: Actionable Engagement Steps<br><br>- **Statistics**: √ → *Step 1: Start with short essays to read*<br>- **Quotes**: √ → *Step 2: Write 100-word daily reflections*<br>- **Key Terms**: √ → *Step 3: Group by theme to build your collection*<br>- **Style Preferences**: 3 (orange) → *(no text)*<br>- **Layout Preferences**: 3 (green) → *(no text)*<br>- **Other Requirements**: 3 (purple) → *(no text)*<br><br>---<br><br>The design uses consistent visual encoding: checkmarks indicate presence or applicability, while the number "3" likely denotes a level of preference or importance (though no legend is provided). The colors are used both for column headers and for the numbers, creating a visual link between the header and its corresponding data point. All text is rendered in a clean, sans-serif font, and the overall layout is structured and easy to follow, with clear separation between rows and columns via subtle clay dividers. The infographic communicates a comprehensive yet accessible guide to engaging with personal insight and life reflection essays through a visually engaging and tactile design.</div></details></td>