interleaved.md 15.3 KB
Newer Older
raojy's avatar
first  
raojy committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
# Interleaved Generation Evaluation

Reproduction guide for SenseNova-U1 on interleaved generation benchmarks.

The benchmark scripts live under `evaluation/interleave/`:

- `BabyVision/` — API-based multimodal understanding with answer extraction and judge scoring
- `OpenING/` — local-model interleaved generation with GPT-based judging
- `Unimmmu/` — local-model interleaved generation with external score computation
- `Realunify/` — local-model interleaved generation for GEU and UEG

## 1. Overview

```
┌──────────────────────┐
│ evaluation/interleave│
└──────────┬───────────┘

           ├── BabyVision   ── API inference ── extract / judge ── aggregate score
           ├── OpenING      ── local inference ─ GPT judge       ── summarize
           ├── Unimmmu      ── local inference ─ external scorer
           └── RealUnify    ── local inference ─ rule / judge scoring
```

1. `BabyVision` sends requests to one or more `/generate` endpoints and writes JSONL predictions.
2. `OpenING`, `Unimmmu`, and `RealUnify` load the model locally through `transformers`.
3. Some benchmarks have a separate judge or score-aggregation step after inference.
4. Most scripts support resume-friendly reruns through existing outputs, explicit `--resume`, or shard merging.

All commands below assume:

```bash
cd evaluation/interleave
```

## 2. Benchmark Matrix

| Benchmark | Inference backend | Evaluation backend | Primary outputs |
| --- | --- | --- | --- |
| `BabyVision` | HTTP `/generate` API | extraction + LLM judge | `babyvision_<model>.jsonl`, `*_eval.jsonl` |
| `OpenING` | local model | GPT judge + CSV summary | per-sample JSON, generated images, judge JSON |
| `Unimmmu` | local model | external scorer | `unimmmu_results.jsonl`, generated images |
| `RealUnify (GEU)` | local model | rule-based scorer | `realunify_results.jsonl`, score JSON |
| `RealUnify (UEG)` | local model | user-provided judge wrapper | `ueg_results.json[l]` |

For local-model benchmarks, pass the real dataset path explicitly instead of relying on placeholder defaults such as `<DATA_ROOT>/...`.

## 3. BabyVision

`BabyVision` is the API-backed benchmark in this suite. The typical flow is inference first, then extraction and judge scoring, then score aggregation.

Reference inference command:

```bash
python3 BabyVision/infer_babyvision.py \
  --model-name local-model \
  --data-path /path/to/meta_data.jsonl \
  --image-root /path/to/babyvision_images \
  --output-dir ./output/babyvision_understand \
  --generate-urls http://127.0.0.1:8000/generate \
  --workers 32 \
  --max-retries 3 \
  --backend-max-retries 20 \
  --request-timeout 600 \
  --max-new-tokens 32768 \
  --no-do-sample \
  --temperature 0 \
  --top-p 0.95 \
  --repetition-penalty 1.05 \
  --min-pixels 262144 \
  --max-pixels 4194304
```

| Argument | Meaning |
| --- | --- |
| `--data-path` | Path to `meta_data.jsonl`. |
| `--image-root` | Root directory used to resolve sample image paths. |
| `--generate-urls` | One or more `/generate` endpoints, comma-separated. |
| `--workers` | Concurrent request workers. |
| `--max-retries`, `--backend-max-retries` | Retry budget on the sample side and backend-request side. |
| `--request-timeout` | Per-request timeout in seconds. |
| `--min-pixels`, `--max-pixels` | Image preprocessing bounds. |

Inference writes `babyvision_<model_name>.jsonl`. Completed `taskId`s are skipped automatically on rerun.

Reference evaluation command:

```bash
python3 BabyVision/eval_babyvision.py \
  --input ./output/babyvision_understand/babyvision_local-model.jsonl \
  --output ./output/babyvision_understand/babyvision_local-model_eval.jsonl \
  --endpoint https://your-judge-endpoint \
  --api-key your_api_key \
  --model gpt-4.1 \
  --force \
  --workers 16 \
  --retries 3
```

`eval_babyvision.py` performs answer extraction and judge scoring. `--endpoint` and `--api-key` can also come from environment variables. Use `--force` to recompute existing records, or `--judge-only` to score records that already have `extracted_answer`.

Reference score command:

```bash
python3 BabyVision/compute_score.py \
  ./output/babyvision_understand/babyvision_local-model_eval.jsonl
```

The score script reports overall accuracy plus per-`type` and per-`subtype` results.

## 4. OpenING

`OpenING` runs local-model interleaved generation and then scores the outputs with a GPT judge.

Reference single-GPU inference command:

```bash
python3 OpenING/infer_opening.py \
  --mode opening \
  --model_path /path/to/model \
  --save_dir ./output/opening_interleave/opening_output \
  --meta-path /path/to/OpenING-benchmark \
  --data-file-name test_data.jsonl \
  --think_mode think \
  --cfg_scale 4.0 \
  --img_cfg_scale 1.0 \
  --timestep_shift 3.0 \
  --cfg_interval 0 1.0 \
  --num_steps 50 \
  --max_new_tokens 4096 \
  --max_generation_pixels 4194304 \
  --oom_retry_max_pixels 1048576 \
  --image_width 1920 \
  --image_height 1088 \
  --opening_step_prompt_style can_be \
  --retry_short_outputs 0 \
  --seed 42
```

Reference 8-shard single-node command:

```bash
mkdir -p logs
for LOCAL_RANK in 0 1 2 3 4 5 6 7; do
  echo "Starting shard ${LOCAL_RANK} on GPU ${LOCAL_RANK}"
  CUDA_VISIBLE_DEVICES=${LOCAL_RANK} python3 OpenING/infer_opening.py \
    --mode opening \
    --model_path /path/to/model \
    --save_dir ./output/opening_interleave/opening_output \
    --meta-path /path/to/OpenING-benchmark \
    --data-file-name test_data.jsonl \
    --think_mode think \
    --num_shards 8 \
    --shard_index ${LOCAL_RANK} \
    --cfg_scale 4.0 \
    --img_cfg_scale 1.0 \
    --timestep_shift 3.0 \
    --cfg_interval 0 1.0 \
    --num_steps 50 \
    --max_new_tokens 4096 \
    --max_generation_pixels 4194304 \
    --oom_retry_max_pixels 1048576 \
    --image_width 1920 \
    --image_height 1088 \
    --opening_step_prompt_style can_be \
    --retry_short_outputs 0 \
    --seed 42 \
    > logs/opening_shard${LOCAL_RANK}.log 2>&1 &
done
wait
```

| Argument | Meaning |
| --- | --- |
| `--model_path` | Local model path. |
| `--meta-path` | Dataset root. |
| `--data-file-name` | Dataset JSONL file under the benchmark root. |
| `--save_dir` | Output directory for per-sample JSON and images. |
| `--think_mode` | `think`, `no_think`, or both. |
| `--cfg_interval`, `--max_generation_pixels`, `--oom_retry_max_pixels` | Main generation and OOM-retry controls. |
| `--num_shards`, `--shard_index` | Manual sharding for multi-process runs. |

Each sample is saved as `<save_dir>/<total_uid>.json`, and generated images use names such as `<save_dir>/<total_uid>-o-0.jpg`.

Reference judge command:

```bash
export OPENING_JUDGE_BASE_URL=http://127.0.0.1:8000
export OPENING_JUDGE_API_KEY=your_api_key

python3 OpenING/eval_opening.py \
  --mode output_dir \
  --opening_root /path/to/OpenING \
  --output_dir ./output/opening_interleave/opening_output \
  --output_file /path/to/OpenING/gpt-score_results_opening_output.json \
  --workers 4 \
  --save_every 10
```

`eval_opening.py` supports both a single model-output directory and a parent directory containing multiple outputs. Existing judge results are reused by default; `--retry_invalid_scores` retries only malformed score records.

Reference summary command:

```bash
python3 OpenING/summarize_GPT_scores.py \
  --input_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_opening_output.json \
  --output_csv /path/to/OpenING/Interleaved_Arena/model_score_summaries.csv \
  --filtered_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_filtered.json
```

This step converts raw judge results into a comparison-friendly CSV and can optionally emit a filtered JSON with invalid scores removed.

## 5. Unimmmu

`Unimmmu` supports both understanding-only and interleaved generation paths, but the interleaved mode is the one covered here.

Reference inference command:

```bash
python3 Unimmmu/inference_unimmmu.py \
  --model_path /path/to/model \
  --data_path /path/to/unimmmu_direct.jsonl \
  --output_dir ./output/unimmmu_interleave \
  --inference_mode interleave \
  --cfg_scale 4.0 \
  --img_cfg_scale 1.0 \
  --num_steps 50
```

Reference multi-GPU command:

```bash
torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \
  --model_path /path/to/model \
  --data_path /path/to/unimmmu_direct.jsonl \
  --output_dir ./output/unimmmu_interleave \
  --inference_mode interleave \
  --cfg_scale 4.0 \
  --img_cfg_scale 1.0 \
  --num_steps 50
```

Reference `device_map` command:

```bash
python3 Unimmmu/inference_unimmmu.py \
  --model_path /path/to/model \
  --data_path /path/to/unimmmu_direct.jsonl \
  --output_dir ./output/unimmmu_interleave \
  --inference_mode interleave \
  --device_map auto \
  --max_memory_per_gpu_gb 60 \
  --cfg_scale 4.0 \
  --num_steps 50
```

Reference shard merge command:

```bash
python3 Unimmmu/merge_shards.py \
  --data_path /path/to/unimmmu_direct.jsonl \
  --shard_dir ./output/unimmmu_interleave/shards \
  --output_file ./output/unimmmu_interleave/unimmmu_results.jsonl
```

| Argument | Meaning |
| --- | --- |
| `--inference_mode` | Use `interleave` for the benchmark covered here. |
| `--data_path` | Benchmark JSONL path. |
| `--output_dir` | Root directory for JSONL results and generated images. |
| `--resume` | Skip completed `hash_uid`s. |
| `--num_shards`, `--shard_rank` | Manual data sharding. |
| `--device_map auto` | Single-process multi-GPU loading via Hugging Face. |

The main output file is `unimmmu_results.jsonl`. Interleaved images are written under `<output_dir>/images/<task>/`. In the current implementation, resume is applied before shard selection, so rerunning one shard is most reliable after deleting that shard's outputs and rerunning without `--resume`.

Reference score command:

```bash
python3 Unimmmu/calculate_score.py \
  --input_file ./output/unimmmu_interleave/unimmmu_results.jsonl \
  --output_dir ./output/unimmmu_interleave/scores \
  --benchmark_path /path/to/image_text_agent
```

`calculate_score.py` delegates the actual scoring logic to the external benchmark repository pointed to by `--benchmark_path`.

## 6. RealUnify (GEU)

The GEU script supports both `step` and `interleave` modes. The interleaved path is the main one for SenseNova-U1 benchmarking.

Reference inference command:

```bash
python3 Realunify/inference_realunify.py \
  --model_path /path/to/model \
  --data_path /path/to/GEU_step_processed.jsonl \
  --output_dir ./output/realunify_interleave \
  --inference_mode interleave \
  --cfg_scale 4.0 \
  --img_cfg_scale 1.0 \
  --num_steps 50
```

Reference multi-GPU command:

```bash
torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \
  --model_path /path/to/model \
  --data_path /path/to/GEU_step_processed.jsonl \
  --output_dir ./output/realunify_interleave \
  --inference_mode interleave \
  --cfg_scale 4.0 \
  --img_cfg_scale 1.0 \
  --num_steps 50
```

Reference `device_map` command:

```bash
python3 Realunify/inference_realunify.py \
  --model_path /path/to/model \
  --data_path /path/to/GEU_step_processed.jsonl \
  --output_dir ./output/realunify_interleave \
  --inference_mode interleave \
  --device_map auto \
  --max_memory_per_gpu_gb 60 \
  --cfg_scale 4.0 \
  --num_steps 50
```

Reference shard merge command:

```bash
python3 Realunify/merge_shards.py \
  --data_path /path/to/GEU_step_processed.jsonl \
  --shard_dir ./output/realunify_interleave/shards \
  --output_file ./output/realunify_interleave/realunify_results.jsonl
```

The main result file is `realunify_results.jsonl`. In `step` mode, `generated_image` stores `[input_image, edited_image]`; in `interleave` mode, the generated sequence is stored under `generated_images`. As with `Unimmmu`, resume currently happens before manual shard selection.

If you want a fixed output image size, pass `--target_image_size 1024`.

Reference score command:

```bash
python3 Realunify/calculate_score.py \
  --input_file ./output/realunify_interleave/realunify_results.jsonl \
  --output_file ./output/realunify_interleave/realunify_scores.json
```

The scorer first tries to extract the answer from `<answer>...</answer>` and otherwise falls back to the first `A/B/C/D` letter found in `model_response`.

## 7. RealUnify (UEG)

The UEG script exposes `understand_t2i`, `interleave`, and `t2i` inference modes.

Reference `understand_t2i` command:

```bash
python3 Realunify/inference_realunify_ueg.py \
  --model_path /path/to/model \
  --data_path /path/to/UEG_step.json \
  --output_dir ./output/ueg_understand_t2i \
  --inference_mode understand_t2i \
  --cfg_scale 4.0 \
  --num_steps 50
```

Reference `interleave` command:

```bash
python3 Realunify/inference_realunify_ueg.py \
  --model_path /path/to/model \
  --data_path /path/to/UEG_step.json \
  --output_dir ./output/ueg_interleave \
  --inference_mode interleave \
  --cfg_scale 4.0 \
  --timestep_shift 3.0 \
  --num_steps 50
```

Reference `t2i` command:

```bash
python3 Realunify/inference_realunify_ueg.py \
  --model_path /path/to/model \
  --data_path /path/to/UEG_step.json \
  --output_dir ./output/ueg_t2i \
  --inference_mode t2i \
  --cfg_scale 4.0 \
  --num_steps 50
```

Unlike the GEU script, this one does not expose manual `--num_shards` or `--shard_rank` flags. Multi-process splitting relies on the distributed rank provided by `torchrun`. The output preserves generated image paths together with the follow-up `question_list`.

Reference score command:

```bash
python3 Realunify/calculate_score_ueg.py \
  --input_file ./output/ueg_interleave/ueg_results.json
```

`calculate_score_ueg.py` is only a scaffold in the current repository. It expects a user-provided `GeminiAPI` judge wrapper and otherwise raises `NotImplementedError`.

## 8. Running the Evaluation

A typical local-model evaluation flow looks like this:

```bash
MODEL_PATH=/path/to/hf_model

torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \
  --model_path ${MODEL_PATH} \
  --data_path /path/to/GEU_step_processed.jsonl \
  --output_dir ./output/realunify_interleave \
  --inference_mode interleave \
  --cfg_scale 4.0 \
  --img_cfg_scale 1.0 \
  --num_steps 50

python3 Realunify/inference_realunify_ueg.py \
  --model_path ${MODEL_PATH} \
  --data_path /path/to/UEG_step.json \
  --output_dir ./output/ueg_understand_t2i \
  --inference_mode understand_t2i \
  --cfg_scale 4.0 \
  --num_steps 50

torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \
  --model_path ${MODEL_PATH} \
  --data_path /path/to/unimmmu_direct.jsonl \
  --output_dir ./output/unimmmu_interleave \
  --inference_mode interleave \
  --cfg_scale 4.0 \
  --img_cfg_scale 1.0 \
  --num_steps 50
```

`RealUnify (GEU)`, `RealUnify (UEG)`, and `Unimmmu` are independent and can run in parallel. `BabyVision` and `OpenING` each have their own inference-plus-evaluation pipeline as described above.

## 9. Troubleshooting

- Dataset file not found: check `--data_path`, or for `OpenING`, verify `--meta-path` and `--data-file-name`.
- A path still contains `<DATA_ROOT>`: the script is using a placeholder default; pass the real path explicitly.
- Samples are unexpectedly skipped: outputs already exist; review `--resume`, `--overwrite`, or shard outputs.
- Rerunning a single shard gives odd behavior: for `Unimmmu` and `RealUnify (GEU)`, delete that shard's outputs first and rerun without `--resume`.
- UEG scoring fails immediately: `calculate_score_ueg.py` needs a user-supplied `GeminiAPI` wrapper.