# Interleaved Generation Evaluation Reproduction guide for SenseNova-U1 on interleaved generation benchmarks. The benchmark scripts live under `evaluation/interleave/`: - `BabyVision/` — API-based multimodal understanding with answer extraction and judge scoring - `OpenING/` — local-model interleaved generation with GPT-based judging - `Unimmmu/` — local-model interleaved generation with external score computation - `Realunify/` — local-model interleaved generation for GEU and UEG ## 1. Overview ``` ┌──────────────────────┐ │ evaluation/interleave│ └──────────┬───────────┘ │ ├── BabyVision ── API inference ── extract / judge ── aggregate score ├── OpenING ── local inference ─ GPT judge ── summarize ├── Unimmmu ── local inference ─ external scorer └── RealUnify ── local inference ─ rule / judge scoring ``` 1. `BabyVision` sends requests to one or more `/generate` endpoints and writes JSONL predictions. 2. `OpenING`, `Unimmmu`, and `RealUnify` load the model locally through `transformers`. 3. Some benchmarks have a separate judge or score-aggregation step after inference. 4. Most scripts support resume-friendly reruns through existing outputs, explicit `--resume`, or shard merging. All commands below assume: ```bash cd evaluation/interleave ``` ## 2. Benchmark Matrix | Benchmark | Inference backend | Evaluation backend | Primary outputs | | --- | --- | --- | --- | | `BabyVision` | HTTP `/generate` API | extraction + LLM judge | `babyvision_.jsonl`, `*_eval.jsonl` | | `OpenING` | local model | GPT judge + CSV summary | per-sample JSON, generated images, judge JSON | | `Unimmmu` | local model | external scorer | `unimmmu_results.jsonl`, generated images | | `RealUnify (GEU)` | local model | rule-based scorer | `realunify_results.jsonl`, score JSON | | `RealUnify (UEG)` | local model | user-provided judge wrapper | `ueg_results.json[l]` | For local-model benchmarks, pass the real dataset path explicitly instead of relying on placeholder defaults such as `/...`. ## 3. BabyVision `BabyVision` is the API-backed benchmark in this suite. The typical flow is inference first, then extraction and judge scoring, then score aggregation. Reference inference command: ```bash python3 BabyVision/infer_babyvision.py \ --model-name local-model \ --data-path /path/to/meta_data.jsonl \ --image-root /path/to/babyvision_images \ --output-dir ./output/babyvision_understand \ --generate-urls http://127.0.0.1:8000/generate \ --workers 32 \ --max-retries 3 \ --backend-max-retries 20 \ --request-timeout 600 \ --max-new-tokens 32768 \ --no-do-sample \ --temperature 0 \ --top-p 0.95 \ --repetition-penalty 1.05 \ --min-pixels 262144 \ --max-pixels 4194304 ``` | Argument | Meaning | | --- | --- | | `--data-path` | Path to `meta_data.jsonl`. | | `--image-root` | Root directory used to resolve sample image paths. | | `--generate-urls` | One or more `/generate` endpoints, comma-separated. | | `--workers` | Concurrent request workers. | | `--max-retries`, `--backend-max-retries` | Retry budget on the sample side and backend-request side. | | `--request-timeout` | Per-request timeout in seconds. | | `--min-pixels`, `--max-pixels` | Image preprocessing bounds. | Inference writes `babyvision_.jsonl`. Completed `taskId`s are skipped automatically on rerun. Reference evaluation command: ```bash python3 BabyVision/eval_babyvision.py \ --input ./output/babyvision_understand/babyvision_local-model.jsonl \ --output ./output/babyvision_understand/babyvision_local-model_eval.jsonl \ --endpoint https://your-judge-endpoint \ --api-key your_api_key \ --model gpt-4.1 \ --force \ --workers 16 \ --retries 3 ``` `eval_babyvision.py` performs answer extraction and judge scoring. `--endpoint` and `--api-key` can also come from environment variables. Use `--force` to recompute existing records, or `--judge-only` to score records that already have `extracted_answer`. Reference score command: ```bash python3 BabyVision/compute_score.py \ ./output/babyvision_understand/babyvision_local-model_eval.jsonl ``` The score script reports overall accuracy plus per-`type` and per-`subtype` results. ## 4. OpenING `OpenING` runs local-model interleaved generation and then scores the outputs with a GPT judge. Reference single-GPU inference command: ```bash python3 OpenING/infer_opening.py \ --mode opening \ --model_path /path/to/model \ --save_dir ./output/opening_interleave/opening_output \ --meta-path /path/to/OpenING-benchmark \ --data-file-name test_data.jsonl \ --think_mode think \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --timestep_shift 3.0 \ --cfg_interval 0 1.0 \ --num_steps 50 \ --max_new_tokens 4096 \ --max_generation_pixels 4194304 \ --oom_retry_max_pixels 1048576 \ --image_width 1920 \ --image_height 1088 \ --opening_step_prompt_style can_be \ --retry_short_outputs 0 \ --seed 42 ``` Reference 8-shard single-node command: ```bash mkdir -p logs for LOCAL_RANK in 0 1 2 3 4 5 6 7; do echo "Starting shard ${LOCAL_RANK} on GPU ${LOCAL_RANK}" CUDA_VISIBLE_DEVICES=${LOCAL_RANK} python3 OpenING/infer_opening.py \ --mode opening \ --model_path /path/to/model \ --save_dir ./output/opening_interleave/opening_output \ --meta-path /path/to/OpenING-benchmark \ --data-file-name test_data.jsonl \ --think_mode think \ --num_shards 8 \ --shard_index ${LOCAL_RANK} \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --timestep_shift 3.0 \ --cfg_interval 0 1.0 \ --num_steps 50 \ --max_new_tokens 4096 \ --max_generation_pixels 4194304 \ --oom_retry_max_pixels 1048576 \ --image_width 1920 \ --image_height 1088 \ --opening_step_prompt_style can_be \ --retry_short_outputs 0 \ --seed 42 \ > logs/opening_shard${LOCAL_RANK}.log 2>&1 & done wait ``` | Argument | Meaning | | --- | --- | | `--model_path` | Local model path. | | `--meta-path` | Dataset root. | | `--data-file-name` | Dataset JSONL file under the benchmark root. | | `--save_dir` | Output directory for per-sample JSON and images. | | `--think_mode` | `think`, `no_think`, or both. | | `--cfg_interval`, `--max_generation_pixels`, `--oom_retry_max_pixels` | Main generation and OOM-retry controls. | | `--num_shards`, `--shard_index` | Manual sharding for multi-process runs. | Each sample is saved as `/.json`, and generated images use names such as `/-o-0.jpg`. Reference judge command: ```bash export OPENING_JUDGE_BASE_URL=http://127.0.0.1:8000 export OPENING_JUDGE_API_KEY=your_api_key python3 OpenING/eval_opening.py \ --mode output_dir \ --opening_root /path/to/OpenING \ --output_dir ./output/opening_interleave/opening_output \ --output_file /path/to/OpenING/gpt-score_results_opening_output.json \ --workers 4 \ --save_every 10 ``` `eval_opening.py` supports both a single model-output directory and a parent directory containing multiple outputs. Existing judge results are reused by default; `--retry_invalid_scores` retries only malformed score records. Reference summary command: ```bash python3 OpenING/summarize_GPT_scores.py \ --input_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_opening_output.json \ --output_csv /path/to/OpenING/Interleaved_Arena/model_score_summaries.csv \ --filtered_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_filtered.json ``` This step converts raw judge results into a comparison-friendly CSV and can optionally emit a filtered JSON with invalid scores removed. ## 5. Unimmmu `Unimmmu` supports both understanding-only and interleaved generation paths, but the interleaved mode is the one covered here. Reference inference command: ```bash python3 Unimmmu/inference_unimmmu.py \ --model_path /path/to/model \ --data_path /path/to/unimmmu_direct.jsonl \ --output_dir ./output/unimmmu_interleave \ --inference_mode interleave \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --num_steps 50 ``` Reference multi-GPU command: ```bash torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \ --model_path /path/to/model \ --data_path /path/to/unimmmu_direct.jsonl \ --output_dir ./output/unimmmu_interleave \ --inference_mode interleave \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --num_steps 50 ``` Reference `device_map` command: ```bash python3 Unimmmu/inference_unimmmu.py \ --model_path /path/to/model \ --data_path /path/to/unimmmu_direct.jsonl \ --output_dir ./output/unimmmu_interleave \ --inference_mode interleave \ --device_map auto \ --max_memory_per_gpu_gb 60 \ --cfg_scale 4.0 \ --num_steps 50 ``` Reference shard merge command: ```bash python3 Unimmmu/merge_shards.py \ --data_path /path/to/unimmmu_direct.jsonl \ --shard_dir ./output/unimmmu_interleave/shards \ --output_file ./output/unimmmu_interleave/unimmmu_results.jsonl ``` | Argument | Meaning | | --- | --- | | `--inference_mode` | Use `interleave` for the benchmark covered here. | | `--data_path` | Benchmark JSONL path. | | `--output_dir` | Root directory for JSONL results and generated images. | | `--resume` | Skip completed `hash_uid`s. | | `--num_shards`, `--shard_rank` | Manual data sharding. | | `--device_map auto` | Single-process multi-GPU loading via Hugging Face. | The main output file is `unimmmu_results.jsonl`. Interleaved images are written under `/images//`. In the current implementation, resume is applied before shard selection, so rerunning one shard is most reliable after deleting that shard's outputs and rerunning without `--resume`. Reference score command: ```bash python3 Unimmmu/calculate_score.py \ --input_file ./output/unimmmu_interleave/unimmmu_results.jsonl \ --output_dir ./output/unimmmu_interleave/scores \ --benchmark_path /path/to/image_text_agent ``` `calculate_score.py` delegates the actual scoring logic to the external benchmark repository pointed to by `--benchmark_path`. ## 6. RealUnify (GEU) The GEU script supports both `step` and `interleave` modes. The interleaved path is the main one for SenseNova-U1 benchmarking. Reference inference command: ```bash python3 Realunify/inference_realunify.py \ --model_path /path/to/model \ --data_path /path/to/GEU_step_processed.jsonl \ --output_dir ./output/realunify_interleave \ --inference_mode interleave \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --num_steps 50 ``` Reference multi-GPU command: ```bash torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \ --model_path /path/to/model \ --data_path /path/to/GEU_step_processed.jsonl \ --output_dir ./output/realunify_interleave \ --inference_mode interleave \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --num_steps 50 ``` Reference `device_map` command: ```bash python3 Realunify/inference_realunify.py \ --model_path /path/to/model \ --data_path /path/to/GEU_step_processed.jsonl \ --output_dir ./output/realunify_interleave \ --inference_mode interleave \ --device_map auto \ --max_memory_per_gpu_gb 60 \ --cfg_scale 4.0 \ --num_steps 50 ``` Reference shard merge command: ```bash python3 Realunify/merge_shards.py \ --data_path /path/to/GEU_step_processed.jsonl \ --shard_dir ./output/realunify_interleave/shards \ --output_file ./output/realunify_interleave/realunify_results.jsonl ``` The main result file is `realunify_results.jsonl`. In `step` mode, `generated_image` stores `[input_image, edited_image]`; in `interleave` mode, the generated sequence is stored under `generated_images`. As with `Unimmmu`, resume currently happens before manual shard selection. If you want a fixed output image size, pass `--target_image_size 1024`. Reference score command: ```bash python3 Realunify/calculate_score.py \ --input_file ./output/realunify_interleave/realunify_results.jsonl \ --output_file ./output/realunify_interleave/realunify_scores.json ``` The scorer first tries to extract the answer from `...` and otherwise falls back to the first `A/B/C/D` letter found in `model_response`. ## 7. RealUnify (UEG) The UEG script exposes `understand_t2i`, `interleave`, and `t2i` inference modes. Reference `understand_t2i` command: ```bash python3 Realunify/inference_realunify_ueg.py \ --model_path /path/to/model \ --data_path /path/to/UEG_step.json \ --output_dir ./output/ueg_understand_t2i \ --inference_mode understand_t2i \ --cfg_scale 4.0 \ --num_steps 50 ``` Reference `interleave` command: ```bash python3 Realunify/inference_realunify_ueg.py \ --model_path /path/to/model \ --data_path /path/to/UEG_step.json \ --output_dir ./output/ueg_interleave \ --inference_mode interleave \ --cfg_scale 4.0 \ --timestep_shift 3.0 \ --num_steps 50 ``` Reference `t2i` command: ```bash python3 Realunify/inference_realunify_ueg.py \ --model_path /path/to/model \ --data_path /path/to/UEG_step.json \ --output_dir ./output/ueg_t2i \ --inference_mode t2i \ --cfg_scale 4.0 \ --num_steps 50 ``` Unlike the GEU script, this one does not expose manual `--num_shards` or `--shard_rank` flags. Multi-process splitting relies on the distributed rank provided by `torchrun`. The output preserves generated image paths together with the follow-up `question_list`. Reference score command: ```bash python3 Realunify/calculate_score_ueg.py \ --input_file ./output/ueg_interleave/ueg_results.json ``` `calculate_score_ueg.py` is only a scaffold in the current repository. It expects a user-provided `GeminiAPI` judge wrapper and otherwise raises `NotImplementedError`. ## 8. Running the Evaluation A typical local-model evaluation flow looks like this: ```bash MODEL_PATH=/path/to/hf_model torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \ --model_path ${MODEL_PATH} \ --data_path /path/to/GEU_step_processed.jsonl \ --output_dir ./output/realunify_interleave \ --inference_mode interleave \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --num_steps 50 python3 Realunify/inference_realunify_ueg.py \ --model_path ${MODEL_PATH} \ --data_path /path/to/UEG_step.json \ --output_dir ./output/ueg_understand_t2i \ --inference_mode understand_t2i \ --cfg_scale 4.0 \ --num_steps 50 torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \ --model_path ${MODEL_PATH} \ --data_path /path/to/unimmmu_direct.jsonl \ --output_dir ./output/unimmmu_interleave \ --inference_mode interleave \ --cfg_scale 4.0 \ --img_cfg_scale 1.0 \ --num_steps 50 ``` `RealUnify (GEU)`, `RealUnify (UEG)`, and `Unimmmu` are independent and can run in parallel. `BabyVision` and `OpenING` each have their own inference-plus-evaluation pipeline as described above. ## 9. Troubleshooting - Dataset file not found: check `--data_path`, or for `OpenING`, verify `--meta-path` and `--data-file-name`. - A path still contains ``: the script is using a placeholder default; pass the real path explicitly. - Samples are unexpectedly skipped: outputs already exist; review `--resume`, `--overwrite`, or shard outputs. - Rerunning a single shard gives odd behavior: for `Unimmmu` and `RealUnify (GEU)`, delete that shard's outputs first and rerun without `--resume`. - UEG scoring fails immediately: `calculate_score_ueg.py` needs a user-supplied `GeminiAPI` wrapper.