first

Commit 2a934cec authored May 25, 2026 by raojy
20 changed files
--- a/SenseNova-U1/docs/u1_infographic_showcases.md
+++ b/SenseNova-U1/docs/u1_infographic_showcases.md
--- a/SenseNova-U1/evaluation/README.md
+++ b/SenseNova-U1/evaluation/README.md
+# Evaluation
+
+Benchmark reproduction scripts and guides for SenseNova-U1.
+
+## Sections
+
+- [Visual Understanding](docs/understanding.md) — reproduction scripts for visual understanding benchmarks
+- [Image Generation](docs/image_generation.md) — reproduction scripts for image generation benchmarks
+- [Interleaved Generation](docs/interleaved.md) — reproduction scripts for interleaved generation benchmarks
--- a/SenseNova-U1/evaluation/README_CN.md
+++ b/SenseNova-U1/evaluation/README_CN.md
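+# Evaluation
+
+Benchmark reproduction scripts and guides for SenseNova-U1.
+
+## Contents
+
+- [Visual Understanding](docs/understanding.md): reproduction scripts for visual understanding benchmarks
+- [Image Generation](docs/image_generation.md): reproduction scripts for image generation benchmarks
+- [Interleaved Generation](docs/interleaved.md): reproduction scripts for interleaved generation benchmarks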
--- a/SenseNova-U1/evaluation/docs/image_generation.md
+++ b/SenseNova-U1/evaluation/docs/image_generation.md
+# Image Generation Evaluation
+
+Reproduction scripts for SenseNova-U1 on image generation benchmarks. Each
+benchmark lives in its own subfolder under [`evaluation/gen/`](../gen/) and
+ships with a generation script, an evaluation script, and a shell launcher
+wiring them together:
+
+```
+evaluation/gen/
+├── bizgeneval/              # BizGenEval — business / infographic prompts
+│   ├── gen_images_bizgeneval.py
+│   ├── eval_images_bizgeneval.py
+│   ├── run_bizgeneval.sh
+│   └── data/test.jsonl
+├── igenbench/               # IGenBench — general-purpose T2I benchmark
+│   ├── gen_images_igenbench.py
+│   ├── eval_images_igenbench.py
+│   ├── run_igenbench.sh
+│   └── data/*.json
+├── longtext/                # LongText — long-text rendering (en / zh)
+│   ├── gen_images_longtext.py
+│   ├── eval_images_longtext.py
+│   ├── run_longtext.sh
+│   └── data/{text_prompts.jsonl,text_prompts_zh.jsonl}
+├── cvtg/                    # CVTG-2K — complex visual text generation
+│   ├── eval_cvtg.py
+│   ├── unified_metrics_eval.py
+│   ├── sa_0_4_vit_l_14_linear.pth
+│   ├── run_cvtgeval.sh
+│   └── data/{CVTG,CVTG-Style}/{2..5}{,_combined}.json
+└── tiif/                    # TIIF-Bench — text-image instruction following
+    ├── eval_tiif.py
+    ├── run_tiifeval.sh
+    ├── eval/{eval_with_vlm_mp,summary_results,summary_dimension_results}.py
+    └── data/{testmini,test}{_prompts,_eval_prompts}/*.jsonl
+```
+
+Every benchmark follows the same two-stage flow: **generate images**, then
+**evaluate them** (usually against an OpenAI-compatible judge model). The
+shell launchers chain both stages, so the typical entry point is just:
+
+```bash
+bash evaluation/gen/<bench>/run_<bench>.sh
+```
+
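+Edit the variables at the top of each launcher (model path, API key / base,
+judge model, output dirs) before running. Launcher names do not always match
+the folder name: CVTG-2K uses `run_cvtgeval.sh` and TIIF-Bench uses
+`run_tiifeval.sh`.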
+
+## BizGenEval
+
+Infographic / business-style prompts. Images are judged by an
+OpenAI-compatible VLM (Gemini 3 Pro by default).
+
+End-to-end:
+
+```bash
+bash evaluation/gen/bizgeneval/run_bizgeneval.sh
+```
+
+Or run the two stages manually:
+
+```bash
+# 1) Generate
+python evaluation/gen/bizgeneval/gen_images_bizgeneval.py \
+  --model-path sensenova/SenseNova-U1-8B-MoT-SFT \
+  --output-dir outputs/sensenova/bizgeneval \
+  --cfg-scale 4.0 --cfg-norm none --timestep-shift 3.0 --num-steps 50
+
+# 2) Judge
+python evaluation/gen/bizgeneval/eval_images_bizgeneval.py \
+  --image-dir outputs/sensenova/bizgeneval \
+  --output-dir outputs/sensenova/bizgeneval_eval \
+  --api-base  http://your-api-base/v1 \
+  --api-key   sk-... \
+  --judge-model gemini-3-pro-preview \
+  --concurrency 8
+```
+
+Prompts are loaded from [`bizgeneval/data/test.jsonl`](../gen/bizgeneval/data/test.jsonl).
+The summary (per-item scores + aggregate) is written under `--output-dir`.
+
+## IGenBench
+
+General-purpose T2I benchmark with direct image-question judging.
+
+Prepare the IGenBench metadata from
+[`Brookseeworld/IGenBench-Dataset`](https://huggingface.co/datasets/Brookseeworld/IGenBench-Dataset/tree/main)
+and place the per-item JSON files under
+[`igenbench/data/`](../gen/igenbench/data/). The scripts read those JSON files
+directly for both generation prompts and evaluation questions.
+
+```bash
+bash evaluation/gen/igenbench/run_igenbench.sh
+```
+
+Manual:
+
+```bash
+python evaluation/gen/igenbench/gen_images_igenbench.py \
+  --model-path sensenova/SenseNova-U1-8B-MoT-SFT \
+  --output-dir outputs/sensenova/igenbench \
+  --cfg-scale 4.0 --cfg-norm none --timestep-shift 3.0 --num-steps 50
+
+python evaluation/gen/igenbench/eval_images_igenbench.py \
+  --image-dir outputs/sensenova/igenbench \
+  --output-dir outputs/sensenova/igenbench_eval \
+  --api-base  http://your-api-base/v1 \
+  --api-key   sk-... \
+  --judge-model gemini-3-pro-preview \
+  --concurrency 128
+```
+
+Set `--gen-model-name` to tag the judgments with a custom identifier (useful
+when comparing multiple generators under the same `--output-dir`).
+
+## LongText
+
+Long-text rendering benchmark, run separately for English (`--lang en`) and
+Chinese (`--lang zh`). The launcher executes both passes back to back:
+
+```bash
+bash evaluation/gen/longtext/run_longtext.sh
+```
+
+Manual (single language):
+
+```bash
+python evaluation/gen/longtext/gen_images_longtext.py \
+  --model-path sensenova/SenseNova-U1-8B-MoT-SFT \
+  --output-dir outputs/longtext/en \
+  --lang en \
+  --cfg-scale 4.0 --cfg-norm none --timestep-shift 3.0 --num-steps 50
+
+python evaluation/gen/longtext/eval_images_longtext.py \
+  --image-dir  outputs/longtext/en \
+  --output-dir outputs/longtext/en_eval \
+  --mode en
+```
+
+Evaluation runs OCR + text-match locally, so no judge API is required.
+Prompts live in [`longtext/data/`](../gen/longtext/data/) (`text_prompts.jsonl`
+for `en`, `text_prompts_zh.jsonl` for `zh`).
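To make the idea of a local text-match step concrete, the sketch below scores how many words of the prompt text reappear in an OCR transcript. It is only an illustration under stated assumptions: the function names `tokenize` and `word_recall`, the tokenizer, and the metric definition are hypothetical and are not taken from `eval_images_longtext.py`.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_recall(target: str, ocr_text: str) -> float:
    """Fraction of target words that also occur in the OCR text (multiset match)."""
    target_counts = Counter(tokenize(target))
    ocr_counts = Counter(tokenize(ocr_text))
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    # Each target word is credited at most as often as OCR actually found it.
    matched = sum(min(n, ocr_counts[word]) for word, n in target_counts.items())
    return matched / total
```

A word-level tokenizer like this one only suits the `en` pass; Chinese text would need character-level matching instead.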
+
+## CVTG-2K
+
+Complex visual text generation at 2K resolution, evaluated with the
+in-tree [`unified_metrics_eval.py`](../gen/cvtg/unified_metrics_eval.py)
+script (PaddleOCR-based word accuracy + unified visual-text metrics).
+Generation runs as a single Python process with the model sharded across
+visible GPUs via HuggingFace `device_map`.
+
+```bash
+bash evaluation/gen/cvtg/run_cvtgeval.sh
+```
+
+Prepare the CVTG-2K data from
+[`dnkdnk/CVTG-2K`](https://huggingface.co/datasets/dnkdnk/CVTG-2K)
+and place it under [`cvtg/data/`](../gen/cvtg/data/). The LAION
+aesthetic-predictor head
+[`sa_0_4_vit_l_14_linear.pth`](../gen/cvtg/sa_0_4_vit_l_14_linear.pth)
+sits next to the eval script.
+
+Common overrides (set as env vars before the launcher):
+
+| Variable | Default | Description |
+| :------- | :------ | :---------- |
+| `MODEL_PATH` | `sensenova/SenseNova-U1-8B-MoT-SFT` | Local checkpoint path or HF model id |
+| `BENCHMARK_ROOT` | `evaluation/gen/cvtg/data` | CVTG-2K dataset root |
+| `OUTPUT_DIR` | `<repo>/outputs/sensenova/cvtg` | Generated-image + results dir |
+| `PADDLEOCR_SOURCE_DIR` | — | Pre-downloaded PaddleOCR cache (copied to `$HOME/.paddleocr` if missing) |
+| `IMAGE_SIZE` / `CFG_SCALE` / `TIMESTEP_SHIFT` / `NUM_STEPS` | `2048` / `7.0` / `1.0` / `50` | Sampling config |
+| `SAVE_SIZE` | unset (= `IMAGE_SIZE`) | Downsample with LANCZOS to this resolution before writing PNGs. Set to `1024` to use the "generate at 2048, evaluate at 1024" protocol. |
+| `CVTG_SUBSETS` / `CVTG_AREAS` | `CVTG,CVTG-Style` / `2,3,4,5` | Which splits to run |
+| `CUDA_VISIBLE_DEVICES` | `0,1,2,3,4,5,6,7` | GPUs available for model sharding |
+| `DEVICE_MAP` / `MAX_MEMORY_PER_GPU_GB` | `auto` / `70` | HF `device_map` strategy and per-GPU memory cap |
+| `RUN_GENERATION` / `RUN_EVAL` | `1` / `1` | Stage toggles |
+
+Example — generation only:
+
+```bash
+RUN_EVAL=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+  bash evaluation/gen/cvtg/run_cvtgeval.sh
+```
+
+Generated images land under `$OUTPUT_DIR/<subset>/<area>/<key>.png`, and the
+aggregated metrics are written to `$OUTPUT_DIR/CVTG_results.json`. Re-runs
+skip samples whose output PNG already exists, so an interrupted run can be
+resumed by simply re-invoking the launcher.
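Before resuming, it can be useful to see how much work is left. The sketch below lists the sample keys that still lack an output PNG, using the path layout described above. The helper name `pending_keys` is hypothetical, and the caller is assumed to already hold the list of keys for a given subset and area.

```python
from pathlib import Path


def pending_keys(output_dir: str, subset: str, area: str, keys: list[str]) -> list[str]:
    """Return the keys whose <output_dir>/<subset>/<area>/<key>.png does not exist."""
    base = Path(output_dir) / subset / area
    return [key for key in keys if not (base / f"{key}.png").exists()]
```

For example, `pending_keys("outputs/sensenova/cvtg", "CVTG", "2", keys)` would return the keys a re-run still has to generate for that split.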
+
+## TIIF-Bench
+
+Text-image instruction following benchmark, evaluated with a GPT-4o-class
+judge via the in-tree
+[`eval/eval_with_vlm_mp.py`](../gen/tiif/eval/eval_with_vlm_mp.py).
+
+```bash
+API_KEY=sk-... \
+  bash evaluation/gen/tiif/run_tiifeval.sh
+```
+
+Prepare the TIIF-Bench data from
+[`A113N-W3I/TIIF-Bench`](https://github.com/A113N-W3I/TIIF-Bench)
+and place the prompts under
+[`tiif/data/`](../gen/tiif/data/). The three eval helper scripts live under
+[`tiif/eval/`](../gen/tiif/eval/).
+
+Required / common overrides:
+
+| Variable | Default | Description |
+| :------- | :------ | :---------- |
+| `MODEL_PATH` | `sensenova/SenseNova-U1-8B-MoT-SFT` | Local checkpoint path or HF model id |
+| `OUTPUT_DIR` | `<repo>/outputs/sensenova/tiif` | Generated-image + results dir |
+| `TIIFBENCH_SPLIT` | `testmini` | Which split to run (`testmini` / `test`) |
+| `TIIFBENCH_EVAL_MODEL` | `gpt-4o` | Judge model |
+| `API_KEY` (+ optional `TIIFBENCH_AZURE_ENDPOINT` / `TIIFBENCH_API_VERSION`) | — | Judge API credentials |
+| `IMAGE_SIZE` / `CFG_SCALE` / `CFG_NORM` / `TIMESTEP_SHIFT` / `NUM_STEPS` | `1024` / `4.0` / `global` / `3.0` / `50` | Sampling config |
+| `SAVE_SIZE` | unset (= `IMAGE_SIZE`) | Downsample with LANCZOS to this resolution before writing PNGs. Set to `1024` (with `IMAGE_SIZE=2048`) to use the "generate at 2048, evaluate at 1024" protocol. |
+| `GPUS` / `CUDA_VISIBLE_DEVICES` | `8` / `0..7` | GPU layout (generation uses `torchrun`) |
+| `NUM_NODES` / `NODE_RANK` | `1` / `0` | Multi-node sharding (eval runs only on node 0) |
+| `RUN_GENERATION` / `RUN_EVAL` | `1` / `1` | Stage toggles |
+
+Example — single-node generation + eval against an Azure OpenAI endpoint:
+
+```bash
+API_KEY=sk-... \
+TIIFBENCH_AZURE_ENDPOINT=https://your-endpoint.openai.azure.com \
+MODEL_PATH=/path/to/checkpoint \
+  bash evaluation/gen/tiif/run_tiifeval.sh
+```
+
+Per-question judgments are written to `$OUTPUT_DIR/tiifbench-<split>_results/eval_json/`,
+with a dimension-level summary in `result_summary_dimension.txt` next to it.
+
+## Tips
+
+- **Sampling config.** Defaults mirror the values used in the SenseNova-U1
+  tech report. CVTG-2K in particular expects images generated at
+  `IMAGE_SIZE=2048`; results from lower generation resolutions are not
+  comparable.
+- **Judge APIs.** All API-based evaluators accept any OpenAI-compatible
+  endpoint — point them at SenseNova, Gemini (OpenAI-compat), Azure OpenAI,
+  or a local vLLM / sglang server as needed.
+- **Multi-GPU.** `run_tiifeval.sh` runs data-parallel generation
+  (`torchrun --nproc_per_node`) with one full model replica per GPU.
+  `run_cvtgeval.sh` instead shards a single model across GPUs via HF
+  `device_map="auto"`, which is preferred when one GPU cannot hold the whole
+  model. To scale further, run multiple invocations with disjoint output
+  directories and merge the results.
+- **Re-evaluation.** `eval_images_bizgeneval.py` / `eval_images_igenbench.py`
+  skip items whose judgments already exist in `--output-dir`. Pass
+  `--force-rerun` to ignore the cache.
--- a/SenseNova-U1/evaluation/docs/interleaved.md
+++ b/SenseNova-U1/evaluation/docs/interleaved.md
+# Interleaved Generation Evaluation
+
+Reproduction guide for SenseNova-U1 on interleaved generation benchmarks.
+
+The benchmark scripts live under `evaluation/interleave/`:
+
+- `BabyVision/` — API-based multimodal understanding with answer extraction and judge scoring
+- `OpenING/` — local-model interleaved generation with GPT-based judging
+- `Unimmmu/` — local-model interleaved generation with external score computation
+- `Realunify/` — local-model interleaved generation for GEU and UEG
+
+## 1. Overview
+
+```
+┌──────────────────────┐
+│ evaluation/interleave│
+└──────────┬───────────┘
+           │
+           ├── BabyVision   ── API inference ── extract / judge ── aggregate score
+           ├── OpenING      ── local inference ─ GPT judge       ── summarize
+           ├── Unimmmu      ── local inference ─ external scorer
+           └── RealUnify    ── local inference ─ rule / judge scoring
+```
+
+1. `BabyVision` sends requests to one or more `/generate` endpoints and writes JSONL predictions.
+2. `OpenING`, `Unimmmu`, and `RealUnify` load the model locally through `transformers`.
+3. Some benchmarks have a separate judge or score-aggregation step after inference.
+4. Most scripts support resume-friendly reruns through existing outputs, explicit `--resume`, or shard merging.
+
+All commands below assume:
+
+```bash
+cd evaluation/interleave
+```
+
+## 2. Benchmark Matrix
+
+| Benchmark | Inference backend | Evaluation backend | Primary outputs |
+| --- | --- | --- | --- |
+| `BabyVision` | HTTP `/generate` API | extraction + LLM judge | `babyvision_<model>.jsonl`, `*_eval.jsonl` |
+| `OpenING` | local model | GPT judge + CSV summary | per-sample JSON, generated images, judge JSON |
+| `Unimmmu` | local model | external scorer | `unimmmu_results.jsonl`, generated images |
+| `RealUnify (GEU)` | local model | rule-based scorer | `realunify_results.jsonl`, score JSON |
+| `RealUnify (UEG)` | local model | user-provided judge wrapper | `ueg_results.json[l]` |
+
+For local-model benchmarks, pass the real dataset path explicitly instead of relying on placeholder defaults such as `<DATA_ROOT>/...`.
+
+## 3. BabyVision
+
+`BabyVision` is the API-backed benchmark in this suite. The typical flow is inference first, then extraction and judge scoring, then score aggregation.
+
+Reference inference command:
+
+```bash
+python3 BabyVision/infer_babyvision.py \
+  --model-name local-model \
+  --data-path /path/to/meta_data.jsonl \
+  --image-root /path/to/babyvision_images \
+  --output-dir ./output/babyvision_understand \
+  --generate-urls http://127.0.0.1:8000/generate \
+  --workers 32 \
+  --max-retries 3 \
+  --backend-max-retries 20 \
+  --request-timeout 600 \
+  --max-new-tokens 32768 \
+  --no-do-sample \
+  --temperature 0 \
+  --top-p 0.95 \
+  --repetition-penalty 1.05 \
+  --min-pixels 262144 \
+  --max-pixels 4194304
+```
+
+| Argument | Meaning |
+| --- | --- |
+| `--data-path` | Path to `meta_data.jsonl`. |
+| `--image-root` | Root directory used to resolve sample image paths. |
+| `--generate-urls` | One or more `/generate` endpoints, comma-separated. |
+| `--workers` | Concurrent request workers. |
+| `--max-retries`, `--backend-max-retries` | Retry budget on the sample side and backend-request side. |
+| `--request-timeout` | Per-request timeout in seconds. |
+| `--min-pixels`, `--max-pixels` | Image preprocessing bounds. |
+
+Inference writes `babyvision_<model_name>.jsonl`. Completed `taskId`s are skipped automatically on rerun.
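The skip-on-rerun behaviour can be pictured as reading the existing JSONL and collecting the `taskId` values it already holds. The sketch below is an assumption about how such a check might work, not the code in `infer_babyvision.py`; the function name `load_completed_task_ids` is hypothetical.

```python
import json
from pathlib import Path


def load_completed_task_ids(pred_path: str) -> set:
    """Collect the taskId values already present in a JSONL prediction file."""
    path = Path(pred_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Tolerate a truncated last line left by an interrupted run.
                continue
            if isinstance(record, dict) and "taskId" in record:
                done.add(record["taskId"])
    return done
```

Samples whose `taskId` is in the returned set would then be filtered out before any request is sent.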
+
+Reference evaluation command:
+
+```bash
+python3 BabyVision/eval_babyvision.py \
+  --input ./output/babyvision_understand/babyvision_local-model.jsonl \
+  --output ./output/babyvision_understand/babyvision_local-model_eval.jsonl \
+  --endpoint https://your-judge-endpoint \
+  --api-key your_api_key \
+  --model gpt-4.1 \
+  --force \
+  --workers 16 \
+  --retries 3
+```
+
+`eval_babyvision.py` performs answer extraction and judge scoring. `--endpoint` and `--api-key` can also come from environment variables. Use `--force` to recompute existing records, or `--judge-only` to score records that already have `extracted_answer`.
+
+Reference score command:
+
+```bash
+python3 BabyVision/compute_score.py \
+  ./output/babyvision_understand/babyvision_local-model_eval.jsonl
+```
+
+The score script reports overall accuracy plus per-`type` and per-`subtype` results.
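As a rough picture of that report, the sketch below aggregates accuracy overall, per `type`, and per `subtype` from a list of judged records. The `correct` field and the function name `accuracy_report` are hypothetical; the real record schema in `*_eval.jsonl` may differ.

```python
from collections import defaultdict


def accuracy_report(records: list[dict]) -> dict:
    """Return overall, per-type, and per-subtype accuracy as a flat dict."""
    buckets = defaultdict(lambda: [0, 0])  # bucket name -> [correct, total]
    for record in records:
        hit = 1 if record["correct"] else 0  # `correct` is an assumed field name
        names = ("overall", f"type:{record['type']}", f"subtype:{record['subtype']}")
        for name in names:
            buckets[name][0] += hit
            buckets[name][1] += 1
    return {name: correct / total for name, (correct, total) in buckets.items()}
```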
+
+## 4. OpenING
+
+`OpenING` runs local-model interleaved generation and then scores the outputs with a GPT judge.
+
+Reference single-GPU inference command:
+
+```bash
+python3 OpenING/infer_opening.py \
+  --mode opening \
+  --model_path /path/to/model \
+  --save_dir ./output/opening_interleave/opening_output \
+  --meta-path /path/to/OpenING-benchmark \
+  --data-file-name test_data.jsonl \
+  --think_mode think \
+  --cfg_scale 4.0 \
+  --img_cfg_scale 1.0 \
+  --timestep_shift 3.0 \
+  --cfg_interval 0 1.0 \
+  --num_steps 50 \
+  --max_new_tokens 4096 \
+  --max_generation_pixels 4194304 \
+  --oom_retry_max_pixels 1048576 \
+  --image_width 1920 \
+  --image_height 1088 \
+  --opening_step_prompt_style can_be \
+  --retry_short_outputs 0 \
+  --seed 42
+```
+
+Reference 8-shard single-node command:
+
+```bash
+mkdir -p logs
+for LOCAL_RANK in 0 1 2 3 4 5 6 7; do
+  echo "Starting shard ${LOCAL_RANK} on GPU ${LOCAL_RANK}"
+  CUDA_VISIBLE_DEVICES=${LOCAL_RANK} python3 OpenING/infer_opening.py \
+    --mode opening \
+    --model_path /path/to/model \
+    --save_dir ./output/opening_interleave/opening_output \
+    --meta-path /path/to/OpenING-benchmark \
+    --data-file-name test_data.jsonl \
+    --think_mode think \
+    --num_shards 8 \
+    --shard_index ${LOCAL_RANK} \
+    --cfg_scale 4.0 \
+    --img_cfg_scale 1.0 \
+    --timestep_shift 3.0 \
+    --cfg_interval 0 1.0 \
+    --num_steps 50 \
+    --max_new_tokens 4096 \
+    --max_generation_pixels 4194304 \
+    --oom_retry_max_pixels 1048576 \
+    --image_width 1920 \
+    --image_height 1088 \
+    --opening_step_prompt_style can_be \
+    --retry_short_outputs 0 \
+    --seed 42 \
+    > logs/opening_shard${LOCAL_RANK}.log 2>&1 &
+done
+wait
+```
+
+| Argument | Meaning |
+| --- | --- |
+| `--model_path` | Local model path. |
+| `--meta-path` | Dataset root. |
+| `--data-file-name` | Dataset JSONL file under the benchmark root. |
+| `--save_dir` | Output directory for per-sample JSON and images. |
+| `--think_mode` | `think`, `no_think`, or both. |
+| `--cfg_interval`, `--max_generation_pixels`, `--oom_retry_max_pixels` | Main generation and OOM-retry controls. |
+| `--num_shards`, `--shard_index` | Manual sharding for multi-process runs. |
+
+Each sample is saved as `<save_dir>/<total_uid>.json`, and generated images use names such as `<save_dir>/<total_uid>-o-0.jpg`.
+
+Reference judge command:
+
+```bash
+export OPENING_JUDGE_BASE_URL=http://127.0.0.1:8000
+export OPENING_JUDGE_API_KEY=your_api_key
+
+python3 OpenING/eval_opening.py \
+  --mode output_dir \
+  --opening_root /path/to/OpenING \
+  --output_dir ./output/opening_interleave/opening_output \
+  --output_file /path/to/OpenING/Interleaved_Arena/gpt-score_results_opening_output.json \
+  --workers 4 \
+  --save_every 10
+```
+
+`eval_opening.py` supports both a single model-output directory and a parent directory containing multiple outputs. Existing judge results are reused by default; `--retry_invalid_scores` retries only malformed score records.
+
+Reference summary command:
+
+```bash
+python3 OpenING/summarize_GPT_scores.py \
+  --input_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_opening_output.json \
+  --output_csv /path/to/OpenING/Interleaved_Arena/model_score_summaries.csv \
+  --filtered_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_filtered.json
+```
+
+This step converts raw judge results into a comparison-friendly CSV and can optionally emit a filtered JSON with invalid scores removed.
+
+## 5. Unimmmu
+
+`Unimmmu` supports both understanding-only and interleaved generation paths, but the interleaved mode is the one covered here.
+
+Reference inference command:
+
+```bash
+python3 Unimmmu/inference_unimmmu.py \
+  --model_path /path/to/model \
+  --data_path /path/to/unimmmu_direct.jsonl \
+  --output_dir ./output/unimmmu_interleave \
+  --inference_mode interleave \
+  --cfg_scale 4.0 \
+  --img_cfg_scale 1.0 \
+  --num_steps 50
+```
+
+Reference multi-GPU command:
+
+```bash
+torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \
+  --model_path /path/to/model \
+  --data_path /path/to/unimmmu_direct.jsonl \
+  --output_dir ./output/unimmmu_interleave \
+  --inference_mode interleave \
+  --cfg_scale 4.0 \
+  --img_cfg_scale 1.0 \
+  --num_steps 50
+```
+
+Reference `device_map` command:
+
+```bash
+python3 Unimmmu/inference_unimmmu.py \
+  --model_path /path/to/model \
+  --data_path /path/to/unimmmu_direct.jsonl \
+  --output_dir ./output/unimmmu_interleave \
+  --inference_mode interleave \
+  --device_map auto \
+  --max_memory_per_gpu_gb 60 \
+  --cfg_scale 4.0 \
+  --num_steps 50
+```
+
+Reference shard merge command:
+
+```bash
+python3 Unimmmu/merge_shards.py \
+  --data_path /path/to/unimmmu_direct.jsonl \
+  --shard_dir ./output/unimmmu_interleave/shards \
+  --output_file ./output/unimmmu_interleave/unimmmu_results.jsonl
+```
+
+| Argument | Meaning |
+| --- | --- |
+| `--inference_mode` | Use `interleave` for the benchmark covered here. |
+| `--data_path` | Benchmark JSONL path. |
+| `--output_dir` | Root directory for JSONL results and generated images. |
+| `--resume` | Skip completed `hash_uid`s. |
+| `--num_shards`, `--shard_rank` | Manual data sharding. |
+| `--device_map auto` | Single-process multi-GPU loading via Hugging Face. |
+
+The main output file is `unimmmu_results.jsonl`. Interleaved images are written under `<output_dir>/images/<task>/`. In the current implementation, resume is applied before shard selection, so rerunning one shard is most reliable after deleting that shard's outputs and rerunning without `--resume`.
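The ordering issue is easier to see with a toy example. The sketch below compares "resume, then shard" with "shard, then resume". The strided partition and both function names are assumptions for illustration; the actual script may partition differently, but the re-partitioning effect is the same in kind.

```python
def select_shard(items: list, num_shards: int, shard_rank: int) -> list:
    """Strided shard selection (an assumed scheme, for illustration only)."""
    return items[shard_rank::num_shards]


def remaining_for_shard(items, done, num_shards, shard_rank, resume_first):
    """Items a shard would process under either ordering of resume and sharding."""
    if resume_first:
        # Completed items are dropped first, so the remainder is re-partitioned.
        todo = [x for x in items if x not in done]
        return select_shard(todo, num_shards, shard_rank)
    # Shard membership stays fixed; completed items are skipped inside the shard.
    return [x for x in select_shard(items, num_shards, shard_rank) if x not in done]


items = list(range(8))
done = {0, 2, 4, 6}  # shard 0 finished, shard 1 produced nothing
print(remaining_for_shard(items, done, 2, 1, resume_first=True))   # [3, 7]
print(remaining_for_shard(items, done, 2, 1, resume_first=False))  # [1, 3, 5, 7]
```

With resume applied first, shard 1 picks up only half of its outstanding samples, which is why deleting the shard's outputs and rerunning without `--resume` is the more predictable route.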
+
+Reference score command:
+
+```bash
+python3 Unimmmu/calculate_score.py \
+  --input_file ./output/unimmmu_interleave/unimmmu_results.jsonl \
+  --output_dir ./output/unimmmu_interleave/scores \
+  --benchmark_path /path/to/image_text_agent
+```
+
+`calculate_score.py` delegates the actual scoring logic to the external benchmark repository pointed to by `--benchmark_path`.
+
+## 6. RealUnify (GEU)
+
+The GEU script supports both `step` and `interleave` modes. The interleaved path is the main one for SenseNova-U1 benchmarking.
+
+Reference inference command:
+
+```bash
+python3 Realunify/inference_realunify.py \
+  --model_path /path/to/model \
+  --data_path /path/to/GEU_step_processed.jsonl \
+  --output_dir ./output/realunify_interleave \
+  --inference_mode interleave \
+  --cfg_scale 4.0 \
+  --img_cfg_scale 1.0 \
+  --num_steps 50
+```
+
+Reference multi-GPU command:
+
+```bash
+torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \
+  --model_path /path/to/model \
+  --data_path /path/to/GEU_step_processed.jsonl \
+  --output_dir ./output/realunify_interleave \
+  --inference_mode interleave \
+  --cfg_scale 4.0 \
+  --img_cfg_scale 1.0 \
+  --num_steps 50
+```
+
+Reference `device_map` command:
+
+```bash
+python3 Realunify/inference_realunify.py \
+  --model_path /path/to/model \
+  --data_path /path/to/GEU_step_processed.jsonl \
+  --output_dir ./output/realunify_interleave \
+  --inference_mode interleave \
+  --device_map auto \
+  --max_memory_per_gpu_gb 60 \
+  --cfg_scale 4.0 \
+  --num_steps 50
+```
+
+Reference shard merge command:
+
+```bash
+python3 Realunify/merge_shards.py \
+  --data_path /path/to/GEU_step_processed.jsonl \
+  --shard_dir ./output/realunify_interleave/shards \
+  --output_file ./output/realunify_interleave/realunify_results.jsonl
+```
+
+The main result file is `realunify_results.jsonl`. In `step` mode, `generated_image` stores `[input_image, edited_image]`; in `interleave` mode, the generated sequence is stored under `generated_images`. As with `Unimmmu`, resume currently happens before manual shard selection.
+
+If you want a fixed output image size, pass `--target_image_size 1024`.
+
+Reference score command:
+
+```bash
+python3 Realunify/calculate_score.py \
+  --input_file ./output/realunify_interleave/realunify_results.jsonl \
+  --output_file ./output/realunify_interleave/realunify_scores.json
+```
+
+The scorer first tries to extract the answer from `<answer>...</answer>` and otherwise falls back to the first `A/B/C/D` letter found in `model_response`.
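The extraction rule described above can be sketched in a few lines. This is a minimal illustration, assuming uppercase option letters and a hypothetical function name `extract_choice`; it is not the code in `calculate_score.py`.

```python
import re
from typing import Optional


def extract_choice(model_response: str) -> Optional[str]:
    """Prefer the letter inside <answer>...</answer>; else the first A/B/C/D found."""
    tagged = re.search(r"<answer>(.*?)</answer>", model_response, re.DOTALL)
    if tagged:
        inner = re.search(r"[ABCD]", tagged.group(1))
        if inner:
            return inner.group(0)
    # Fallback: first uppercase option letter anywhere in the response.
    fallback = re.search(r"[ABCD]", model_response)
    return fallback.group(0) if fallback else None
```

The fallback is deliberately crude: a response that starts with a word such as "Based" would yield `B`, so responses without an `<answer>` tag are worth spot-checking.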
+
+## 7. RealUnify (UEG)
+
+The UEG script exposes `understand_t2i`, `interleave`, and `t2i` inference modes.
+
+Reference `understand_t2i` command:
+
+```bash
+python3 Realunify/inference_realunify_ueg.py \
+  --model_path /path/to/model \
+  --data_path /path/to/UEG_step.json \
+  --output_dir ./output/ueg_understand_t2i \
+  --inference_mode understand_t2i \
+  --cfg_scale 4.0 \
+  --num_steps 50
+```
+
+Reference `interleave` command:
+
+```bash
+python3 Realunify/inference_realunify_ueg.py \
+  --model_path /path/to/model \
+  --data_path /path/to/UEG_step.json \
+  --output_dir ./output/ueg_interleave \
+  --inference_mode interleave \
+  --cfg_scale 4.0 \
+  --timestep_shift 3.0 \
+  --num_steps 50
+```
+
+Reference `t2i` command:
+
+```bash
+python3 Realunify/inference_realunify_ueg.py \
+  --model_path /path/to/model \
+  --data_path /path/to/UEG_step.json \
+  --output_dir ./output/ueg_t2i \
+  --inference_mode t2i \
+  --cfg_scale 4.0 \
+  --num_steps 50
+```
+
+Unlike the GEU script, this one does not expose manual `--num_shards` or `--shard_rank` flags. Multi-process splitting relies on the distributed rank provided by `torchrun`. The output preserves generated image paths together with the follow-up `question_list`.
+
+Reference score command:
+
+```bash
+python3 Realunify/calculate_score_ueg.py \
+  --input_file ./output/ueg_interleave/ueg_results.json
+```
+
+`calculate_score_ueg.py` is only a scaffold in the current repository. It expects a user-provided `GeminiAPI` judge wrapper and otherwise raises `NotImplementedError`.
+
+## 8. Running the Evaluation
+
+A typical local-model evaluation flow looks like this:
+
+```bash
+MODEL_PATH=/path/to/hf_model
+
+torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \
+  --model_path ${MODEL_PATH} \
+  --data_path /path/to/GEU_step_processed.jsonl \
+  --output_dir ./output/realunify_interleave \
+  --inference_mode interleave \
+  --cfg_scale 4.0 \
+  --img_cfg_scale 1.0 \
+  --num_steps 50
+
+python3 Realunify/inference_realunify_ueg.py \
+  --model_path ${MODEL_PATH} \
+  --data_path /path/to/UEG_step.json \
+  --output_dir ./output/ueg_understand_t2i \
+  --inference_mode understand_t2i \
+  --cfg_scale 4.0 \
+  --num_steps 50
+
+torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \
+  --model_path ${MODEL_PATH} \
+  --data_path /path/to/unimmmu_direct.jsonl \
+  --output_dir ./output/unimmmu_interleave \
+  --inference_mode interleave \
+  --cfg_scale 4.0 \
+  --img_cfg_scale 1.0 \
+  --num_steps 50
+```
+
+`RealUnify (GEU)`, `RealUnify (UEG)`, and `Unimmmu` are independent and can run in parallel. `BabyVision` and `OpenING` each have their own inference-plus-evaluation pipeline as described above.
+
+## 9. Troubleshooting
+
+- Dataset file not found: check `--data_path`, or for `OpenING`, verify `--meta-path` and `--data-file-name`.
+- A path still contains `<DATA_ROOT>`: the script is using a placeholder default; pass the real path explicitly.
+- Samples are unexpectedly skipped: outputs already exist; review `--resume`, `--overwrite`, or shard outputs.
+- Rerunning a single shard gives odd behavior: for `Unimmmu` and `RealUnify (GEU)`, delete that shard's outputs first and rerun without `--resume`.
+- UEG scoring fails immediately: `calculate_score_ueg.py` needs a user-supplied `GeminiAPI` wrapper.
--- a/SenseNova-U1/evaluation/docs/understanding.md
+++ b/SenseNova-U1/evaluation/docs/understanding.md
+# Visual Understanding Evaluation
+
+Reproduction guide for SenseNova-U1 on visual understanding benchmarks.
+
+The pipeline is built on top of [EvalScope](https://github.com/OpenSenseNova/evalscope/tree/neo) (Native backend). EvalScope calls the model through an OpenAI-compatible HTTP endpoint and, for open-ended benchmarks, scores the predictions with an LLM judge.
+
+Reference config and launcher live under `evaluation/understanding/`:
+
+- `evaluation/understanding/config.yaml` — evaluation configuration
+- `evaluation/understanding/es.py` — single-entry launcher
+
+## 1. Overview
+
+```
+┌──────────────┐     OpenAI-compatible     ┌─────────────┐
+│  es.py       │ ───── HTTP requests ────▶ │ Model API   │
+│  (EvalScope) │                           │ (lightllm)  │
+└──────┬───────┘ ◀──── generations ─────── └─────────────┘
+       │
+       ▼
+   results/                 (predictions, judge scores, aggregated metrics)
+```
+
+1. Deploy SenseNova-U1 behind an OpenAI-compatible endpoint (the reference setup uses lightllm).
+2. Fill in `config.yaml` with the endpoint, model name, datasets, and generation parameters.
+3. Run `python es.py` — it calls `evalscope.run.run_task(task_cfg="config.yaml")`, which loops over the datasets, issues requests in parallel, and writes predictions plus scores under `results/`.
+
+## 2. Launcher
+
+`evaluation/understanding/es.py` is deliberately tiny:
+
+```python
+from evalscope.run import run_task
+
+run_task(task_cfg="config.yaml")
+```
+
+Everything else is driven by `config.yaml`.
+
+## 3. Benchmarks
+
+The reference run evaluates on:
+
+- `mmmu_pro`
+- `mmlu_pro`
+- `mm_bench`
+- `ai2d`
+- `math_vista`
+- `ifeval`
+
+Add or remove items under `datasets:` to extend the evaluation.
+
+## 4. Main Generation Parameters
+
+These are the parameters the model is sampled with. They live under `generation_config:` in `config.yaml` and are forwarded to the OpenAI-compatible API.
+
+| Parameter | Value | Meaning |
+| --- | --- | --- |
+| `stream` | `false` | Return the full response in one shot; simpler to score and log. |
+| `temperature` | `0.6` | Sampling temperature — recommended setting for thinking-enabled models. |
+| `top_p` | `0.95` | Nucleus sampling cutoff; used together with `top_k`. |
+| `max_tokens` | `32768` | Upper bound on generated tokens per sample. Large because `<think>…</think>` traces can be long. |
+| `timeout` | `300` | Per-request timeout in seconds. |
+| `extra_body.top_k` | `20` | Restrict sampling to the top-20 tokens at each step. |
+| `extra_body.repetition_penalty` | `1.05` | Mild penalty to suppress loops in long reasoning traces. |
+| `extra_body.chat_template_kwargs.enable_thinking` | `true` | Let the chat template emit a `<think>…</think>` section before the final answer. |
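For orientation, the sketch below assembles a chat-completions request body from the values in the table. It assumes that the `extra_body` fields are merged into the top level of the JSON body, which is how the OpenAI Python SDK treats `extra_body`; whether EvalScope builds the request in exactly this way is an assumption, and `build_chat_payload` is a hypothetical helper.

```python
import json


def build_chat_payload(model: str, prompt: str) -> dict:
    """Assemble a request body using the generation parameters from the table."""
    generation_config = {
        "stream": False,
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 32768,
    }
    extra_body = {
        "top_k": 20,
        "repetition_penalty": 1.05,
        "chat_template_kwargs": {"enable_thinking": True},
    }
    # Assumption: non-standard fields sit next to the standard ones in the body.
    # `timeout` is a client-side setting, so it is not part of the payload.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **generation_config,
        **extra_body,
    }


print(json.dumps(build_chat_payload("SenseNova-U1", "Describe the image."), indent=2))
```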
+
+Post-processing on the prediction:
+
+- `dataset_args.remove_until: </think>` — everything up to and including the closing `</think>` tag is stripped before grading, so only the final answer is scored.
+- `ignore_errors: true` — transient single-sample API failures do not abort the whole run.
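The effect of `remove_until` can be illustrated with a small helper. This is a sketch of the behaviour described above, not EvalScope's implementation: cutting at the last occurrence of the marker and trimming leading whitespace are both assumptions.

```python
def remove_until(text: str, marker: str = "</think>") -> str:
    """Drop everything up to and including the last occurrence of marker."""
    _, found, tail = text.rpartition(marker)
    # If the marker is absent, the prediction is returned unchanged.
    return tail.lstrip() if found else text
```

So a prediction such as `<think>long trace</think>\nB` would be graded as `B`, while a prediction with no thinking section is graded as is.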
+
+## 5. Judge Model
+
+Open-ended benchmarks are scored by an LLM judge.
+
+| Field | Value |
+| --- | --- |
+| `judge_worker_num` | `64` (parallel judge calls) |
+| `judge_model_args.model_id` | `gpt-4o-mini-2024-07-18` |
+| `judge_model_args.api_key` | *(fill in)* |
+| `judge_model_args.api_url` | *(fill in — OpenAI-compatible judge endpoint)* |
+| `judge_model_args.generation_config.max_tokens` | `4096` |
+| `judge_model_args.generation_config.timeout` | `300` |
+
+The reference `config.yaml` leaves the judge `api_key` / `api_url` blank; fill them before running judge-dependent tasks.
+
+## 6. Runtime Settings
+
+| Field | Value | Meaning |
+| --- | --- | --- |
+| `eval_backend` | `Native` | EvalScope native backend. |
+| `eval_type` | `openai_api` | Drive the model through an OpenAI-compatible endpoint. |
+| `eval_batch_size` | `64` | In-flight concurrent requests sent to the model server. |
+| `api_url` | `http://<host>:8000/v1/` | OpenAI-compatible serving endpoint (lightllm in the reference setup). |
+| `model` | `SenseNova-U1` | Model name as exposed by the serving endpoint. |
+| `use_cache` | `results/` | Reuse previously generated answers — supports resume / retry. |
+| `work_dir` | `results/` | Output root for predictions, judgments, and scores. |
+| `no_timestamp` | `true` | Write into a stable directory (plays well with `use_cache`). |
+
+## 7. Reference `config.yaml`
+
+```yaml
+eval_backend: Native
+eval_type: openai_api
+eval_batch_size: 64
+api_url: http://<host>:8000/v1/   # lightllm deployment
+model: SenseNova-U1
+datasets:
+  - mmmu_pro
+  - mmlu_pro
+  - mm_bench
+  - ai2d
+  - math_vista
+  - ifeval
+dataset_args:
+  remove_until: </think>
+ignore_errors: true
+generation_config:
+  stream: false
+  temperature: 0.6
+  timeout: 300
+  max_tokens: 32768
+  top_p: 0.95
+  extra_body:
+    top_k: 20
+    repetition_penalty: 1.05
+    chat_template_kwargs:
+      enable_thinking: true
+
+judge_worker_num: 64
+judge_model_args:
+  api_key: ""
+  api_url: ""
+  model_id: gpt-4o-mini-2024-07-18
+  generation_config:
+    max_tokens: 4096
+    timeout: 300
+use_cache: results/
+work_dir: results/
+no_timestamp: true
+```
+
+## 8. Running the Evaluation
+
+1. Deploy SenseNova-U1 on an OpenAI-compatible endpoint and confirm connectivity:
+
+   ```bash
+   curl -sSf -m 5 "$api_url"
+   ```
+
+2. Edit `evaluation/understanding/config.yaml`: set `api_url`, `model`, and the judge `api_key` / `api_url` if needed.
+3. Launch:
+
+   ```bash
+   cd evaluation/understanding
+   python es.py
+   ```
+
+Predictions, judge outputs, and final scores are written under `results/`. Because `use_cache: results/` and `no_timestamp: true` are set, rerunning the command skips already-answered samples, so interrupting and resuming is safe.
--- a/SenseNova-U1/evaluation/easi/GUIDE.md
+++ b/SenseNova-U1/evaluation/easi/GUIDE.md
+# EASI Benchmarking — Full Guide
+
+End-to-end setup to run visual-understanding benchmarks (VLMEvalKit, EASI, lmms-eval) against SenseNova-U1 by exposing the model as an **OpenAI-compatible HTTP endpoint** via the LightLLM inference server, then pointing the benchmark toolkit at `/v1/chat/completions`.
+
+This is the comprehensive reference. For the quickstart + file layout, see [`README.md`](README.md).
+
+This guide covers a **native (no Docker) install** on a Linux host with NVIDIA GPUs. The upstream-supported path is the Docker image `lightx2v/lightllm_lightx2v:20260407` documented in [`docs/deployment_CN.md`](../../docs/deployment_CN.md) — use that when Docker/nvidia-container-toolkit is available. This native recipe is the fallback for sandboxed environments (containers, chroot, clusters without privileged pod access).
+
+## Why LightLLM (not `transformers`)
+
+`examples/*/inference.py` scripts use the `transformers` backend, which is fine for one-off inference but not servable:
+
+- `NEOChatModel` is registered only with `AutoModel` / `AutoConfig`, not with `AutoModelForImageTextToText` or an `AutoProcessor`. `transformers serve` (4.57+) and `text-generation-inference` both dispatch on those mappings — they will not discover SenseNova-U1.
+- Inference uses a custom `model.chat(tokenizer, pixel_values, question, gen_cfg, grid_hw=...)` signature (`src/sensenova_u1/models/neo_unify/modeling_neo_chat.py:1732`), not the standard `processor(images, text) → model.generate()` flow that serving stacks expect.
+- vLLM / SGLang have no built-in NEO-Unify model class; porting is a weeks-long task.
+
+LightLLM has native `neo_chat` + `neo_chat_moe` model implementations (`lightllm/models/neo_chat/`, `lightllm/models/neo_chat_moe/`) and exposes an OpenAI-compatible `/v1/chat/completions`, which is exactly what benchmark toolkits consume.
+
+**Skipped in this guide: LightX2V.** LightX2V is only imported when `--enable_multimodal_x2i` is passed (see `lightllm/server/x2i_server/manager.py`). Visual understanding benchmarks do not generate images, so we omit that flag and avoid LightX2V's `torch<=2.8.0` pin (which conflicts with LightLLM's `torch==2.9.1` requirement). Image-generation benchmarks — when those scripts land — will need the full LightLLM + LightX2V stack via the Docker image.
+
+---
+
+## 1) Host prerequisites
+
+| Item | Required |
+| :--- | :--- |
+| OS | Linux (x86_64) |
+| NVIDIA driver | ≥ 550.x (recommended 550.90.07+) |
+| GPU | Hopper / Ampere class with compute capability ≥ 8.0 (sm_80 or newer). Verified on H100 80GB |
+| Python | 3.10 (matches LightLLM's Dockerfile) |
+| `uv` | `uv >= 0.9`. Install: <https://docs.astral.sh/uv/getting-started/installation/> |
+| System lib | `libnuma1`, `libnuma-dev` (required by `sgl-kernel`) |
+
+Install the system lib once (needs sudo):
+
+```bash
+sudo apt-get install -y libnuma1 libnuma-dev
+```
+
+The CUDA runtime ships inside the `torch==2.9.1+cu128` wheel, so you do NOT need a matching CUDA toolkit on the host as long as the host driver can run CUDA 12.8 binaries (the 550.x series relies on CUDA minor-version compatibility for this).
+
+---
+
+## 2) One-shot install
+
+`evaluation/easi/scripts/setup.sh` installs the full pipeline in one idempotent run:
+
+| Phase | What |
+| :--- | :--- |
+| 1 | Host prereq checks (`uv`, `libnuma`, driver) |
+| 2 | Recursive submodule init (LightLLM, EASI, EASI/VLMEvalKit, EASI/lmms-eval) |
+| 3 | `.venv-lightllm/` Python 3.10 venv at repo root |
+| 4-6 | Pinned LightLLM deps + vllm + editable LightLLM + transitive fixes (pandas) |
+| 7 | Apply local patches from `evaluation/easi/lightllm-stack/patches/` |
+| 8 | Verify LightLLM imports + api_server CLI |
+| 9 | **EASI client venv** — delegates to `EASI/scripts/setup.sh` (Py 3.11, VLMEvalKit + lmms-eval + flash-attn) |
+| 10 | **Endpoint registration** — sitecustomize.py injector into the EASI venv |
+
+```bash
+bash evaluation/easi/scripts/setup.sh
+```
+
+Flags:
+- `--skip-lightllm` — skip phases 1, 3-7 (DON'T install LightLLM). Use when you have a SenseNova-U1 OpenAI-compatible endpoint already (docker elsewhere, remote infra, etc.). Edit `config/sensenova_models.py` to point `api_base` at your endpoint
+- `--skip-easi` — skip phase 9 (flash-attn build is slow; useful for fast reruns when only touching the LightLLM side)
+- `--skip-register` — skip phase 10 (no auto endpoint wiring)
+
+Re-running is safe — each step checks whether it already ran. If you just want to understand what happens under the hood, sections 2a–2f below spell out phases 2-8.
+
+### 2a. LightLLM submodule
+
+LightLLM is pinned as a git submodule at `evaluation/easi/lightllm-stack/LightLLM`, tracking branch `neo_plus_clean` (the branch that contains NEO-Unify model support). `evaluation/easi/lightllm-stack/patches/` holds any local fixes applied on top.
+
+```bash
+git submodule update --init evaluation/easi/lightllm-stack/LightLLM
+```
+
+To bump to a newer LightLLM commit later:
+
+```bash
+cd evaluation/easi/lightllm-stack/LightLLM
+git fetch origin neo_plus_clean && git checkout origin/neo_plus_clean
+cd -
+git add evaluation/easi/lightllm-stack/LightLLM   # records the new submodule SHA
+```
+
+> `LightX2V` can be cloned alongside for image-generation workloads later. Unused for VQA-only serving; not included as a submodule.
+
+### 2b. Create the serving venv
+
+Keep this venv separate from the main `.venv` used by `examples/*/inference.py` — LightLLM pins `torch==2.9.1` while the main env uses torch 2.8.
+
+```bash
+cd /path/to/SenseNova-U1
+uv venv -p 3.10 .venv-lightllm
+source .venv-lightllm/bin/activate
+uv pip install --upgrade pip
+```
+
+`.venv-lightllm/` is already in `.gitignore`.
+
+### 2c. Strip non-installable pins from `requirements.txt`
+
+The upstream `LightLLM/requirements.txt` includes two packages that fail in a clean environment:
+
+- **`nixl==0.8.0`** — not published to PyPI; built from source inside the Dockerfile against a custom UCX build. Only used by `--run_mode nixl_prefill/nixl_decode` (PD-disaggregation over RDMA) which we are not running.
+- **`cchardet==2.1.7`** — archived upstream, build fails on modern `setuptools`. Optional character-detection helper; `chardet` is pulled in transitively as a fallback.
+
+```bash
+grep -v "^nixl\|^cchardet" evaluation/easi/lightllm-stack/LightLLM/requirements.txt > /tmp/lightllm-req.txt
+```
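+
+For hosts where a Python step is more convenient than `grep`, the same prefix filter can be sketched as below. The function name `filter_requirements` and the sample lines are illustrative only; the logic mirrors the `^nixl` / `^cchardet` prefix match of the command above and nothing more.
+
+```python
+import re
+
+BLOCKED = ("nixl", "cchardet")
+
+
+def filter_requirements(lines, blocked=BLOCKED):
+    """Keep every requirement line that does not start with a blocked name."""
+    pattern = re.compile(r"^(?:%s)" % "|".join(map(re.escape, blocked)))
+    return [line for line in lines if not pattern.match(line)]
+
+
+sample = ["torch==2.9.1", "nixl==0.8.0", "cchardet==2.1.7", "chardet"]
+print(filter_requirements(sample))
+```
+
+Note that `chardet` survives the filter: the match is anchored at the start of the line, so only `cchardet` is dropped.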
+
+### 2d. Install
+
+```bash
+# pinned deps (torch 2.9.1+cu128, flashinfer, sgl-kernel, xformers, triton, ...)
+uv pip install -r /tmp/lightllm-req.txt
+
+# vllm is a hard runtime dep (used for shared utilities)
+uv pip install --no-cache-dir vllm==0.16.0
+
+# LightLLM itself, editable
+uv pip install --no-cache-dir -e evaluation/easi/lightllm-stack/LightLLM
+
+# transitive dep missing from upstream requirements
+uv pip install pandas
+```
+
+### 2e. Apply local patches
+
+See [§6 Known patches](#6-known-patches). `setup.sh` applies these automatically; if you installed by hand:
+
+```bash
+cd evaluation/easi/lightllm-stack/LightLLM
+for p in ../patches/*.patch; do git apply "$p"; done
+```
+
+### 2f. Verify
+
+```bash
+python -c "
+import torch; print('torch', torch.__version__, 'cuda', torch.cuda.is_available())
+import flashinfer, sgl_kernel, xformers, vllm, lightllm
+print('flashinfer', 'sgl-kernel', 'xformers ok', 'vllm', vllm.__version__)
+print('lightllm ok')
+"
+
+python -m lightllm.server.api_server --help | head -5
+```
+
+Expected:
+
+```
+torch 2.9.1+cu128 cuda True
+flashinfer sgl-kernel xformers ok vllm 0.16.0
+lightllm ok
+usage: api_server.py [-h] ...
+```
+
+### Features we skip (and when you'd re-enable them)
+
+| Feature | Re-enable when | How |
+| :--- | :--- | :--- |
+| **FlashMLA** | Running DeepSeek MLA-style attention (SenseNova-U1 uses Qwen3 GQA, not MLA) | Follow `evaluation/easi/lightllm-stack/LightLLM/docker/Dockerfile:49-53` |
+| **DeepEP + NVSHMEM** | Multi-node MoE on InfiniBand-connected GPUs | `Dockerfile:78-100`; needs root for gdrcopy |
+| **NIXL + custom UCX** | PD-disaggregated serving (`--run_mode nixl_*`) | `Dockerfile:102-138`; needs root, RDMA stack |
+| **LightMem** | Disk KV-cache offload | `Dockerfile:60-64` |
+| **LightX2V (image gen)** | Running image-generation benchmarks | Install `evaluation/easi/lightllm-stack/LightX2V` in a separate venv with `torch<=2.8.0`, or use the upstream Docker image |
+
+---
+
+## 3) Launch the server
+
+Helper script: `evaluation/easi/scripts/serve.sh`. It auto-activates `.venv-lightllm`, auto-downloads model weights on first run if they are missing, and picks a per-model default port so that different models can run concurrently without clashing.
+
+### Model → port mapping
+
+| `MODEL` value | HF repo | Port | `--reasoning_parser` default | Advertised `model` name |
+| :--- | :--- | :---: | :--- | :--- |
+| `8b-mot` *(default)* | `sensenova/SenseNova-U1-8B-MoT` | 8000 | `qwen3` (strips `<think>…</think>`) | `sensenova-u1-8b-mot` |
+
+### Defaults
+```bash
+# 8b-mot, GPUs 0-1, tp=2, port 8000
+bash evaluation/easi/scripts/serve.sh
+```
+
+### Max throughput on 8× H100 (single model)
+```bash
+MODEL=8b-mot GPUS=0,1,2,3,4,5,6,7 TP=8 bash evaluation/easi/scripts/serve.sh
+```
+
+### Env vars (full list)
+
+| Var | Default | Notes |
+| :--- | :--- | :--- |
+| `MODEL` | `8b-mot` | Only `8b-mot` is currently accepted. Ignored if `MODEL_DIR` is set |
+| `MODEL_DIR` | `./models/SenseNova-U1-Mini-<Beta\|SFT>` | Absolute path overrides |
+| `GPUS` | `0,1` | Comma-separated `CUDA_VISIBLE_DEVICES` |
+| `TP` | `2` | Tensor-parallel degree; must equal `GPUS` count |
+| `HOST` | `0.0.0.0` | |
+| `PORT` | per-model (8000 / 8001) | Overrides the default port from the table above |
+| `MAX_LEN` | `32768` | `--max_req_total_len` |
+| `MEM_FRAC` | `0.85` | `--mem_fraction` — fraction of GPU mem for KV cache |
+| `MODEL_NAME` | per-model | Advertised via `/v1/models`; benchmark client `model` field must match |
+| `REASONING` | per-model | `--reasoning_parser`. `qwen3` for beta, disabled for sft. Set to empty string to disable on beta |
+| `NO_AUTO_DL` | unset | Set to `1` to skip auto-download when model dir is missing (error out instead) |
+
+### First-launch warmup
+
+Triton / CUDA kernels compile on first request; the first `/v1/chat/completions` call can take **several minutes** to return. Subsequent calls are cached and fast. Health-check with:
+
+```bash
+# after startup log shows "Uvicorn running on http://0.0.0.0:8000"
+curl -s http://localhost:8000/v1/models | head
+curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
+  -d '{"model":"sensenova-u1-8b-mot","messages":[{"role":"user","content":"hi"}]}' | head -c 500
+```
+
+For a proper multimodal smoke test (image + text), use [`examples/serving/client.py`](../../examples/serving/client.py):
+
+```bash
+source .venv/bin/activate              # the main sensenova_u1 venv has the `requests` client deps
+python examples/serving/client.py \
+  --mode vqa \
+  --prompt "Describe this image." \
+  --image_path examples/vqa/data/images/menu.jpg \
+  --url http://localhost:8000/v1
+```
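+
+For readers who want to see what an image + text request looks like on the wire, here is a minimal sketch of the OpenAI-style multimodal payload. It is not taken from `examples/serving/client.py`; the helper name `build_vqa_payload` and the three placeholder bytes standing in for a JPEG are assumptions for illustration.
+
+```python
+import base64
+import json
+
+
+def build_vqa_payload(model, prompt, image_bytes, mime="image/jpeg"):
+    """Build an OpenAI-compatible chat request with one inline image."""
+    encoded = base64.b64encode(image_bytes).decode("ascii")
+    data_uri = f"data:{mime};base64,{encoded}"
+    return {
+        "model": model,  # must match the name advertised by /v1/models
+        "messages": [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": data_uri}},
+                    {"type": "text", "text": prompt},
+                ],
+            }
+        ],
+    }
+
+
+# Placeholder bytes only; a real call would read the image file instead.
+payload = build_vqa_payload("sensenova-u1-8b-mot", "Describe this image.", b"\xff\xd8\xff")
+print(json.dumps(payload)[:100])
+```
+
+The resulting dictionary can be POSTed as JSON to `/v1/chat/completions` with any HTTP client.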
+
+---
+
+### Standalone weight download (if you want to pre-fetch)
+
+`evaluation/easi/scripts/serve.sh` calls this for you automatically. To run it independently:
+
+```bash
+source .venv-lightllm/bin/activate     # hf CLI lives in this venv
+bash evaluation/easi/scripts/download_weights.sh 8b-mot
+```
+
+Weights land at `./models/SenseNova-U1-<Mini|Flash>-Beta/`. `./models/` is gitignored. Set `export HF_TOKEN=hf_...` if the HF repo is gated.
+
+---
+
+## 4) Point VLMEvalKit / EASI at the endpoint
+
+The server speaks OpenAI `/v1/chat/completions`. Any OpenAI-compatible client works unchanged. For VLMEvalKit the `GPT4V` wrapper (`vlmeval/api/gpt.py`) is the right client class.
+
+EASI is checked out as a submodule at `evaluation/easi/EASI` (tracking `EvolvingLMMs-Lab/EASI@main`). Its VLMEvalKit fork is at `evaluation/easi/EASI/VLMEvalKit` — nested submodule.
+
+VLMEvalKit's `run.py --config <json>` flag is mutually exclusive with `--data` / `--model`, and EASI's `run_easi_eval.py` uses the latter. The model entry must therefore exist in `vlmeval.config.supported_VLM` at import time.
+
+### Tracked-file wire-up (what `setup.sh` does)
+
+Two files you care about:
+
+| Path | Role |
+| :--- | :--- |
+| `evaluation/easi/config/sensenova_models.py` | **Editable source of truth** for endpoint URLs, ports, `max_tokens`, temperature, retry, etc. Tweak here, commit, done |
+| `evaluation/easi/patches/easi_sensenova_config.patch` | 7-line patch that appends an import hook to `VLMEvalKit/vlmeval/config.py` |
+
+`setup.sh` Phase 10:
+1. Copies `config/sensenova_models.py` → `EASI/VLMEvalKit/vlmeval/sensenova_models.py`
+2. Applies `patches/easi_sensenova_config.patch` to `EASI/VLMEvalKit/vlmeval/config.py` — adds:
+   ```python
+   try:
+       from .sensenova_models import entries as _sensenova_u1_entries
+       supported_VLM.update(_sensenova_u1_entries)
+   except Exception as _e:
+       import sys; print(f"[sensenova-u1-config] failed: {_e}", file=sys.stderr)
+   ```
+
+Both steps are idempotent: the patch is reverse-checked before it is applied, and the copy always overwrites (cheap). Tweak the editable module, re-run `setup.sh`, done.
+
+### Tweak workflow
+
+```bash
+# edit endpoint/port/max_tokens/temperature:
+$EDITOR evaluation/easi/config/sensenova_models.py
+
+# propagate to VLMEvalKit:
+bash evaluation/easi/scripts/setup.sh --skip-easi   # (--skip-easi skips the slow EASI venv check)
+```
+
+Verify:
+
+```bash
+source evaluation/easi/EASI/.venv/bin/activate
+python -c 'from vlmeval.config import supported_VLM; print([k for k in supported_VLM if "SenseNova-U1-" in k])'
+# ['SenseNova-U1-8B-MoT-Local']
+```
+
+### Why not edit `config.py` directly?
+
+The VLMEvalKit submodule is pinned to a specific upstream SHA. Any direct edit becomes dirty submodule state that doesn't survive `git submodule update`. The tracked-file + patch pattern is git-friendly: edits live in the parent SenseNova-U1 repo and re-apply cleanly after submodule bumps.
+
+### Configuring a custom OpenAI-compatible endpoint
+
+`config/sensenova_models.py` is a plain Python module containing an `entries: dict[str, partial[GPT4V]]` top-level variable. Each entry maps a **model key** (what you pass to `run_easi_eval.py --model …`) to a partially-applied `GPT4V` client bound to an endpoint.
+
+Template:
+
+```python
+from functools import partial
+from vlmeval.api.gpt import GPT4V  # type: ignore[import-not-found]
+
+entries = {
+    "<YourModelName>": partial(
+        GPT4V,
+        model="<advertised-model-name>",                    # must match the server's `model` field
+        api_base="http://<host>:<port>/v1/chat/completions",
+        key="<api-key-or-dummy>",
+        temperature=0,                                      # 0 = greedy (deterministic)
+        max_tokens=8192,                                    # higher for thinking models
+        retry=10,                                           # per-request retries on 5xx / timeout
+        verbose=False,
+    ),
+}
+```
+
+### `GPT4V` kwargs reference
+
+| Kwarg | Type | Notes |
+| :--- | :--- | :--- |
+| `model` | str | Value echoed to the server in the request `model` field. Must match `--model_name` on the server side (LightLLM: `MODEL_NAME` env var; vLLM: `--served-model-name`, etc.) |
+| `api_base` | str | Full path to the chat-completions endpoint, including `/v1/chat/completions`. Works for any OpenAI-compatible server (LightLLM, vLLM, SGLang, TGI, OpenRouter, Anthropic-via-openai-shim, etc.) |
+| `key` | str | Bearer token. Use `"dummy"` for auth-less local servers |
+| `temperature` | float | `0` for deterministic benchmarking; set > 0 if a benchmark needs sampling |
+| `max_tokens` | int | Generation cap. Thinking models (e.g. SenseNova-U1-8B-MoT) need ≥ 8192 so they don't truncate mid-`<think>` |
+| `top_p` | float | Nucleus sampling cutoff; default 1.0 (no trim) |
+| `retry` | int | HTTP-level retries on 5xx / timeout. 10 is generous |
+| `wait` | float | Seconds between retries; defaults to exponential backoff |
+| `verbose` | bool | Per-request logging; leave False for benchmark runs |
+| `img_detail` | `"low"` / `"high"` / `"auto"` | Passed through to `image_url.detail`; only matters for servers that honor it (GPT-4o). LightLLM ignores it |
+| `timeout` | int | Per-request timeout in seconds. Defaults to 60 — bump to 180+ for thinking models on slow hardware |
+| `system_prompt` | str | Prepended as the system message. Leave unset unless a benchmark demands a specific persona |
+
+Full kwarg list: `evaluation/easi/EASI/VLMEvalKit/vlmeval/api/gpt.py`.
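+
+Because `api_base` must be the full chat-completions URL rather than the server root, a small normalizer can catch the most common misconfiguration. This is a hypothetical convenience helper (`chat_completions_url` is not part of VLMEvalKit), sketched under the assumption that the server uses the standard `/v1/chat/completions` path.
+
+```python
+def chat_completions_url(base: str) -> str:
+    """Return `base` extended to the full /v1/chat/completions path."""
+    suffix = "/v1/chat/completions"
+    base = base.rstrip("/")
+    if base.endswith(suffix):
+        return base  # already a full endpoint URL
+    if base.endswith("/v1"):
+        return base + "/chat/completions"
+    return base + suffix
+
+
+for candidate in ("http://localhost:8000", "http://localhost:8000/v1/"):
+    print(chat_completions_url(candidate))
+```
+
+Both inputs resolve to the same URL, which is the value the `api_base` kwarg expects.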
+
+### Examples
+
+**Local LightLLM (default)** — what `setup.sh` ships out of the box:
+```python
+"SenseNova-U1-8B-MoT-Local": partial(
+    GPT4V,
+    model="sensenova-u1-8b-mot",
+    api_base="http://localhost:8000/v1/chat/completions",
+    key="dummy", temperature=0, max_tokens=32768, retry=10, verbose=False,
+),
+```
+
+**Remote endpoint (infra team or production)**:
+```python
+"SenseNova-U1-8B-MoT-Prod": partial(
+    GPT4V,
+    model="sensenova-u1-8b-mot",
+    api_base="https://sensenova-u1.internal.example.com/v1/chat/completions",
+    key="sk-your-real-token",
+    temperature=0, max_tokens=32768, retry=5, verbose=False,
+),
+```
+
+**Endpoint that needs `enable_thinking` toggled off** (subclass pattern):
+```python
+from functools import partial
+from vlmeval.api.gpt import GPT4V
+
+class _SenseNovaNoThinking(GPT4V):
+    def generate_inner(self, inputs, **kwargs):
+        kwargs["chat_template_kwargs"] = {"enable_thinking": False}
+        return super().generate_inner(inputs, **kwargs)
+
+entries = {
+    "SenseNova-U1-8B-MoT-Local-NoThink": partial(
+        _SenseNovaNoThinking,
+        model="sensenova-u1-8b-mot",
+        api_base="http://localhost:8000/v1/chat/completions",
+        key="dummy", temperature=0, max_tokens=2048, retry=10, verbose=False,
+    ),
+}
+```
+
+After any edit: `bash evaluation/easi/scripts/setup.sh --skip-lightllm --skip-easi` to propagate into `VLMEvalKit/vlmeval/sensenova_models.py`, then your new model key is available via `run_easi_eval.py --model <YourModelName>`.
+
+### Running benchmarks
+
+```bash
+source evaluation/easi/EASI/.venv/bin/activate
+cd evaluation/easi/EASI
+```
+
+**Single benchmark**:
+```bash
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --output-dir eval_results_sensenova-u1-8b-mot_viewspatial \
+  --api-nproc 16 \
+  --benchmarks viewspatial
+```
+
+**Full EASI-8 suite** (omit `--benchmarks`):
+```bash
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --output-dir eval_results_sensenova-u1-8b-mot \
+  --api-nproc 16
+```
+
+**Multiple benchmarks**:
+```bash
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --benchmarks viewspatial,blink,3dsrbench \
+  --api-nproc 16 \
+  --output-dir eval_results_sensenova-u1-8b-mot
+```
+
+Benchmark keys (EASI-8): `vsi_bench, mmsi_bench, mindcube_tiny, viewspatial, site_image, site_video, blink, 3dsrbench, embspatial`. Alias `site` expands to `site_image + site_video`.
+
+Useful flags: `--no-judge` (trust exact_matching), `--rerun` (skip resume), `--verbose` (per-sample output), `--include-extra` (add non-EASI-8 benches), `--submit` (push to leaderboard — needs `HF_TOKEN`).
+
+`--api-nproc` = concurrent HTTP requests to LightLLM. 16-32 on 8× H100 with `tp=8`. Lower if you see timeouts / 500s.
+
+### Plain VLMEvalKit
+
+Same `GPT4V` wrapper pattern — just add the entry to whichever config file the standalone VLMEvalKit uses, then run `python run.py --model <key> --data <bench>`.
+
+### Thinking-mode toggle
+
+If a benchmark requires disabling (or forcing) the `<think>…</think>` reasoning block, subclass `GPT4V` and inject `chat_template_kwargs` into the payload. Pattern already used for Qwen3.5-VL at `evaluation/easi/EASI/VLMEvalKit/vlmeval/config.py:127-137`:
+
+```python
+class _SenseNovaNoThinking(GPT4V):
+    def generate_inner(self, inputs, **kwargs):
+        kwargs["chat_template_kwargs"] = {"enable_thinking": False}
+        return super().generate_inner(inputs, **kwargs)
+```
+
+### `max_tokens` and thinking mode
+
+When thinking is on (default), the `<think>...</think>` block alone can run thousands of tokens. If generation hits `max_tokens` before `</think>`, the reasoning parser returns an **empty `content`** (all the text is trapped in `reasoning_content`, the final answer never gets emitted). For VLMEvalKit, this shows up as blank model outputs / parser failures.
+
+- For thinking-mode benchmarks: set `max_tokens >= 8192` in the `GPT4V` partial, or higher for multi-hop reasoning benches. The template above uses `max_tokens=8192`; the shipped `SenseNova-U1-8B-MoT-Local` entry uses `32768`.
+- For benchmarks that don't need reasoning: disable thinking via the `_SenseNovaNoThinking` subclass pattern and drop `max_tokens` back to `2048` to save latency.
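+
+A quick way to confirm that blank outputs come from truncation is to inspect the raw response. The sketch below assumes the server returns the usual OpenAI-style shape (`choices[0].message`, `finish_reason == "length"` on a token-limit stop) alongside the `reasoning_content` field described above; the helper name `truncated_in_thinking` and the sample response are illustrative, not captured from a real server.
+
+```python
+def truncated_in_thinking(response: dict) -> bool:
+    """True if generation hit the token cap before any final answer was emitted."""
+    choice = response["choices"][0]
+    message = choice.get("message", {})
+    content = (message.get("content") or "").strip()
+    reasoning = (message.get("reasoning_content") or "").strip()
+    hit_limit = choice.get("finish_reason") == "length"
+    return hit_limit and not content and bool(reasoning)
+
+
+# Hand-written sample shaped like a chat-completions response.
+sample = {
+    "choices": [
+        {
+            "finish_reason": "length",
+            "message": {"content": "", "reasoning_content": "First, consider..."},
+        }
+    ]
+}
+print(truncated_in_thinking(sample))
+```
+
+If this check fires for a large share of samples, raising `max_tokens` or disabling thinking is likely to help more than retrying.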
+
+---
+
+## 5) Troubleshooting
+
+### `libnuma.so.1: cannot open shared object file`
+`libnuma` is a dynamic dependency of `sgl_kernel`. Install the system library:
+```bash
+sudo apt-get install -y libnuma1 libnuma-dev
+```
+
+### `nixl` build fails / `cchardet` build fails
+Strip them from the requirements file as shown in §2c. Neither is needed for single-node serving. `setup.sh` already does this.
+
+### `ModuleNotFoundError: No module named 'pandas'`
+Transitive dep of `lightllm/models/neo_chat_moe/vision_process.py` not declared in upstream `requirements.txt`. Install manually: `uv pip install pandas`. `setup.sh` handles this automatically.
+
+### `jinja2.exceptions.UndefinedError: 'list object' has no attribute 'startswith'`
+The HF chat template was called on an OpenAI-style multimodal content list. Fixed by the `build_prompt_flatten_content.patch` in §6. If you freshly cloned LightLLM and skipped `setup.sh`, apply it manually:
+```bash
+cd evaluation/easi/lightllm-stack/LightLLM
+git apply ../patches/build_prompt_flatten_content.patch
+```
+
+### `iptables: Permission denied` (during `docker run`)
+You are inside an unprivileged container (Kubernetes pod, LXC, chroot). Docker-in-Docker is blocked. Use this native recipe instead — no Docker needed.
+
+### Server launches but first request hangs several minutes
+Expected. Triton/CUDA kernel compilation on first invocation. Subsequent calls are cached. Confirm by tailing the server log for `compiling` / `compiled` messages.
+
+### `CUDA out of memory`
+Options, in order of impact:
+- Reduce `MEM_FRAC` (e.g. `0.7` leaves more headroom).
+- Lower `MAX_LEN` (`--max_req_total_len`) — kv-cache scales with it.
+- Increase `TP` and give more GPUs.
+- On VLMEvalKit side, lower `--api-nproc` to cap concurrent active requests.
+
+### Model loads but `/v1/chat/completions` 404s
+Check which model is served: `curl http://localhost:8000/v1/models`. The `model` field in the request body must match the value listed there (by default `sensenova-u1-8b-mot`, set via the `MODEL_NAME` env var).
+
+### torch version conflicts when activating both venvs
+They're deliberately separate. `.venv` uses torch 2.8 (for SenseNova-U1 transformers inference); `.venv-lightllm` uses torch 2.9.1 (LightLLM's pin). Activate only one at a time — never source both into the same shell.
+
+---
+
+## 6) Known patches
+
+Local fixes for LightLLM bugs/gaps we've hit. Tracked under `evaluation/easi/lightllm-stack/patches/`. `setup.sh` applies these automatically and skips if already applied.
+
+| Patch | Fixes | Target file |
+| :--- | :--- | :--- |
+| `build_prompt_flatten_content.patch` | OpenAI multimodal `content` lists crash the HF chat template (`'list object' has no attribute 'startswith'`). Adds `_flatten_multimodal_content` step that rewrites list-form content into a string with `<image>`/`<audio>` placeholders before `apply_chat_template`; `NeoChatTokenizer.encode` later expands them to `<img>...</img>` with injected image-token IDs. | `lightllm/server/build_prompt.py` |
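+
+The idea behind the flatten step can be illustrated in isolation. The following is a simplified sketch, not the code in the patch: the function name `flatten_content`, the accepted audio type names (`audio_url`, `input_audio`), and joining the pieces with newlines are all assumptions made for the example.
+
+```python
+def flatten_content(content):
+    """Turn OpenAI list-form content into one string with media placeholders."""
+    if isinstance(content, str):
+        return content  # plain-string content needs no rewriting
+    parts = []
+    for item in content:
+        kind = item.get("type")
+        if kind == "text":
+            parts.append(item.get("text", ""))
+        elif kind == "image_url":
+            parts.append("<image>")
+        elif kind in ("audio_url", "input_audio"):  # assumed type names
+            parts.append("<audio>")
+    return "\n".join(parts)
+
+
+message = [
+    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
+    {"type": "text", "text": "What is on the menu?"},
+]
+print(flatten_content(message))
+```
+
+Once the content is a plain string, a chat template that calls string methods such as `startswith` no longer sees a list and the `UndefinedError` goes away.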
+
+### Recovering after `git submodule update` in `evaluation/easi/lightllm-stack/LightLLM/`
+
+Git submodule updates reset the working tree to the pinned SHA, discarding any applied patches. Re-apply:
+
+```bash
+bash evaluation/easi/scripts/setup.sh            # idempotent; re-applies any drifted patches
+# or, manually:
+cd evaluation/easi/lightllm-stack/LightLLM
+for p in ../patches/*.patch; do
+  git apply --reverse --check "$p" 2>/dev/null || git apply "$p"
+done
+```
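+
+The reverse-check loop above can also be expressed in Python, which makes the skip/apply decision explicit. This is a sketch rather than what `setup.sh` runs: `apply_patch_once`, `FakeResult`, and `fake_run_not_applied` are hypothetical names, and the demo uses a fake runner so that nothing touches a real repository.
+
+```python
+import subprocess
+
+
+def apply_patch_once(patch, run=subprocess.run):
+    """Apply `patch` unless the reverse check shows it is already applied."""
+    # `git apply --reverse --check` exits 0 only if the patch could be cleanly
+    # reverted, i.e. it is already present in the working tree.
+    probe = run(
+        ["git", "apply", "--reverse", "--check", patch],
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.DEVNULL,
+    )
+    if probe.returncode == 0:
+        return "skipped"
+    run(["git", "apply", patch], check=True)
+    return "applied"
+
+
+class FakeResult:
+    """Stand-in for subprocess.CompletedProcess, used for the dry run below."""
+
+    def __init__(self, returncode):
+        self.returncode = returncode
+
+
+def fake_run_not_applied(cmd, **kwargs):
+    # Pretend the reverse check fails (patch missing) and the apply succeeds.
+    return FakeResult(1 if "--reverse" in cmd else 0)
+
+
+print(apply_patch_once("example.patch", run=fake_run_not_applied))
+```
+
+Calling the function with the default `run` executes the real `git` commands in the current working directory, so it should be run from inside the LightLLM submodule.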
+
+If upstream file moves cause `git apply --check` to fail, the patch needs regenerating against the new file — inspect the conflict, reapply the logic manually, and regenerate with `git diff <file> > ../patches/<name>.patch`.
+
+### Contributing fixes upstream
+
+These patches are candidates for upstream PRs to <https://github.com/ModelTC/LightLLM>. The multimodal flatten fix is model-agnostic and should benefit every `/v1/chat/completions` multimodal client.
+
+---
+
+## 7) Recap — shortest path
+
+```bash
+# one-time host prereq (needs sudo)
+sudo apt-get install -y libnuma1 libnuma-dev
+
+# full install: LightLLM stack + EASI client venv + endpoint registration
+bash evaluation/easi/scripts/setup.sh
+
+# launch (auto-downloads weights on first run)
+bash evaluation/easi/scripts/serve.sh            # 8b-mot → port 8000
+
+# benchmark (second shell, after server up)
+source evaluation/easi/EASI/.venv/bin/activate
+cd evaluation/easi/EASI
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --benchmarks blink --api-nproc 16
+```
--- a/SenseNova-U1/evaluation/easi/README.md
+++ b/SenseNova-U1/evaluation/easi/README.md
+# evaluation/easi/ — SenseNova-U1 visual understanding benchmarking
+
+Self-contained subpackage for running benchmark harnesses (VLMEvalKit, EASI, lmms-eval) against a locally-served SenseNova-U1 model.
+
+## Layout
+
+| Path | What |
+| :--- | :--- |
+| `EASI/` | git submodule → `EvolvingLMMs-Lab/EASI@main`. Contains `VLMEvalKit` + `lmms-eval` as its own nested submodules |
+| `lightllm-stack/LightLLM/` | git submodule → `ModelTC/LightLLM@neo_plus_clean` |
+| `lightllm-stack/patches/` | local fixes applied on top of LightLLM |
+| `config/sensenova_models.py` | **editable** VLMEvalKit model entries — edit to tweak endpoint URLs, ports, `max_tokens`, etc. |
+| `patches/easi_sensenova_config.patch` | 7-line hook added to `VLMEvalKit/vlmeval/config.py` that imports the above |
+| `scripts/setup.sh` | one-shot install: submodules, deps, venv, patches, VLMEvalKit wire-up, verify |
+| `scripts/serve.sh` | launch LightLLM server. `DP=1` (default) → single instance. `DP>1` → N replicas + LB on the canonical port |
+| `scripts/lb.py` | least-in-flight HTTP LB used by `serve.sh` when `DP>1` |
+| `scripts/serve_lb.sh` | deprecated shim that forwards to `serve.sh` |
+| `scripts/download_weights.sh` | standalone HF weight fetcher |
+
+Weights land in `<repo_root>/models/` (gitignored). Serving venv `<repo_root>/.venv-lightllm/` (gitignored, separate from the main repo `.venv`).
+
+## Quickstart
+
+```bash
+# one-time host lib (needs sudo)
+sudo apt-get install -y libnuma1 libnuma-dev
+
+# install EVERYTHING — LightLLM stack + EASI client + endpoint registration.
+# Idempotent. First run takes several minutes (builds flash-attn for the EASI venv).
+bash evaluation/easi/scripts/setup.sh
+
+# launch server — auto-downloads weights on first run
+MODEL=8b-mot GPUS=0,1 TP=2 bash evaluation/easi/scripts/serve.sh   # → localhost:8000
+
+# run a benchmark from a second shell
+source evaluation/easi/EASI/.venv/bin/activate
+cd evaluation/easi/EASI
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --benchmarks blink \
+  --api-nproc 16
+```
+
+Setup modes:
+
+```bash
+bash evaluation/easi/scripts/setup.sh                   # full install
+bash evaluation/easi/scripts/setup.sh --skip-lightllm   # bring your own endpoint
+bash evaluation/easi/scripts/setup.sh --skip-easi       # LightLLM side only
+bash evaluation/easi/scripts/setup.sh --skip-register   # no auto VLMEvalKit wiring
+```
+
+### Bring-your-own endpoint
+
+If you already have a SenseNova-U1 OpenAI-compatible endpoint (docker container on another host, infra team API, production deployment, etc.), skip the LightLLM install entirely:
+
+#### 1. Install only EASI + the VLMEvalKit wiring
+
+```bash
+bash evaluation/easi/scripts/setup.sh --skip-lightllm
+```
+
+Skips: host prereq checks, `.venv-lightllm` creation, LightLLM dep install, LightLLM patches, api_server CLI verification. LightLLM submodule is NOT initialized. Only the EASI submodule (+ its nested VLMEvalKit / lmms-eval) gets pulled.
+
+#### 2. Point `config/sensenova_models.py` at your endpoint
+
+Edit the `entries` dict — change `api_base` to wherever your endpoint actually lives. Rename the dict key if you want a clearer label in `run_easi_eval.py --model …`. Full schema + `GPT4V` kwargs documented in [`GUIDE.md`](GUIDE.md) — see the "Configuring a custom OpenAI-compatible endpoint" section.
+
+Quick example — single remote endpoint:
+
+```python
+# evaluation/easi/config/sensenova_models.py
+from functools import partial
+from vlmeval.api.gpt import GPT4V  # type: ignore[import-not-found]
+
+entries = {
+    "SenseNova-U1-8B-MoT-Prod": partial(
+        GPT4V,
+        model="sensenova-u1-8b-mot",
+        api_base="https://your.host.example.com/v1/chat/completions",
+        key="sk-your-real-token-or-dummy",
+        temperature=0,
+        max_tokens=32768,
+        retry=10,
+        verbose=False,
+    ),
+}
+```
+
+#### 3. Propagate the edit into VLMEvalKit
+
+```bash
+bash evaluation/easi/scripts/setup.sh --skip-lightllm --skip-easi
+```
+
+This re-copies `sensenova_models.py` into `EASI/VLMEvalKit/vlmeval/`. Fast — no reinstalls.
+
+#### 4. Run benchmarks
+
+Same as the local-server path — just use whatever key you set in step 2:
+
+```bash
+source evaluation/easi/EASI/.venv/bin/activate
+cd evaluation/easi/EASI
+
+# single benchmark
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Prod \
+  --output-dir eval_results_prod_viewspatial \
+  --api-nproc 16 \
+  --benchmarks viewspatial
+
+# full EASI-8 suite
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Prod \
+  --output-dir eval_results_prod \
+  --api-nproc 16
+
+# multiple specific benchmarks
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Prod \
+  --benchmarks viewspatial,blink,3dsrbench \
+  --api-nproc 16 \
+  --output-dir eval_results_prod
+```
+
+Tune `--api-nproc` based on your endpoint's capacity. Remote endpoints with rate limits: start low (4-8) and ramp up. Strong production backends behind an LB: 32-64.
+
+#### Notes
+
+- `scripts/serve.sh` won't work under `--skip-lightllm` — that's the point. Your endpoint lives elsewhere.
+- If the endpoint exposes multiple SenseNova-U1 variants (or you have several endpoints to benchmark side-by-side), add multiple entries to `sensenova_models.py` — each gets its own model key.
+- The `GPT4V` wrapper auto-handles HTTP retries (on 5xx / timeout) and encodes images as base64 data URIs; no client-side prep needed.
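For orientation, the base64 data-URI form mentioned in the last note looks roughly like the sketch below. It builds an OpenAI-style chat message by hand; `image_part` and `build_message` are hypothetical helper names chosen for illustration, and this is not VLMEvalKit's actual code.

```python
import base64


def image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style `image_url` content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


def build_message(prompt: str, image_bytes: bytes) -> dict:
    """One user message carrying an image part followed by a text part."""
    return {
        "role": "user",
        "content": [image_part(image_bytes), {"type": "text", "text": prompt}],
    }


if __name__ == "__main__":
    # Placeholder bytes, not a real PNG; only the payload shape matters here.
    msg = build_message("What is to the left of the chair?", b"\x89PNG placeholder")
    print(msg["content"][0]["image_url"]["url"][:40])
```

A wrapper that does this internally means benchmark images never need to be hosted at a URL the endpoint can reach.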
+
+## Model → port map
+
+| `MODEL` arg | HF repo | Server port | Reasoning parser |
+| :--- | :--- | :---: | :--- |
+| `8b-mot` (default) | `sensenova/SenseNova-U1-8B-MoT` | 8000 | `qwen3` (strips `<think>`) |
+
+Override the port with `PORT=<n>`.
+
+## Multi-replica (DP) serving behind a load balancer
+
+`serve.sh` auto-switches to multi-replica mode when `DP > 1`: launches N tp-sharded LightLLM replicas on backend ports + a Python load balancer on the canonical port. **Same port as `DP=1`** — VLMEvalKit config never needs to change when scaling up/down.
+
+```bash
+# 4 replicas × tp=2 on 8 GPUs — higher throughput for many short requests
+DP=4 TP=2 bash evaluation/easi/scripts/serve.sh
+
+# 2 replicas × tp=4
+DP=2 TP=4 bash evaluation/easi/scripts/serve.sh
+```
+
+Only one `serve.sh` process per model at a time: both `DP=1` and `DP>1` bind the same canonical port.
+
+Port layout:
+
+| MODEL | LB (client-facing) | Backends |
+| :--- | :---: | :--- |
+| `8b-mot` | 8000 | 8100, 8110, 8120, 8130 (step 10) |
+
+Override with `LB_PORT=...` or `BACKEND_BASE_PORT=...`.
+
+Direct hits to a backend port (e.g. `http://localhost:8100/v1/models`) still work, which is useful for debugging one specific replica.
+
+Sanity-check guardrails (fail fast, no partial launches):
+- `DP * TP <= # visible GPUs` (from `nvidia-smi`) unless `GPUS=...` overrides
+- If `GPUS=...` provided, must contain exactly `DP * TP` entries
+- `LB_PORT` must not collide with any backend port in `[BACKEND_BASE_PORT, BACKEND_BASE_PORT + 10*(DP-1)]`
+- `MODEL` must be `8b-mot`
+- `DP`, `TP`, `LB_PORT`, `BACKEND_BASE_PORT` must be non-negative integers; `DP` and `TP` must be ≥ 1
+- **Pre-flight port probe**: every port (LB + all backends) must be free. Stale processes from a previous run are detected before any replica is launched, with `ss -lntp` / `lsof` output naming the owner when possible
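The arithmetic behind the port layout and the first few guardrails can be sketched in Python as follows. This is an illustration of the rules listed above, not a port of `serve.sh`; `backend_ports` and `check_layout` are hypothetical names, and the port probe is left out because it needs a live socket check.

```python
from __future__ import annotations


def backend_ports(base: int, dp: int, step: int = 10) -> list[int]:
    """Backend ports for `dp` replicas: base, base+10, base+20, ..."""
    return [base + step * i for i in range(dp)]


def check_layout(
    dp: int,
    tp: int,
    lb_port: int,
    base: int,
    visible_gpus: int,
    gpus: list[int] | None = None,
) -> list[str]:
    """Return a list of guardrail violations (empty list means OK)."""
    if dp < 1 or tp < 1:
        return ["DP and TP must be >= 1"]
    errors = []
    need = dp * tp
    if gpus is not None:
        # An explicit GPUS list must contain exactly DP * TP entries.
        if len(gpus) != need:
            errors.append(f"GPUS has {len(gpus)} entries but DP*TP={need}")
    elif need > visible_gpus:
        errors.append(f"DP*TP={need} GPUs required but only {visible_gpus} visible")
    # The LB only exists when DP > 1, so the collision check only applies then.
    if dp > 1 and lb_port in backend_ports(base, dp):
        errors.append(f"LB_PORT={lb_port} collides with a backend port")
    return errors


if __name__ == "__main__":
    print(backend_ports(8100, 4))
    print(check_layout(dp=4, tp=2, lb_port=8000, base=8100, visible_gpus=8))
```

Collecting every violation before reporting matches the "fail fast, no partial launches" intent: nothing is started until the whole layout is known to be valid.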
+
+Balancing: least in-flight. Streaming passthrough. Per-request timeout 30 min (override `LB_REQUEST_TIMEOUT`).
+
+Health: each replica probed every 10s via `GET /v1/models`. Unhealthy backends skipped but not evicted; rejoin when probes pass. Monitor at `GET http://localhost:<LB_PORT>/_lb/status`.
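The selection rule described above (least in-flight among healthy backends, with unhealthy ones skipped rather than removed) is small enough to show in isolation. The sketch below is a condensed, standalone version of the idea with state passed in as arguments; it is meant for reading, not as a replacement for `lb.py`.

```python
def pick_backend(backends, inflight, healthy):
    """Pick the healthy backend with the fewest in-flight requests.

    Ties go to whichever backend appears first in `backends`. If nothing is
    healthy, fall back to the full list so the caller still gets a target
    (the request will then fail upstream and surface the error).
    """
    candidates = [b for b in backends if healthy[b]] or backends
    return min(candidates, key=lambda b: inflight[b])


if __name__ == "__main__":
    backends = ["http://localhost:8100", "http://localhost:8110"]
    inflight = {backends[0]: 5, backends[1]: 2}
    healthy = {backends[0]: True, backends[1]: True}
    print(pick_backend(backends, inflight, healthy))
```

Because unhealthy backends stay in the list, a replica that recovers starts receiving traffic again as soon as its health flag flips back.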
+
+Per-replica logs land at `evaluation/easi/logs/lightllm-<MODEL>-<port>.log` (gitignored). Override with `LOG_DIR=...`. Ctrl-C / SIGTERM on the `serve.sh` shell cascade-kills all replicas + LB.
+
+### Process hygiene
+
+`serve.sh` won't leak zombie GPU procs:
+
+- Each replica + LB is launched via `setsid` into its own process group. Cleanup hits `kill -TERM -$pgid` → 10 s grace → `kill -KILL -$pgid` → the whole tree (router, tp workers, visual server, zmq, detokenizer) goes down.
+- Trap covers `EXIT INT TERM HUP`, not just signals — catches unexpected errors too.
+- Belt + suspenders: `pkill -P $$` (our direct children) + `pkill -f "lightllm.server.api_server.*$MODEL_DIR"` (escaped orphans tied to this model) run after the grace period.
+- PID file at `$LOG_DIR/serve.<MODEL>.pids` records pgids. On next `serve.sh` launch, any stale entries get TERM+KILL automatically before the new run starts.
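The process-group pattern in the first bullet (own session, TERM, grace period, then KILL) can be sketched in Python for readers less familiar with `setsid` and negative PIDs. This is a POSIX-only illustration under stated assumptions: `launch_in_own_group` and `stop_group` are hypothetical names, and `serve.sh` itself does this in shell.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(cmd):
    """Start `cmd` as the leader of a new session and process group."""
    # start_new_session=True makes the child call setsid() before exec.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc, grace=10.0):
    """SIGTERM the whole group, wait up to `grace` seconds, then SIGKILL."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # Anything still alive in the group after the grace period is killed.
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    child = launch_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", stop_group(child, grace=5.0))
```

Signalling the group rather than the single PID is what takes down helper processes the server forked on its own, which a plain `kill $pid` would leave behind.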
+
+If you still end up with zombies (container crashed / SIGKILL'd), run:
+
+```bash
+pkill -KILL -f "lightllm.server.api_server.*SenseNova-U1-8B-MoT"
+```
+
+If GPU memory is held by processes in another PID namespace (the container was torn down and recreated), only a host-side `kill` or a pod restart can recover it.
+
+### Debugging / verbose logging
+
+`serve.sh` accepts:
+
+| Env | Effect |
+| :--- | :--- |
+| `DETAIL_LOG=1` | Adds `--detail_log` to LightLLM — logs per-request timing, prompt text, token IDs, and (for multimodal) per-image ViT inference timings |
+| `LIGHTLLM_LOG_LEVEL=debug` | Drops LightLLM's root logger to DEBUG. Everything that's wrapped in `logger.debug(...)` now prints (lots of internal signal: KV cache state, router scheduling, detokenization). Default is `info` |
+
+```bash
+# verbose single instance
+DETAIL_LOG=1 LIGHTLLM_LOG_LEVEL=debug \
+  MODEL=8b-mot GPUS=0,1 TP=2 bash evaluation/easi/scripts/serve.sh
+
+# verbose multi-replica (env flows to every replica)
+DETAIL_LOG=1 LIGHTLLM_LOG_LEVEL=debug \
+  DP=4 TP=2 MODEL=8b-mot bash evaluation/easi/scripts/serve.sh
+
+# tail one replica's log
+tail -f evaluation/easi/logs/lightllm-8b-mot-8100.log
+```
+
+Use when a benchmark reports 100% API failures or suspicious per-sample outputs — full HTTP/tokenization trace surfaces on the server side.
+
+## Running benchmarks
+
+Activate the EASI client venv and work from `evaluation/easi/EASI`:
+
+```bash
+source evaluation/easi/EASI/.venv/bin/activate
+cd evaluation/easi/EASI
+```
+
+### Single benchmark
+```bash
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --output-dir eval_results_sensenova-u1-8b-mot_viewspatial \
+  --api-nproc 16 \
+  --benchmarks viewspatial
+```
+
+### Full EASI-8 suite (omit `--benchmarks`)
+```bash
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --output-dir eval_results_sensenova-u1-8b-mot \
+  --api-nproc 16
+```
+
+### Multiple benchmarks at once
+```bash
+python scripts/submissions/run_easi_eval.py \
+  --model SenseNova-U1-8B-MoT-Local \
+  --benchmarks viewspatial,blink,3dsrbench \
+  --api-nproc 16 \
+  --output-dir eval_results_sensenova-u1-8b-mot
+```
+
+### Benchmark keys (EASI-8 core)
+
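+`vsi_bench`, `mmsi_bench`, `mindcube_tiny`, `viewspatial`, `site_image`, `site_video`, `blink`, `3dsrbench`, `embspatial` (nine keys for eight benchmarks: `site_image` and `site_video` together form the `sitebench` group).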
+
+Aliases: `site` = `site_image + site_video`. Group name: `sitebench`. Extra (opt-in via `--include-extra`): `mmsi_video_bench`, `omnispatial_(manual_cot)`, `spar_bench`, `vsi_debiased`.
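To make the alias semantics concrete, here is a minimal sketch of how a comma-separated `--benchmarks` value could expand into concrete keys. It is an assumption-laden illustration, not the parser in `run_easi_eval.py`; `ALIASES` and `expand_benchmarks` are hypothetical names.

```python
from __future__ import annotations

# Only the alias stated in the text above; other aliases may exist upstream.
ALIASES = {"site": ["site_image", "site_video"]}


def expand_benchmarks(arg: str) -> list[str]:
    """Split a comma-separated value and expand aliases, dropping duplicates."""
    keys: list[str] = []
    for name in arg.split(","):
        name = name.strip()
        if not name:
            continue
        for key in ALIASES.get(name, [name]):
            if key not in keys:
                keys.append(key)
    return keys


if __name__ == "__main__":
    print(expand_benchmarks("viewspatial,site,blink"))
```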
+
+### Useful flags
+
+| Flag | Purpose |
+| :--- | :--- |
+| `--api-nproc N` | Concurrent HTTP requests to the LightLLM server. 16-32 on 8× H100 with `tp=8`. Lower if timeouts/500s |
+| `--no-judge` | Skip LLM-judge re-eval, trust `exact_matching` scores. Faster; less accurate for free-form answers |
+| `--rerun` | Force re-evaluation, bypass resume |
+| `--verbose` | Print per-sample model responses |
+| `--include-extra` | Also run the extra (non-EASI-8) benchmarks |
+| `--submit` | Push results to EASI leaderboard (requires `HF_TOKEN`) |
+| `--nproc N` | torchrun DP (for local-model backends only, ignored for API endpoints) |
+
+## Tweaking VLMEvalKit endpoint config
+
+Endpoints live in `config/sensenova_models.py` — a committed, in-repo Python module with the `partial(GPT4V, ...)` entries. Edit it, re-run `setup.sh` (idempotent, a few seconds if `--skip-easi`), and `supported_VLM` picks up the change on next interpreter start.
+
+```bash
+$EDITOR evaluation/easi/config/sensenova_models.py     # change max_tokens / temperature / URLs
+bash evaluation/easi/scripts/setup.sh --skip-easi       # propagate
+```
+
+Full schema of the `entries` dict and `GPT4V` kwarg reference (including `img_detail`, `timeout`, `system_prompt`, thinking-mode subclass pattern, remote-endpoint examples): [`GUIDE.md` §4](GUIDE.md#configuring-a-custom-openai-compatible-endpoint).
+
+The patch at `patches/easi_sensenova_config.patch` only adds a short `try`/`except`-guarded hook (`from .sensenova_models import entries as _sensenova_u1_entries; supported_VLM.update(_sensenova_u1_entries)`) to `VLMEvalKit/vlmeval/config.py`. You shouldn't need to touch it.
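The registry mechanism itself is plain Python: a dict of `functools.partial` objects merged into a larger dict, with construction deferred until a model is looked up and called. The sketch below shows that pattern with a stand-in class; `FakeEndpoint` and the `Demo-Local` key are hypothetical, and the real entries wrap `GPT4V` instead.

```python
from functools import partial


class FakeEndpoint:
    """Hypothetical stand-in for an API wrapper class (not vlmeval's GPT4V)."""

    def __init__(self, model, api_base, key="dummy", temperature=0, max_tokens=1024):
        self.model = model
        self.api_base = api_base
        self.key = key
        self.temperature = temperature
        self.max_tokens = max_tokens


# Stand-in for the registry dict that lives in vlmeval/config.py.
supported_VLM = {}

# Same shape as the `entries` dict in sensenova_models.py: name -> partial(...).
entries = {
    "Demo-Local": partial(
        FakeEndpoint,
        model="demo-model",
        api_base="http://localhost:8000/v1/chat/completions",
        max_tokens=8192,
    ),
}

# What the hook does once the import has succeeded.
supported_VLM.update(entries)

# Nothing is constructed until an entry is called; call-time kwargs override presets.
default_model = supported_VLM["Demo-Local"]()
hot_model = supported_VLM["Demo-Local"](temperature=0.7)
print(default_model.max_tokens, hot_model.temperature)
```

Since each value is only a recipe, adding or editing an entry has no effect on a running interpreter, which is why changes show up on the next start.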
+
+## Full guide
+
+See [`GUIDE.md`](GUIDE.md) — covers host prereqs, dependency filtering, patch workflow, VLMEvalKit wiring, thinking-mode handling, troubleshooting.
+
+## Image generation benchmarks
+
+Not covered here. Requires the full LightLLM + LightX2V stack (via the upstream Docker image `lightx2v/lightllm_lightx2v:20260407` — see [`../../docs/deployment_CN.md`](../../docs/deployment_CN.md)). LightX2V pins `torch<=2.8.0` which conflicts with LightLLM's `torch==2.9.1`, so image-gen serving lives in a separate environment.
--- a/SenseNova-U1/evaluation/easi/config/sensenova_models.py
+++ b/SenseNova-U1/evaluation/easi/config/sensenova_models.py
+"""Local SenseNova-U1 LightLLM endpoints for VLMEvalKit.
+
+Edit this file to tweak endpoint URLs, ports, max_tokens, temperature, retry,
+or to add new variants. Changes take effect the next time VLMEvalKit imports
+`vlmeval.config` (typically on the next invocation of `run_easi_eval.py`).
+
+Setup mechanism
+---------------
+`evaluation/easi/scripts/setup.sh` copies this file into the EASI submodule's
+VLMEvalKit as `vlmeval/sensenova_models.py`, then applies a small patch to
+`vlmeval/config.py` that does (inside a try/except guard):
+
+    from .sensenova_models import entries as _sensenova_u1_entries
+    supported_VLM.update(_sensenova_u1_entries)
+
+So any edits here just need a re-run of setup.sh (idempotent) to propagate.
+
+Port assignments MUST match `evaluation/easi/scripts/serve.sh`:
+    8b-mot -> 8000    (thinking/reasoning variant)
+"""
+
+from functools import partial
+
+# This import only resolves once setup.sh has copied this file into
+# evaluation/easi/EASI/VLMEvalKit/vlmeval/. Linter warnings in-tree are expected.
+from vlmeval.api.gpt import GPT4V  # type: ignore[import-not-found]
+
+entries = {
+    "SenseNova-U1-8B-MoT-Local": partial(
+        GPT4V,
+        model="sensenova-u1-8b-mot",
+        api_base="http://localhost:8000/v1/chat/completions",
+        key="dummy",
+        temperature=0,
+        max_tokens=8192,  # thinking mode needs headroom
+        retry=10,
+        verbose=False,
+    ),
+}
--- a/SenseNova-U1/evaluation/easi/lightllm-stack/patches/build_prompt_flatten_content.patch
+++ b/SenseNova-U1/evaluation/easi/lightllm-stack/patches/build_prompt_flatten_content.patch
+diff --git a/lightllm/server/build_prompt.py b/lightllm/server/build_prompt.py
+index 21c1cde6..843f2832 100644
+--- a/lightllm/server/build_prompt.py
+++ b/lightllm/server/build_prompt.py
+@@ -63,11 +63,37 @@ def _normalize_tool_call_arguments(messages: list) -> None:
+                             pass
+ 
+ 
+def _flatten_multimodal_content(messages: list) -> None:
+    # OpenAI-style chat requests may send content as a list of parts
+    # ({"type": "text", ...}, {"type": "image_url", ...}). Most HF chat
+    # templates (Qwen, LLaMA, ...) call string methods on `content` directly
+    # and crash on lists. Flatten into a plain string with placeholder tokens
+    # so the template can render; the image data itself is already extracted
+    # upstream via `_get_images_and_audios` and fed through MultimodalParams.
+    for msg in messages:
+        content = msg.get("content")
+        if not isinstance(content, list):
+            continue
+        parts: list = []
+        for item in content:
+            if not isinstance(item, dict):
+                continue
+            t = item.get("type")
+            if t == "text":
+                parts.append(item.get("text", "") or "")
+            elif t == "image_url":
+                parts.append("<image>")
+            elif t == "audio_url":
+                parts.append("<audio>")
+        msg["content"] = "\n".join(p for p in parts if p != "")
+
+
+ async def build_prompt(request, tools) -> str:
+     global tokenizer
+     # pydantic格式转成dict， 否则，当根据tokenizer_config.json拼template时，Jinja判断无法识别
+     messages = [m.model_dump(by_alias=True, exclude_none=True) for m in request.messages]
+     _normalize_tool_call_arguments(messages)
+    _flatten_multimodal_content(messages)
+ 
+     kwargs = {"conversation": messages}
+     if request.character_settings:
--- a/SenseNova-U1/evaluation/easi/patches/easi_sensenova_config.patch
+++ b/SenseNova-U1/evaluation/easi/patches/easi_sensenova_config.patch
+diff --git a/vlmeval/config.py b/vlmeval/config.py
+index e6ae86c..6f35947 100644
+--- a/vlmeval/config.py
+++ b/vlmeval/config.py
+@@ -2170,3 +2170,12 @@ model_groups.extend([bagel_series, spatial_related_models, sensenova_si_series])
+ 
+ for grp in model_groups:
+     supported_VLM.update(grp)
+
+# --- SenseNova-U1 local LightLLM endpoints (injected by SenseNova-U1 repo) ---
+# Edit: evaluation/easi/config/sensenova_models.py  (then re-run setup.sh)
+try:
+    from .sensenova_models import entries as _sensenova_u1_entries
+    supported_VLM.update(_sensenova_u1_entries)
+except Exception as _sensenova_u1_err:  # pragma: no cover
+    import sys as _sys
+    print(f"[sensenova-u1-config] failed to load local entries: {_sensenova_u1_err}", file=_sys.stderr)
--- a/SenseNova-U1/evaluation/easi/scripts/download_weights.sh
+++ b/SenseNova-U1/evaluation/easi/scripts/download_weights.sh
+#!/usr/bin/env bash
+# Download SenseNova-U1 weights from HuggingFace into <repo_root>/models/.
+#
+# Usage:
+#   bash evaluation/easi/scripts/download_weights.sh 8b-mot   # sensenova/SenseNova-U1-8B-MoT (reasoning)
+#
+# Requires: .venv-lightllm activated (has huggingface_hub installed).
+# First-time use: `uv pip install huggingface_hub` if `hf` command not found.
+# Optional: export HF_TOKEN=... if the repo gates downloads.
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
+VENV_DIR="${REPO_ROOT}/.venv-lightllm"
+MODELS_DIR="${REPO_ROOT}/models"
+mkdir -p "${MODELS_DIR}"
+
+# Auto-activate .venv-lightllm if it exists and we're not already in it.
+# Avoids picking up `hf` from an arbitrary venv that may lack hf_transfer etc.
+if [ -d "${VENV_DIR}" ] && { [ -z "${VIRTUAL_ENV:-}" ] || [ "${VIRTUAL_ENV}" != "${VENV_DIR}" ]; }; then
+  # shellcheck disable=SC1091
+  source "${VENV_DIR}/bin/activate"
+fi
+
+# If HF_HUB_ENABLE_HF_TRANSFER=1 is set but hf_transfer isn't importable, fall
+# back to plain HTTP downloads rather than crashing mid-file.
+if [ "${HF_HUB_ENABLE_HF_TRANSFER:-0}" = "1" ]; then
+  if ! python -c "import hf_transfer" >/dev/null 2>&1; then
+    echo "[warn] HF_HUB_ENABLE_HF_TRANSFER=1 but hf_transfer not installed — disabling for this run" >&2
+    unset HF_HUB_ENABLE_HF_TRANSFER
+  fi
+fi
+
+# Prefer the new `hf` CLI (huggingface_hub >= 0.34). Fall back to huggingface-cli.
+if command -v hf >/dev/null 2>&1; then
+  DL="hf download"
+elif command -v huggingface-cli >/dev/null 2>&1; then
+  DL="huggingface-cli download"
+else
+  echo "[error] neither 'hf' nor 'huggingface-cli' found. Activate .venv-lightllm and run: uv pip install huggingface_hub" >&2
+  exit 1
+fi
+
+download() {
+  local repo_id="$1"
+  local local_dir="${MODELS_DIR}/$(basename "${repo_id}")"
+  echo "[download] ${repo_id} -> ${local_dir}"
+  ${DL} "${repo_id}" --local-dir "${local_dir}"
+}
+
+target="${1:-8b-mot}"
+case "${target}" in
+  8b-mot) download "sensenova/SenseNova-U1-8B-MoT" ;;
+  *)
+    echo "[error] unknown target: ${target}. Use: 8b-mot" >&2
+    exit 1
+    ;;
+esac
+
+echo "[done] weights at ${MODELS_DIR}"
--- a/SenseNova-U1/evaluation/easi/scripts/lb.py
+++ b/SenseNova-U1/evaluation/easi/scripts/lb.py
+"""Least-in-flight HTTP load balancer for multiple LightLLM server replicas.
+
+Purpose: front N tp-sharded LightLLM instances behind a single OpenAI-compatible
+endpoint, so VLMEvalKit / EASI can treat them as one higher-throughput server
+without any client-side changes.
+
+Backend selection: least in-flight requests (ties go to the first backend listed in BACKENDS).
+Health checks: periodic GET /v1/models against each backend; unhealthy backends
+are skipped (but never permanently removed — they rejoin when they recover).
+Streaming: passthrough via httpx.stream() + StreamingResponse — long SSE works.
+
+Configuration (env):
+    BACKENDS          Comma-separated backend base URLs, e.g.
+                      "http://localhost:8000,http://localhost:8010"
+                      (REQUIRED)
+    LB_HOST           Bind address                      (default: 0.0.0.0)
+    LB_PORT           Listen port                       (default: 9000)
+    LB_REQUEST_TIMEOUT Seconds per proxied request     (default: 1800 = 30 min)
+    LB_HEALTH_INTERVAL Seconds between health probes   (default: 10)
+    LB_STARTUP_TIMEOUT Max wait for first healthy     (default: 600)
+
+Run via `serve.sh` (with DP>1) or directly:
+    BACKENDS=http://localhost:8000,http://localhost:8010 python lb.py
+"""
+
+from __future__ import annotations
+
+import asyncio
+import os
+import sys
+import time
+from contextlib import asynccontextmanager
+
+import httpx
+import uvicorn
+from fastapi import FastAPI, Request
+from fastapi.responses import JSONResponse, StreamingResponse
+
+BACKENDS: list[str] = [b.strip().rstrip("/") for b in os.environ.get("BACKENDS", "").split(",") if b.strip()]
+if not BACKENDS:
+    print("[lb] ERROR: BACKENDS env var required (comma-separated base URLs)", file=sys.stderr)
+    sys.exit(1)
+
+LB_HOST = os.environ.get("LB_HOST", "0.0.0.0")
+LB_PORT = int(os.environ.get("LB_PORT", "9000"))
+REQUEST_TIMEOUT = float(os.environ.get("LB_REQUEST_TIMEOUT", "1800"))
+HEALTH_INTERVAL = float(os.environ.get("LB_HEALTH_INTERVAL", "10"))
+STARTUP_TIMEOUT = float(os.environ.get("LB_STARTUP_TIMEOUT", "600"))
+
+# Per-backend state: in-flight count + health flag.
+inflight: dict[str, int] = {b: 0 for b in BACKENDS}
+healthy: dict[str, bool] = {b: False for b in BACKENDS}
+
+
+async def _probe_once(client: httpx.AsyncClient, backend: str) -> bool:
+    """Return True if the backend looks like a LightLLM server.
+
+    Pitfall: any random HTTP listener on the same port would answer, so we
+    can't just check "did the connection succeed". Instead, GET /v1/models
+    and require the response to look like it came from LightLLM: we key off
+    the `Server: hypercorn-h11` header which LightLLM emits (hypercorn is
+    its ASGI server); other OpenAI-compatible servers generally don't use
+    hypercorn.
+
+    When the Server header is missing or ambiguous, fall back to accepting
+    any response whose Content-Type is not HTML.
+    """
+    try:
+        # /v1/models currently 500s on LightLLM due to an upstream pydantic
+        # bug in ModelCard (owned_by=None). That's actually fine for our
+        # probe — we just need a response.
+        resp = await client.get(f"{backend}/v1/models", timeout=5.0)
+    except Exception:
+        # Connection errors, timeouts, and anything else all count as "down".
+        return False
+
+    # Require the Server header to mention hypercorn — distinguishes a real
+    # LightLLM from a stray process on the same port.
+    server_hdr = resp.headers.get("server", "").lower()
+    if "hypercorn" in server_hdr:
+        return True
+
+    # Fallback: some reverse proxies strip Server headers. Accept any response
+    # (whatever its status code) whose Content-Type is not HTML, i.e. anything
+    # that doesn't look like a page from a static file server.
+    ctype = resp.headers.get("content-type", "").lower()
+    if "html" in ctype:
+        return False
+    return True
+
+
+async def health_loop(client: httpx.AsyncClient):
+    """Periodically probe each backend; flip the healthy dict."""
+    while True:
+        for b in BACKENDS:
+            ok = await _probe_once(client, b)
+            if ok != healthy[b]:
+                print(f"[lb] backend {b}: {'UP' if ok else 'DOWN'}", flush=True)
+            healthy[b] = ok
+        await asyncio.sleep(HEALTH_INTERVAL)
+
+
+async def wait_for_first_healthy(client: httpx.AsyncClient):
+    """Block until at least one backend passes a probe, or STARTUP_TIMEOUT."""
+    deadline = time.monotonic() + STARTUP_TIMEOUT
+    print(f"[lb] waiting for at least one healthy backend (timeout {STARTUP_TIMEOUT:.0f}s)...", flush=True)
+    while time.monotonic() < deadline:
+        for b in BACKENDS:
+            if await _probe_once(client, b):
+                healthy[b] = True
+                print(f"[lb] first healthy backend: {b}", flush=True)
+                return
+        await asyncio.sleep(2.0)
+    print(
+        f"[lb] WARNING: no backend became healthy within {STARTUP_TIMEOUT:.0f}s — serving anyway",
+        file=sys.stderr,
+        flush=True,
+    )
+
+
+def pick_backend() -> str | None:
+    """Least in-flight among healthy backends; ties broken by first-seen order."""
+    candidates = [b for b in BACKENDS if healthy[b]]
+    if not candidates:
+        # All unhealthy — fall back to any backend (will error back to client).
+        candidates = BACKENDS
+    return min(candidates, key=lambda b: inflight[b])
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    # One shared client used for both proxying and health checks; large pool
+    # so we can overlap many vlmevalkit workers.
+    limits = httpx.Limits(max_keepalive_connections=128, max_connections=256)
+    client = httpx.AsyncClient(timeout=httpx.Timeout(REQUEST_TIMEOUT), limits=limits)
+    app.state.client = client
+
+    await wait_for_first_healthy(client)
+    health_task = asyncio.create_task(health_loop(client))
+
+    try:
+        yield
+    finally:
+        health_task.cancel()
+        await client.aclose()
+
+
+app = FastAPI(lifespan=lifespan)
+
+
+@app.get("/_lb/status")
+async def lb_status():
+    return {
+        "backends": BACKENDS,
+        "healthy": healthy,
+        "inflight": inflight,
+    }
+
+
+@app.api_route("/{full_path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS", "HEAD"])
+async def proxy(full_path: str, request: Request):
+    backend = pick_backend()
+    if backend is None:
+        return JSONResponse({"error": "no backends configured"}, status_code=503)
+
+    url = f"{backend}/{full_path}"
+    if request.url.query:
+        url = f"{url}?{request.url.query}"
+
+    # Strip hop-by-hop + host headers; pass the rest through.
+    drop = {"host", "content-length", "connection", "transfer-encoding"}
+    fwd_headers = {k: v for k, v in request.headers.items() if k.lower() not in drop}
+
+    body = await request.body()
+
+    client: httpx.AsyncClient = request.app.state.client
+    inflight[backend] += 1
+    try:
+        upstream_req = client.build_request(
+            request.method,
+            url,
+            content=body if body else None,
+            headers=fwd_headers,
+        )
+        upstream_resp = await client.send(upstream_req, stream=True)
+    except Exception as e:
+        inflight[backend] -= 1
+        return JSONResponse({"error": f"backend {backend} unreachable: {e}"}, status_code=502)
+
+    async def body_iter():
+        try:
+            async for chunk in upstream_resp.aiter_raw():
+                yield chunk
+        finally:
+            await upstream_resp.aclose()
+            inflight[backend] -= 1
+
+    # Filter response headers (don't let chunked/transfer-encoding pass through verbatim).
+    out_headers = {k: v for k, v in upstream_resp.headers.items() if k.lower() not in drop}
+
+    return StreamingResponse(
+        body_iter(),
+        status_code=upstream_resp.status_code,
+        headers=out_headers,
+        media_type=upstream_resp.headers.get("content-type"),
+    )
+
+
+if __name__ == "__main__":
+    print(f"[lb] backends: {BACKENDS}", flush=True)
+    print(f"[lb] listening on http://{LB_HOST}:{LB_PORT}  (status: /_lb/status)", flush=True)
+    uvicorn.run(app, host=LB_HOST, port=LB_PORT, log_level="info", access_log=False)
--- a/SenseNova-U1/evaluation/easi/scripts/serve.sh
+++ b/SenseNova-U1/evaluation/easi/scripts/serve.sh
+#!/usr/bin/env bash
+# Launch LightLLM OpenAI-compatible server(s) for SenseNova-U1.
+#
+# Two modes, picked automatically by DP:
+#
+#   DP=1 (default)  — single LightLLM instance bound directly to the canonical
+#                     per-model port (8b-mot → 8000).
+#                     No load balancer, no extra process.
+#
+#   DP>1            — DP tp-sharded replicas on backend ports (8100+, step 10)
+#                     + a Python load balancer on the canonical port.
+#                     Clients always hit the same port regardless of DP, so
+#                     VLMEvalKit config.py never needs to change.
+#
+# Total GPUs used = DP × TP, assigned contiguously from GPU 0 unless GPUS set.
+#
+# Usage:
+#   bash evaluation/easi/scripts/serve.sh                               # DP=1, 8b-mot, GPUs 0-1, port 8000
+#   TP=8 GPUS=0,1,2,3,4,5,6,7 bash evaluation/easi/scripts/serve.sh     # single big instance
+#   DP=4 TP=2 bash evaluation/easi/scripts/serve.sh                     # 4 replicas × tp=2, LB on 8000
+#
+# Env vars:
+#   MODEL               8b-mot                             (default: 8b-mot)
+#   MODEL_DIR           explicit path, overrides MODEL     (default: ./models/SenseNova-U1-8B-MoT)
+#   DP                  # of replicas                      (default: 1)
+#   TP                  tensor parallel degree / replica   (default: 2)
+#   GPUS                CSV of CUDA_VISIBLE_DEVICES ids    (default: 0,1,...,DP*TP-1)
+#   HOST                bind address                       (default: 0.0.0.0)
+#   LB_PORT             canonical client-facing port       (default: 8000)
+#                       (alias: PORT — honored for backcompat when DP=1)
+#   BACKEND_BASE_PORT   first backend port when DP>1       (default: 8100, step 10)
+#   MAX_LEN             --max_req_total_len                (default: 32768)
+#   MEM_FRAC            --mem_fraction                     (default: 0.85)
+#   MODEL_NAME          advertised model name              (default: per-model)
+#   REASONING           --reasoning_parser                 (default: qwen3)
+#                         qwen3: strips <think>...</think> into reasoning_content
+#                         qwen3-thinking: force reasoning even w/o <think> tag
+#                         "" (empty): disable parser (raw content)
+#   NO_AUTO_DL          1 = skip weight auto-download      (default: unset)
+#   DETAIL_LOG          1 = --detail_log (per-request DEBUG: timing, prompt, tokens, ViT)
+#   LIGHTLLM_LOG_LEVEL  debug|info|warning|error           (default: info)
+#   LOG_DIR             dir for replica log files (DP>1)   (default: evaluation/easi/logs/)
+#
+# Guardrails:
+#   - DP*TP > #visible GPUs → fail fast (or set GPUS=... explicitly)
+#   - GPUS count must equal DP*TP
+#   - Any port (LB + backends) already in use → fail fast with owner PID when discoverable
+#   - DP=1 vs DP>1 are mutually exclusive on the canonical port — don't run both
+#     concurrently for the same MODEL
+#
+# Notes:
+#   - Ctrl-C / SIGTERM cascade-kills all replicas + LB when DP>1.
+#   - Per-replica logs at $LOG_DIR/lightllm-<MODEL>-<port>.log (DP>1 only).
+#   - LB passes through health probes + streaming; least in-flight balancing.
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+EASI_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
+REPO_ROOT="$(cd "${EASI_ROOT}/../.." && pwd)"
+cd "${REPO_ROOT}"
+
+VENV_DIR="${REPO_ROOT}/.venv-lightllm"
+
+# ---------------------------------------------------------------------------
+# 1) Resolve model defaults
+# ---------------------------------------------------------------------------
+MODEL="${MODEL:-8b-mot}"
+case "${MODEL}" in
+  8b-mot)
+    DEFAULT_DIR="${REPO_ROOT}/models/SenseNova-U1-8B-MoT"
+    DEFAULT_LB_PORT=8000
+    DEFAULT_BACKEND_BASE=8100
+    DEFAULT_MODEL_NAME="sensenova-u1-8b-mot"
+    DEFAULT_REASONING="qwen3"
+    ;;
+  *)
+    echo "[error] MODEL must be '8b-mot' (got: ${MODEL})" >&2
+    exit 1
+    ;;
+esac
+
+MODEL_DIR="${MODEL_DIR:-${DEFAULT_DIR}}"
+DP="${DP:-1}"
+TP="${TP:-2}"
+HOST="${HOST:-0.0.0.0}"
+# PORT is kept as an alias for LB_PORT for backcompat. LB_PORT wins if both set.
+LB_PORT="${LB_PORT:-${PORT:-${DEFAULT_LB_PORT}}}"
+BACKEND_BASE_PORT="${BACKEND_BASE_PORT:-${DEFAULT_BACKEND_BASE}}"
+MAX_LEN="${MAX_LEN:-32768}"
+MEM_FRAC="${MEM_FRAC:-0.85}"
+MODEL_NAME="${MODEL_NAME:-${DEFAULT_MODEL_NAME}}"
+REASONING="${REASONING-${DEFAULT_REASONING}}"    # unset → default; "" → user disable
+DETAIL_LOG="${DETAIL_LOG:-0}"
+LIGHTLLM_LOG_LEVEL="${LIGHTLLM_LOG_LEVEL:-info}"
+LOG_DIR="${LOG_DIR:-${EASI_ROOT}/logs}"
+
+# ---------------------------------------------------------------------------
+# 2) Integer validation
+# ---------------------------------------------------------------------------
+case "${DP}${TP}${LB_PORT}${BACKEND_BASE_PORT}" in *[!0-9]*)
+  echo "[error] DP, TP, LB_PORT, BACKEND_BASE_PORT must all be integers" >&2
+  exit 1
+  ;;
+esac
+if [ "${DP}" -lt 1 ] || [ "${TP}" -lt 1 ]; then
+  echo "[error] DP and TP must be >= 1 (got DP=${DP} TP=${TP})" >&2
+  exit 1
+fi
+
+# When DP>1, LB_PORT must not overlap the backend range.
+if [ "${DP}" -gt 1 ]; then
+  backend_end=$((BACKEND_BASE_PORT + 10 * (DP - 1)))
+  if [ "${LB_PORT}" -ge "${BACKEND_BASE_PORT}" ] && [ "${LB_PORT}" -le "${backend_end}" ] \
+     && [ $(( (LB_PORT - BACKEND_BASE_PORT) % 10 )) -eq 0 ]; then
+    echo "[error] LB_PORT=${LB_PORT} collides with backend port range ${BACKEND_BASE_PORT}..${backend_end} (step 10)" >&2
+    echo "        move LB_PORT or BACKEND_BASE_PORT so they don't overlap" >&2
+    exit 1
+  fi
+fi
+
+# ---------------------------------------------------------------------------
+# 3) GPU sanity
+# ---------------------------------------------------------------------------
+NEED=$(( DP * TP ))
+
+if [ -n "${GPUS:-}" ]; then
+  gpu_count="$(echo "${GPUS}" | awk -F, '{print NF}')"
+  if [ "${gpu_count}" != "${NEED}" ]; then
+    echo "[error] GPUS has ${gpu_count} entries but DP*TP=${NEED}" >&2
+    echo "        provide exactly ${NEED} GPU IDs, or unset GPUS to auto-assign" >&2
+    exit 1
+  fi
+  IFS=',' read -r -a GPU_ARR <<< "${GPUS}"
+else
+  if ! command -v nvidia-smi >/dev/null 2>&1; then
+    echo "[error] nvidia-smi not found — can't auto-detect GPUs. Set GPUS=..." >&2
+    exit 1
+  fi
+  avail="$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l | tr -d ' ')"
+  if [ "${avail}" -lt "${NEED}" ]; then
+    echo "[error] DP*TP = ${NEED} GPUs required but only ${avail} visible to nvidia-smi" >&2
+    echo "        reduce DP or TP, or set GPUS=... to a subset of available GPUs" >&2
+    exit 1
+  fi
+  GPU_ARR=()
+  for i in $(seq 0 $((NEED - 1))); do GPU_ARR+=("${i}"); done
+fi
+
+# ---------------------------------------------------------------------------
+# 4) Venv activation
+# ---------------------------------------------------------------------------
+if [ ! -d "${VENV_DIR}" ]; then
+  echo "[error] venv not found at ${VENV_DIR}" >&2
+  echo "        run: bash evaluation/easi/scripts/setup.sh" >&2
+  exit 1
+fi
+if [ -z "${VIRTUAL_ENV:-}" ] || [ "${VIRTUAL_ENV}" != "${VENV_DIR}" ]; then
+  echo "[serve] activating ${VENV_DIR}"
+  # shellcheck disable=SC1091
+  source "${VENV_DIR}/bin/activate"
+fi
+
+# ---------------------------------------------------------------------------
+# 5) Weight check (once, before any replica forks)
+# ---------------------------------------------------------------------------
+if [ ! -f "${MODEL_DIR}/config.json" ]; then
+  if [ "${NO_AUTO_DL:-0}" = "1" ]; then
+    echo "[error] ${MODEL_DIR}/config.json missing" >&2
+    echo "        set NO_AUTO_DL=0 or run: bash evaluation/easi/scripts/download_weights.sh ${MODEL}" >&2
+    exit 1
+  fi
+  echo "[serve] config.json missing at ${MODEL_DIR} — downloading ${MODEL}..."
+  bash "${SCRIPT_DIR}/download_weights.sh" "${MODEL}"
+  if [ ! -f "${MODEL_DIR}/config.json" ]; then
+    echo "[error] download appears to have failed — still no ${MODEL_DIR}/config.json" >&2
+    echo "        check HF_TOKEN, network, and org membership for SenseNova" >&2
+    exit 1
+  fi
+fi
+
+# ---------------------------------------------------------------------------
+# 6) Pre-flight port probe
+# ---------------------------------------------------------------------------
+# Catches stale servers on any port we plan to bind. Avoids the
+# "replica crash-binds, but zombie on port fakes LB health" failure mode.
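+port_in_use() {
+  local port="$1"
+  # Try a TCP connect inside a subshell. fd 3 only exists in that subshell and
+  # is closed when it exits, so no explicit close is needed in the parent shell.
+  if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
+    return 0
+  fi
+  return 1
+}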
+
+ports_to_check=()
+if [ "${DP}" -eq 1 ]; then
+  # Single-instance mode: bind the canonical port directly.
+  ports_to_check+=("${LB_PORT}")
+else
+  # Multi-replica mode: LB on canonical + each backend.
+  ports_to_check+=("${LB_PORT}")
+  for i in $(seq 0 $((DP - 1))); do
+    ports_to_check+=($((BACKEND_BASE_PORT + i * 10)))
+  done
+fi
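+# Worked example of the port layout computed above, using illustrative values
+# (assumed for this comment only, not necessarily this script's defaults):
+#   LB_PORT=8000 BACKEND_BASE_PORT=8010 DP=3
+#   -> ports_to_check=(8000 8010 8020 8030)
+# Backends are spaced 10 apart; with DP=1 only LB_PORT is probed.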
+
+busy=()
+for p in "${ports_to_check[@]}"; do
+  port_in_use "${p}" && busy+=("${p}")
+done
+
+if [ "${#busy[@]}" -gt 0 ]; then
+  echo "[error] port(s) already in use: ${busy[*]}" >&2
+  for p in "${busy[@]}"; do
+    echo "  port ${p}:" >&2
+    ss -lntp 2>/dev/null | awk -v pat=":${p} " '$4 ~ pat {print "    " $0}' >&2 \
+      || lsof -i":${p}" 2>/dev/null | sed 's/^/    /' >&2 \
+      || echo "    (couldn't identify owner; try: sudo lsof -i:${p} or sudo ss -lntp | grep :${p})" >&2
+  done
+  echo "[error] stop the conflicting process(es) first, or override LB_PORT / BACKEND_BASE_PORT" >&2
+  exit 1
+fi
+
+# ---------------------------------------------------------------------------
+# 7) Helper: launch one LightLLM instance (blocking unless backgrounded by caller)
+# ---------------------------------------------------------------------------
+launch_lightllm() {
+  local gpus="$1" port="$2"
+  local extra=()
+  [ -n "${REASONING}" ] && extra+=(--reasoning_parser "${REASONING}")
+  [ "${DETAIL_LOG}" = "1" ] && extra+=(--detail_log)
+  env CUDA_VISIBLE_DEVICES="${gpus}" \
+      LIGHTLLM_LOG_LEVEL="${LIGHTLLM_LOG_LEVEL}" \
+      python -m lightllm.server.api_server \
+      --model_dir "${MODEL_DIR}" \
+      --model_name "${MODEL_NAME}" \
+      --model_owner "sensenova" \
+      --host "${HOST}" \
+      --port "${port}" \
+      --tp "${TP}" \
+      --max_req_total_len "${MAX_LEN}" \
+      --mem_fraction "${MEM_FRAC}" \
+      --trust_remote_code \
+      "${extra[@]}"
+}
+
+# ---------------------------------------------------------------------------
+# 8) Dispatch: single-instance or multi-replica + LB
+# ---------------------------------------------------------------------------
+if [ "${DP}" -eq 1 ]; then
+  # --- single instance, direct on LB_PORT ---
+  mkdir -p "${LOG_DIR}"
+  PIDFILE="${LOG_DIR}/serve.${MODEL}.pids"
+  set -m
+
+  # Self-heal stale PIDs from prior run (same as multi-replica path).
+  if [ -f "${PIDFILE}" ]; then
+    stale="$(cat "${PIDFILE}" 2>/dev/null | tr '\n' ' ')"
+    for pg in ${stale}; do
+      [ -z "${pg}" ] && continue
+      kill -TERM -"${pg}" 2>/dev/null || true
+    done
+    sleep 1
+    for pg in ${stale}; do
+      [ -z "${pg}" ] && continue
+      kill -KILL -"${pg}" 2>/dev/null || true
+    done
+    pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
+    rm -f "${PIDFILE}"
+  fi
+
+  echo "[serve] model=${MODEL_NAME} dir=${MODEL_DIR}"
+  echo "[serve] GPUS=${GPUS:-$(IFS=,; echo "${GPU_ARR[*]}")} TP=${TP} port=${LB_PORT} reasoning=${REASONING:-off} detail_log=${DETAIL_LOG} log_level=${LIGHTLLM_LOG_LEVEL}"
+
+  _gpus_csv="$(IFS=,; echo "${GPU_ARR[*]}")"
+  ( launch_lightllm "${_gpus_csv}" "${LB_PORT}" ) &
+  LIGHTLLM_PID=$!
+  echo "${LIGHTLLM_PID}" > "${PIDFILE}"
+
+  _cleaning_up=0
+  cleanup() {
+    [ "${_cleaning_up}" = "1" ] && return
+    _cleaning_up=1
+    echo ""
+    echo "[serve] shutting down..."
+    kill -TERM -"${LIGHTLLM_PID}" 2>/dev/null || kill -TERM "${LIGHTLLM_PID}" 2>/dev/null || true
+    local waited=0
+    while [ "${waited}" -lt 10 ]; do
+      kill -0 "${LIGHTLLM_PID}" 2>/dev/null || break
+      sleep 1; waited=$((waited + 1))
+    done
+    kill -KILL -"${LIGHTLLM_PID}" 2>/dev/null || true
+    kill -KILL "${LIGHTLLM_PID}" 2>/dev/null || true
+    pkill -KILL -P $$ 2>/dev/null || true
+    pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
+    rm -f "${PIDFILE}"
+    echo "[serve] cleanup done"
+  }
+  trap cleanup EXIT INT TERM HUP
+
+  wait "${LIGHTLLM_PID}"
+  exit $?
+fi
+
+# --- multi-replica + LB ---
+mkdir -p "${LOG_DIR}"
+PIDFILE="${LOG_DIR}/serve.${MODEL}.pids"
+
+# Enable job control so each backgrounded subshell gets its own process group
+# (pgid == subshell pid). We kill the whole group on cleanup — catches every
+# LightLLM worker / router / zmq / visual server / detokenizer child.
+set -m
+
+# ---------------------------------------------------------------------------
+# Self-heal: if a previous run left a PID file, try to clean its processes
+# ---------------------------------------------------------------------------
+if [ -f "${PIDFILE}" ]; then
+  echo "[serve] stale PID file found: ${PIDFILE}"
+  stale_pgs="$(cat "${PIDFILE}" 2>/dev/null | tr '\n' ' ')"
+  for pg in ${stale_pgs}; do
+    [ -z "${pg}" ] && continue
+    # Negative PID = signal entire process group
+    kill -TERM -"${pg}" 2>/dev/null || true
+  done
+  sleep 1
+  for pg in ${stale_pgs}; do
+    [ -z "${pg}" ] && continue
+    kill -KILL -"${pg}" 2>/dev/null || true
+  done
+  # Belt + suspenders: kill any lightllm server still tied to this MODEL_DIR
+  pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
+  rm -f "${PIDFILE}"
+  echo "[serve] stale processes cleaned; continuing..."
+fi
+
+echo "[serve] multi-replica mode: DP=${DP} TP=${TP} total_gpus=${NEED}"
+echo "[serve] GPU assignment: ${GPU_ARR[*]}"
+echo "[serve] backend ports (step 10): $(for i in $(seq 0 $((DP-1))); do echo -n "$((BACKEND_BASE_PORT + i*10)) "; done)"
+echo "[serve] LB listens on :${LB_PORT}  (canonical port for ${MODEL})"
+echo "[serve] replica logs: ${LOG_DIR}/"
+echo "[serve] reasoning=${REASONING:-off} detail_log=${DETAIL_LOG} log_level=${LIGHTLLM_LOG_LEVEL}"
+
+REPLICA_PIDS=()    # pid of each backgrounded subshell (== pgid with job control)
+BACKENDS=()
+for i in $(seq 0 $((DP - 1))); do
+  port=$((BACKEND_BASE_PORT + i * 10))
+  start=$((i * TP))
+  end=$((start + TP - 1))
+  gpus=""
+  for j in $(seq "${start}" "${end}"); do
+    gpus="${gpus}${gpus:+,}${GPU_ARR[$j]}"
+  done
+  log="${LOG_DIR}/lightllm-${MODEL}-${port}.log"
+  echo "[serve] launching replica $((i+1))/${DP}: GPUS=${gpus} PORT=${port} → ${log}"
+  # With `set -m` above, each backgrounded subshell gets its own pgid (== $!).
+  # LightLLM's fork-spawned children inherit that pgid, so `kill -- -$pid` on
+  # cleanup hits the whole tree (router, tp workers, visual server, zmq, ...).
+  ( launch_lightllm "${gpus}" "${port}" ) >"${log}" 2>&1 &
+  REPLICA_PIDS+=("$!")
+  BACKENDS+=("http://localhost:${port}")
+done
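+# Worked example of the GPU slicing in the loop above, using illustrative
+# values (assumed for this comment only): GPU_ARR=(0 1 2 3), DP=2, TP=2.
+#   replica 0: start=0 end=1 -> CUDA_VISIBLE_DEVICES=0,1 on BACKEND_BASE_PORT
+#   replica 1: start=2 end=3 -> CUDA_VISIBLE_DEVICES=2,3 on BACKEND_BASE_PORT+10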
+
+BACKENDS_CSV="$(IFS=,; echo "${BACKENDS[*]}")"
+
+# Persist PIDs/PGIDs for self-heal on next run.
+{ for pid in "${REPLICA_PIDS[@]}"; do echo "${pid}"; done; } > "${PIDFILE}"
+
+# ---------------------------------------------------------------------------
+# Hardened cleanup: SIGTERM process groups → wait → SIGKILL → pattern sweep
+# ---------------------------------------------------------------------------
+_cleaning_up=0
+cleanup() {
+  [ "${_cleaning_up}" = "1" ] && return
+  _cleaning_up=1
+  echo ""
+  echo "[serve] shutting down..."
+
+  local all_pids=("${REPLICA_PIDS[@]}" "${LB_PID:-}")
+
+  # 1) SIGTERM the entire process group of each replica + LB.
+  for pid in "${all_pids[@]}"; do
+    [ -z "${pid}" ] && continue
+    kill -TERM -"${pid}" 2>/dev/null \
+      || kill -TERM "${pid}" 2>/dev/null || true
+  done
+
+  # 2) Poll up to 10s for graceful exit.
+  local waited=0
+  while [ "${waited}" -lt 10 ]; do
+    local alive=0
+    for pid in "${all_pids[@]}"; do
+      [ -z "${pid}" ] && continue
+      kill -0 "${pid}" 2>/dev/null && { alive=1; break; }
+    done
+    [ "${alive}" = "0" ] && break
+    sleep 1
+    waited=$((waited + 1))
+  done
+
+  # 3) SIGKILL any survivors (groups, then individual pids).
+  for pid in "${all_pids[@]}"; do
+    [ -z "${pid}" ] && continue
+    kill -KILL -"${pid}" 2>/dev/null || true
+    kill -KILL "${pid}" 2>/dev/null || true
+  done
+
+  # 4) Belt + suspenders: sweep any remaining children of this shell and any
+  #    lightllm procs tied to our MODEL_DIR (catches orphans that escaped pg).
+  pkill -KILL -P $$ 2>/dev/null || true
+  pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
+  pkill -KILL -f "${SCRIPT_DIR}/lb.py" 2>/dev/null || true
+
+  rm -f "${PIDFILE}"
+  echo "[serve] cleanup done"
+}
+# EXIT covers normal exit + unexpected errors (set -e); INT/TERM/HUP covers signals.
+trap cleanup EXIT INT TERM HUP
+
+echo "[serve] starting LB (backends: ${BACKENDS_CSV})"
+( env BACKENDS="${BACKENDS_CSV}" LB_PORT="${LB_PORT}" python "${SCRIPT_DIR}/lb.py" ) &
+LB_PID=$!
+echo "${LB_PID}" >> "${PIDFILE}"
+
+wait "${LB_PID}"
+# EXIT trap fires cleanup
--- a/SenseNova-U1/evaluation/easi/scripts/serve_lb.sh
+++ b/SenseNova-U1/evaluation/easi/scripts/serve_lb.sh
+#!/usr/bin/env bash
+# Deprecated: merged into serve.sh. serve.sh now handles both DP=1 (direct)
+# and DP>1 (multi-replica + LB) based on the DP env var.
+#
+# This shim forwards to serve.sh with a warning. Update your commands:
+#
+#   Old:  DP=4 TP=2 bash evaluation/easi/scripts/serve_lb.sh
+#   New:  DP=4 TP=2 bash evaluation/easi/scripts/serve.sh
+set -euo pipefail
+
+echo "[serve_lb] DEPRECATED: use serve.sh instead — it now supports DP>1 natively." >&2
+echo "[serve_lb] forwarding to serve.sh with the same env..." >&2
+exec bash "$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/serve.sh" "$@"
--- a/SenseNova-U1/evaluation/easi/scripts/setup.sh
+++ b/SenseNova-U1/evaluation/easi/scripts/setup.sh
+#!/usr/bin/env bash
+# Set up everything needed to run SenseNova-U1 visual understanding benchmarks:
+# LightLLM serving stack + EASI benchmark client + VLMEvalKit endpoint
+# registration.
+#
+# Idempotent: safe to re-run. Each step checks whether it already ran.
+#
+# What it does (full run):
+#   1. Sanity-checks host prereqs (uv, libnuma, nvidia driver).             [lightllm]
+#   2. Initializes submodules (LightLLM, EASI, EASI/VLMEvalKit, EASI/lmms-eval).
+#   3. Creates .venv-lightllm/ Python 3.10 venv at the repo root.           [lightllm]
+#   4. Installs pinned LightLLM deps (strips unbuildable nixl + cchardet).  [lightllm]
+#   5. Installs vllm 0.16.0 + LightLLM editable + pandas.                   [lightllm]
+#   6. Applies local patches from evaluation/easi/lightllm-stack/patches/.  [lightllm]
+#   7. Verifies LightLLM imports + api_server CLI.                          [lightllm]
+#   8. Installs EASI benchmark client venv (delegates to EASI's own setup.sh,
+#      which creates evaluation/easi/EASI/.venv with Python 3.11 and installs
+#      both VLMEvalKit and lmms-eval backends + flash-attn).
+#   9. Registers SenseNova-U1 endpoints in VLMEvalKit:
+#      - copies evaluation/easi/config/sensenova_models.py into
+#        evaluation/easi/EASI/VLMEvalKit/vlmeval/sensenova_models.py
+#      - applies evaluation/easi/patches/easi_sensenova_config.patch (7-line hook).
+#      Edit config/sensenova_models.py then re-run this script to propagate.
+#      Point `api_base` at your own endpoint (localhost, docker host, remote infra
+#      team endpoint, etc.) — nothing in EASI itself assumes the server is local.
+#
+# Flags:
+#   --skip-lightllm  skip steps 1, 3-7 — DON'T install the LightLLM serving stack
+#                    (use when you already have a SenseNova-U1 OpenAI-compatible
+#                    endpoint elsewhere: docker, remote infra team, etc.)
+#   --skip-easi      skip step 8 (EASI client venv install — slow, builds flash-attn)
+#   --skip-register  skip step 9 (VLMEvalKit endpoint registration)
+#
+# Usage:
+#   bash evaluation/easi/scripts/setup.sh                         # full install
+#   bash evaluation/easi/scripts/setup.sh --skip-lightllm         # bring your own endpoint
+#   bash evaluation/easi/scripts/setup.sh --skip-easi             # lightllm only
+#   bash evaluation/easi/scripts/setup.sh --skip-register         # no config.py patch
+set -euo pipefail
+
+SKIP_LIGHTLLM=0
+SKIP_EASI=0
+SKIP_REGISTER=0
+for arg in "$@"; do
+  case "${arg}" in
+    --skip-lightllm) SKIP_LIGHTLLM=1 ;;
+    --skip-easi)     SKIP_EASI=1 ;;
+    --skip-register) SKIP_REGISTER=1 ;;
+    -h|--help)
+      # Print the header comment only: skip the shebang and the `set -euo` line itself.
+      sed -n '2,/^set -euo/{/^set -euo/!p;}' "${BASH_SOURCE[0]}" | sed 's/^# \{0,1\}//'
+      exit 0
+      ;;
+    *)
+      echo "[error] unknown flag: ${arg}" >&2
+      exit 1
+      ;;
+  esac
+done
+
+if [ "${SKIP_LIGHTLLM}" = "1" ] && [ "${SKIP_EASI}" = "1" ] && [ "${SKIP_REGISTER}" = "1" ]; then
+  echo "[error] --skip-lightllm + --skip-easi + --skip-register leaves nothing to do" >&2
+  exit 1
+fi
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+EASI_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
+REPO_ROOT="$(cd "${EASI_ROOT}/../.." && pwd)"
+
+STACK_DIR="${EASI_ROOT}/lightllm-stack"
+LIGHTLLM_DIR="${STACK_DIR}/LightLLM"
+VENV_DIR="${REPO_ROOT}/.venv-lightllm"
+PATCHES_DIR="${STACK_DIR}/patches"
+REQ_OUT="/tmp/lightllm-req.txt"
+
+EASI_DIR="${EASI_ROOT}/EASI"
+EASI_VENV="${EASI_DIR}/.venv"
+EASI_VLMEVAL_DIR="${EASI_DIR}/VLMEvalKit"
+EASI_PATCHES_DIR="${EASI_ROOT}/patches"
+EASI_CONFIG_DIR="${EASI_ROOT}/config"
+
+log() { echo "[setup] $*"; }
+err() { echo "[error] $*" >&2; }
+
+# -------------------------------------------------------------------------
+# 1) Host prereqs (LightLLM-side only)
+# -------------------------------------------------------------------------
+if [ "${SKIP_LIGHTLLM}" = "1" ]; then
+  log "skipping LightLLM setup (--skip-lightllm)"
+  log "  (assumes you already have a SenseNova-U1 OpenAI-compatible endpoint —"
+  log "   point config/sensenova_models.py at it, then re-run this script to propagate)"
+
+  # Still need uv for the EASI venv delegation below. git is assumed to be
+  # present for the submodule init in step 2; it is not checked here.
+  if [ "${SKIP_EASI}" = "0" ] && ! command -v uv >/dev/null 2>&1; then
+    err "uv not found. Install: https://docs.astral.sh/uv/getting-started/installation/"
+    exit 1
+  fi
+else
+  log "checking host prereqs..."
+
+  if ! command -v uv >/dev/null 2>&1; then
+    err "uv not found. Install: https://docs.astral.sh/uv/getting-started/installation/"
+    exit 1
+  fi
+
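+  # Avoid `grep -q` here: it exits on the first match, which can SIGPIPE ldconfig
+  # and fail the pipeline under `set -o pipefail` even though the library exists.
+  if ! ldconfig -p 2>/dev/null | grep -F 'libnuma.so.1' >/dev/null; then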
+    err "libnuma.so.1 not found. Install system package:"
+    err "  sudo apt-get install -y libnuma1 libnuma-dev"
+    exit 1
+  fi
+
+  if ! command -v nvidia-smi >/dev/null 2>&1; then
+    err "nvidia-smi not found. NVIDIA driver required."
+    exit 1
+  fi
+
+  driver_ok="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1 | awk -F. '{print ($1 >= 550)}')"
+  if [ "${driver_ok}" != "1" ]; then
+    err "NVIDIA driver < 550.x detected. torch 2.9.1+cu128 requires >= 550."
+    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
+    exit 1
+  fi
+fi
+
+# -------------------------------------------------------------------------
+# 2) Initialize submodules
+# -------------------------------------------------------------------------
+# With --skip-lightllm we only need EASI (+ its nested VLMEvalKit/lmms-eval).
+# Otherwise init everything recursively.
+if [ "${SKIP_LIGHTLLM}" = "1" ]; then
+  log "initializing EASI submodule (skipping LightLLM)..."
+  (cd "${REPO_ROOT}" && git submodule update --init --recursive evaluation/easi/EASI)
+else
+  log "initializing submodules (recursive)..."
+  (cd "${REPO_ROOT}" && git submodule update --init --recursive)
+
+  if [ ! -f "${LIGHTLLM_DIR}/setup.py" ]; then
+    err "LightLLM submodule still empty after init: ${LIGHTLLM_DIR}"
+    err "  run: cd ${REPO_ROOT} && git submodule status"
+    exit 1
+  fi
+fi
+
+# -------------------------------------------------------------------------
+# 3) Create LightLLM venv (.venv-lightllm)
+# -------------------------------------------------------------------------
+if [ "${SKIP_LIGHTLLM}" = "0" ]; then
+  if [ ! -d "${VENV_DIR}" ]; then
+    log "creating Python 3.10 venv at ${VENV_DIR}..."
+    uv venv -p 3.10 "${VENV_DIR}"
+  else
+    log "venv already exists at ${VENV_DIR}"
+  fi
+
+  # shellcheck disable=SC1091
+  source "${VENV_DIR}/bin/activate"
+  uv pip install --quiet --upgrade pip
+fi
+
+# -------------------------------------------------------------------------
+# 4-7) LightLLM deps, vllm + editable install, patches, verify
+# -------------------------------------------------------------------------
+if [ "${SKIP_LIGHTLLM}" = "0" ]; then
+  log "filtering upstream requirements (dropping nixl, cchardet)..."
+  grep -v "^nixl\|^cchardet" "${LIGHTLLM_DIR}/requirements.txt" > "${REQ_OUT}"
+
+  # Skip the pip phase if lightllm and its key dependencies already import
+  # (versions are not re-checked). Also verify the editable install points at
+  # the current LIGHTLLM_DIR (not a stale pre-move path).
+  lightllm_path="$(python -c "import lightllm, os; print(os.path.dirname(os.path.dirname(lightllm.__file__)))" 2>/dev/null || echo "")"
+  if python -c "import lightllm, vllm, torch, flashinfer, sgl_kernel, xformers, pandas" >/dev/null 2>&1 \
+     && [ "${lightllm_path}" = "${LIGHTLLM_DIR}" ]; then
+    log "all required packages already installed — skipping pip phase"
+  else
+    if [ -n "${lightllm_path}" ] && [ "${lightllm_path}" != "${LIGHTLLM_DIR}" ]; then
+      log "existing lightllm install points at stale path ${lightllm_path} — reinstalling"
+    fi
+    log "installing LightLLM requirements (torch 2.9.1+cu128, flashinfer, sgl-kernel, xformers, ...)"
+    uv pip install -r "${REQ_OUT}"
+
+    log "installing vllm 0.16.0..."
+    uv pip install --no-cache-dir vllm==0.16.0
+
+    log "installing LightLLM (editable)..."
+    uv pip install --no-cache-dir -e "${LIGHTLLM_DIR}"
+
+    log "installing missing transitive deps (pandas)..."
+    uv pip install --quiet pandas
+  fi
+
+  if [ -d "${PATCHES_DIR}" ] && ls "${PATCHES_DIR}"/*.patch >/dev/null 2>&1; then
+    log "applying patches in ${PATCHES_DIR}..."
+    for p in "${PATCHES_DIR}"/*.patch; do
+      name="$(basename "${p}")"
+      if (cd "${LIGHTLLM_DIR}" && git apply --reverse --check "${p}" >/dev/null 2>&1); then
+        log "  ${name}: already applied"
+        continue
+      fi
+      if (cd "${LIGHTLLM_DIR}" && git apply --check "${p}" >/dev/null 2>&1); then
+        (cd "${LIGHTLLM_DIR}" && git apply "${p}")
+        log "  ${name}: applied"
+      else
+        err "  ${name}: patch does not apply cleanly — file may have drifted upstream"
+        err "         inspect manually: cd ${LIGHTLLM_DIR} && git apply --check ${p}"
+        exit 1
+      fi
+    done
+  else
+    log "no patches to apply"
+  fi
+
+  log "verifying imports..."
+  python - <<'PY'
+import torch, flashinfer, sgl_kernel, xformers, vllm, lightllm
+print(f"  torch      {torch.__version__}  cuda={torch.cuda.is_available()}")
+print(f"  vllm       {vllm.__version__}")
+print(f"  flashinfer ok")
+print(f"  sgl_kernel ok")
+print(f"  xformers   ok")
+print(f"  lightllm   ok")
+PY
+
+  log "verifying LightLLM CLI..."
+  # sed reads the whole help text, so the CLI cannot hit a broken pipe under pipefail.
+  python -m lightllm.server.api_server --help | sed -n '1,3p'
+fi
+
+# -------------------------------------------------------------------------
+# 8) Install EASI benchmark client venv (separate from .venv-lightllm)
+# -------------------------------------------------------------------------
+# Delegates to EASI's own setup.sh, which creates EASI/.venv with Python 3.11
+# and installs -e VLMEvalKit -e lmms-eval + pinned deps + flash-attn.
+if [ "${SKIP_EASI}" = "1" ]; then
+  log "skipping EASI client install (--skip-easi)"
+elif [ -d "${EASI_VENV}" ] && "${EASI_VENV}/bin/python" -c "import vlmeval" >/dev/null 2>&1; then
+  log "EASI client venv already has vlmeval — skipping install"
+else
+  log "installing EASI client deps (creates ${EASI_VENV}, may take several minutes)..."
+  log "  (delegating to ${EASI_DIR}/scripts/setup.sh)"
+  # Deactivate LightLLM venv if active so EASI's setup.sh creates its own cleanly.
+  deactivate 2>/dev/null || true
+  (cd "${EASI_DIR}" && bash scripts/setup.sh)
+  # Re-activate LightLLM venv only if we set it up earlier.
+  if [ "${SKIP_LIGHTLLM}" = "0" ] && [ -d "${VENV_DIR}" ]; then
+    # shellcheck disable=SC1091
+    source "${VENV_DIR}/bin/activate"
+  fi
+fi
+
+# -------------------------------------------------------------------------
+# 9) Register SenseNova-U1 endpoints with EASI's VLMEvalKit
+# -------------------------------------------------------------------------
+# Two-step wire-up, both idempotent:
+#   (a) copy evaluation/easi/config/sensenova_models.py into VLMEvalKit/vlmeval/
+#   (b) apply evaluation/easi/patches/easi_sensenova_config.patch to
+#       VLMEvalKit/vlmeval/config.py so it imports those entries and updates
+#       supported_VLM at module load.
+# Edit the endpoint/port/max_tokens values in config/sensenova_models.py
+# then re-run this script to propagate.
+if [ "${SKIP_REGISTER}" = "1" ]; then
+  log "skipping endpoint registration (--skip-register)"
+elif [ ! -d "${EASI_VLMEVAL_DIR}" ]; then
+  log "skipping endpoint registration (VLMEvalKit submodule not initialized)"
+else
+  log "registering SenseNova-U1 endpoints in VLMEvalKit..."
+
+  # (a) copy the editable config module
+  src="${EASI_CONFIG_DIR}/sensenova_models.py"
+  dst="${EASI_VLMEVAL_DIR}/vlmeval/sensenova_models.py"
+  if [ ! -f "${src}" ]; then
+    err "  missing ${src} — can't register endpoints"
+    exit 1
+  fi
+  cp -f "${src}" "${dst}"
+  log "  copied sensenova_models.py -> ${dst}"
+
+  # (b) apply the config.py hook patch (idempotent)
+  patch_file="${EASI_PATCHES_DIR}/easi_sensenova_config.patch"
+  if [ ! -f "${patch_file}" ]; then
+    err "  missing ${patch_file}"
+    exit 1
+  fi
+  if (cd "${EASI_VLMEVAL_DIR}" && git apply --reverse --check "${patch_file}" >/dev/null 2>&1); then
+    log "  easi_sensenova_config.patch: already applied"
+  elif (cd "${EASI_VLMEVAL_DIR}" && git apply --check "${patch_file}" >/dev/null 2>&1); then
+    (cd "${EASI_VLMEVAL_DIR}" && git apply "${patch_file}")
+    log "  easi_sensenova_config.patch: applied"
+  else
+    err "  easi_sensenova_config.patch does not apply cleanly — inspect manually:"
+    err "    cd ${EASI_VLMEVAL_DIR} && git apply --check ${patch_file}"
+    exit 1
+  fi
+fi
+
+log "done."
+log ""
+log "next steps:"
+if [ "${SKIP_LIGHTLLM}" = "1" ]; then
+  log "  - point config/sensenova_models.py at YOUR endpoint, then propagate:"
+  log "      \$EDITOR evaluation/easi/config/sensenova_models.py"
+  log "      bash evaluation/easi/scripts/setup.sh --skip-lightllm --skip-easi"
+  log ""
+  log "  - run a benchmark:"
+  log "      source evaluation/easi/EASI/.venv/bin/activate"
+  log "      cd evaluation/easi/EASI"
+  log "      python scripts/submissions/run_easi_eval.py --model SenseNova-U1-8B-MoT-Local --benchmarks blink"
+else
+  log "  - launch server (weights auto-downloaded on first call):"
+  log "      bash evaluation/easi/scripts/serve.sh                 # 8b-mot → port 8000"
+  log "      # or multi-replica DP (same script, DP env flips mode):"
+  log "      DP=4 TP=2 bash evaluation/easi/scripts/serve.sh"
+  log ""
+  log "  - run a benchmark (from a second shell, after server is up):"
+  log "      source evaluation/easi/EASI/.venv/bin/activate"
+  log "      cd evaluation/easi/EASI"
+  log "      python scripts/submissions/run_easi_eval.py --model SenseNova-U1-8B-MoT-Local --benchmarks blink"
+fi
--- a/SenseNova-U1/evaluation/gen/bizgeneval/data/test.jsonl
+++ b/SenseNova-U1/evaluation/gen/bizgeneval/data/test.jsonl
--- a/SenseNova-U1/evaluation/gen/bizgeneval/eval_images_bizgeneval.py
+++ b/SenseNova-U1/evaluation/gen/bizgeneval/eval_images_bizgeneval.py
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+import threading
+from collections import defaultdict
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+from statistics import mean
+from typing import Any
+
+if __package__ in {None, ""}:
+    repo_root = Path(__file__).resolve().parents[3]
+    sys.path.insert(0, str(repo_root))
+
+from evaluation.gen.bizgeneval.eval_prompt import EVAL_GENERATION_PROMPTS as EVAL_PROMPTS
+from evaluation.gen.common.judge import JudgeClient
+
+try:
+    from tqdm import tqdm
+except ImportError:
+    tqdm = None
+
+DEFAULT_DATA_PATH = Path(__file__).resolve().parent / "data" / "test.jsonl"
+ERROR_ALPHA = 0.1
+
+# Reference:
+#   BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
+#   https://arxiv.org/abs/2603.25732
+
+
+def _parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Evaluate BizGenEval images with a Gemini/OpenAI-compatible judge.")
+    parser.add_argument("--image-dir", required=True, help="Directory containing generated BizGenEval images.")
+    parser.add_argument("--output-dir", required=True, help="Directory to save per-item results and summary.")
+    parser.add_argument("--data-path", default=str(DEFAULT_DATA_PATH), help="BizGenEval JSONL path.")
+    parser.add_argument("--api-base", required=True)
+    parser.add_argument("--api-key", required=True)
+    parser.add_argument("--judge-model", required=True)
+    parser.add_argument("--timeout", type=int, default=240)
+    parser.add_argument("--concurrency", type=int, default=8)
+    parser.add_argument("--force-rerun", action="store_true")
+    return parser.parse_args()
+
+
+def _to_bool(value: Any) -> bool | None:
+    if isinstance(value, bool):
+        return value
+    if isinstance(value, str):
+        text = value.strip().lower()
+        if text in {"true", "1", "yes"}:
+            return True
+        if text in {"false", "0", "no"}:
+            return False
+    if isinstance(value, int) and value in {0, 1}:
+        return bool(value)
+    return None
+
+
+def _strip_json_fence(text: str) -> str:
+    content = (text or "").strip()
+    if content.startswith("```json"):
+        content = content[7:]
+    elif content.startswith("```"):
+        content = content[3:]
+    if content.endswith("```"):
+        content = content[:-3]
+    return content.strip()
+
+
+def _parse_json_safe(text: str) -> dict[str, Any]:
+    content = _strip_json_fence(text)
+    try:
+        return json.loads(content)
+    except Exception:
+        pass
+    match = re.search(r"\{.*\}", content, flags=re.DOTALL)
+    if match:
+        extracted = match.group(0)
+        try:
+            return json.loads(extracted)
+        except Exception:
+            pass
+        normalized = re.sub(r"\bTrue\b", "true", extracted)
+        normalized = re.sub(r"\bFalse\b", "false", normalized)
+        normalized = re.sub(r"\bNone\b", "null", normalized)
+        try:
+            return json.loads(normalized)
+        except Exception:
+            pass
+    raise ValueError("failed to parse judge JSON")
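+# Illustrative walk-through of the fallbacks above (the input is a made-up
+# judge reply, not taken from a real run):
+#   'Answer: {"1": {"result": True, "reason": "ok"}}'
+#   1. json.loads on the whole text fails because of the leading prose.
+#   2. The regex keeps the span from the first "{" to the last "}", but
+#      json.loads still fails on the Python literal True.
+#   3. True/False/None are rewritten to true/false/null and the parse succeeds,
+#      giving {"1": {"result": True, "reason": "ok"}}.
+# Caveat: step 3 is a plain word-boundary substitution, so it would also
+# rewrite those words if they appear inside a string value such as "reason".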
+
+
+def _extract_results_only(raw_text: str, n_questions: int) -> dict[str, Any] | None:
+    if not isinstance(raw_text, str) or n_questions <= 0:
+        return None
+    matches = re.findall(
+        r'"result"\s*:\s*(true|false|True|False|"true"|"false"|1|0)',
+        raw_text,
+        flags=re.DOTALL,
+    )
+    if len(matches) < n_questions:
+        return None
+    parsed: dict[str, Any] = {}
+    for idx, match in enumerate(matches[:n_questions], start=1):
+        val = _to_bool(str(match).strip().strip('"'))
+        if val is None:
+            return None
+        parsed[str(idx)] = {"result": val}
+    return parsed
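+# Illustrative example (made-up judge reply, cut off before the JSON closes):
+#   raw_text    = '{"1": {"result": true, "reason": "ok"}, "2": {"result": "false", "rea'
+#   n_questions = 2
+#   returns {"1": {"result": True}, "2": {"result": False}}
+# Matches are assigned to questions by position, so this fallback assumes the
+# judge answered in checklist order. With fewer than n_questions matches it
+# returns None.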
+
+
+def _format_checklist(questions: list[str]) -> str:
+    return "\n".join(f"{idx}. {question}" for idx, question in enumerate(questions, start=1))
+
+
+def _render_prompt(user_template: str, questions: list[str]) -> str:
+    kwargs = {"checklist": _format_checklist(questions)}
+    if "{expected_count}" in user_template:
+        kwargs["expected_count"] = len(questions)
+    if "{required_keys}" in user_template:
+        kwargs["required_keys"] = ", ".join(str(i) for i in range(1, len(questions) + 1))
+    return user_template.format(**kwargs)
+
+
+def _compute_item_scores(
+    item: dict[str, Any],
+    meta_info: dict[str, dict[str, Any]],
+    error_alpha: float,
+    qidxs_key: str | None = None,
+) -> dict[str, float | int]:
+    questions = item.get("questions", [])
+    qidxs = list(item.get(qidxs_key, []) or []) if qidxs_key else list(range(1, len(questions) + 1))
+    if not qidxs:
+        return {"accuracy": 0.0, "error_score": 0.0, "errors": 0, "n_questions": 0}
+    errors = 0
+    for qidx in qidxs:
+        if meta_info.get(str(qidx), {}).get("result") is not True:
+            errors += 1
+    n_questions = len(qidxs)
+    accuracy = (n_questions - errors) / n_questions
+    error_score = max(0.0, 1.0 - error_alpha * errors)
+    return {
+        "accuracy": accuracy,
+        "error_score": error_score,
+        "errors": errors,
+        "n_questions": n_questions,
+    }
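+# Worked example with illustrative numbers: 4 questions, 1 failed check.
+#   error_alpha=0.1 -> accuracy = (4 - 1) / 4 = 0.75, error_score = 1 - 0.1 * 1 = 0.9
+#   error_alpha=0.2 -> accuracy = 0.75,               error_score = 1 - 0.2 * 1 = 0.8
+# error_score is clamped at 0.0, so with error_alpha=0.1 any item with 10 or
+# more failed checks scores 0.0. A missing or non-True "result" counts as a failure.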
+
+
+def _aggregate(records: list[dict[str, Any]], group_key: str) -> dict[str, dict[str, float | int]]:
+    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    for record in records:
+        grouped[str(record[group_key])].append(record)
+    summary: dict[str, dict[str, float | int]] = {}
+    for key, items in sorted(grouped.items()):
+        summary[key] = {
+            "count": len(items),
+            "accuracy": mean([float(item["accuracy"]) for item in items]) if items else 0.0,
+            "error_score": mean([float(item["error_score"]) for item in items]) if items else 0.0,
+        }
+    return summary
+
+
+def _load_items(data_path: Path) -> list[dict[str, Any]]:
+    items = []
+    with data_path.open(encoding="utf-8") as f:
+        for prompt_id, line in enumerate(f):
+            line = line.strip()
+            if not line:
+                continue
+            item = json.loads(line)
+            item["_prompt_id"] = prompt_id
+            items.append(item)
+    return items
+
+
+def _resolve_image_path(image_dir: Path, prompt_id: int) -> Path | None:
+    direct = image_dir / f"{prompt_id:04d}.png"
+    if direct.exists():
+        return direct
+    repeat0 = image_dir / f"{prompt_id:04d}_0.png"
+    if repeat0.exists():
+        return repeat0
+    return None
+
+
+def _is_complete(path: Path) -> bool:
+    if not path.exists():
+        return False
+    try:
+        with path.open(encoding="utf-8") as f:
+            data = json.load(f)
+        meta = data.get("meta_info") or {}
+        n_questions = int(data.get("n_questions", 0))
+        return (
+            data.get("accuracy") is not None
+            and len(meta) == n_questions
+            and all(isinstance(v, dict) and "result" in v for v in meta.values())
+        )
+    except Exception:
+        return False
+
+
+def _record_error(prompt_id: int, exc: Exception) -> None:
+    print(f"[warn] prompt_id={prompt_id} failed: {type(exc).__name__}: {exc}")
+
+
+def eval_one(
+    item: dict[str, Any],
+    *,
+    image_dir: Path,
+    items_dir: Path,
+    client: JudgeClient,
+    error_alpha: float,
+    force_rerun: bool,
+    write_lock: threading.Lock,
+) -> dict[str, Any] | None:
+    prompt_id = int(item["_prompt_id"])
+    dataset_id = item.get("id")
+    image_path = _resolve_image_path(image_dir, prompt_id)
+    if image_path is None:
+        print(f"[warn] missing image for prompt_id={prompt_id}")
+        return None
+
+    cache_path = items_dir / f"{prompt_id:04d}.json"
+    if not force_rerun and _is_complete(cache_path):
+        with cache_path.open(encoding="utf-8") as f:
+            cached = json.load(f)
+        cached["_cached"] = True
+        return cached
+
+    eval_tag = str(item.get("eval_tag") or item.get("dimension") or "").strip()
+    if eval_tag not in EVAL_PROMPTS:
+        print(f"[warn] skip prompt_id={prompt_id}: unsupported eval_tag={eval_tag!r}")
+        return None
+    questions = list(item.get("questions") or [])
+    if not questions:
+        print(f"[warn] skip prompt_id={prompt_id}: empty questions")
+        return None
+
+    system_prompt, user_template = EVAL_PROMPTS[eval_tag]
+    raw_output = client.judge_image_text(
+        image_path=image_path,
+        system_prompt=system_prompt,
+        user_prompt=_render_prompt(user_template, questions),
+    ).strip()
+
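+    try:
+        parsed = _parse_json_safe(raw_output)
+    except Exception:
+        # Strict parsing failed: fall back to _extract_results_only. If that
+        # also yields nothing, the empty dict marks every question as missing.
+        parsed = _extract_results_only(raw_output, len(questions)) or {}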
+
+    meta_info: dict[str, dict[str, Any]] = {}
+    for idx, question in enumerate(questions, start=1):
+        rec = parsed.get(str(idx)) if isinstance(parsed, dict) else None
+        if not isinstance(rec, dict):
+            meta_info[str(idx)] = {
+                "result": False,
+                "raw_description": question,
+                "reason": "missing_from_output",
+            }
+            continue
+        val = _to_bool(rec.get("result"))
+        meta_info[str(idx)] = {
+            "result": bool(val) if isinstance(val, bool) else False,
+            "raw_description": question,
+            "reason": rec.get("reason", ""),
+        }
+
+    overall = _compute_item_scores(item, meta_info, error_alpha)
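+    # The easy/hard subsets are scored with twice the overall alpha (reported
+    # in the summary as easy_hard_error_alpha).
+    easy = _compute_item_scores(item, meta_info, error_alpha * 2, "easy_qidxs")
+    hard = _compute_item_scores(item, meta_info, error_alpha * 2, "hard_qidxs")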
+    result = {
+        "prompt_id": prompt_id,
+        "dataset_id": dataset_id,
+        "domain": item.get("domain", ""),
+        "dimension": item.get("dimension", ""),
+        "eval_tag": eval_tag,
+        "prompt": item.get("prompt", ""),
+        "image_path": str(image_path),
+        "n_questions": len(questions),
+        "accuracy": overall["accuracy"],
+        "error_score": overall["error_score"],
+        "easy_accuracy": easy["accuracy"],
+        "easy_error_score": easy["error_score"],
+        "hard_accuracy": hard["accuracy"],
+        "hard_error_score": hard["error_score"],
+        "meta_info": meta_info,
+        "raw_model_response": raw_output,
+    }
+    with write_lock:
+        with cache_path.open("w", encoding="utf-8") as f:
+            json.dump(result, f, ensure_ascii=False, indent=2)
+    return result
+
+
+def main() -> None:
+    args = _parse_args()
+    image_dir = Path(args.image_dir).resolve()
+    output_dir = Path(args.output_dir).resolve()
+    items_dir = output_dir / "items"
+    output_dir.mkdir(parents=True, exist_ok=True)
+    items_dir.mkdir(parents=True, exist_ok=True)
+
+    client = JudgeClient(
+        api_base=args.api_base,
+        api_key=args.api_key,
+        model=args.judge_model,
+        timeout=args.timeout,
+    )
+    items = _load_items(Path(args.data_path).resolve())
+    print(
+        f"[bizgeneval] items={len(items)} concurrency={args.concurrency} "
+        f"judge_model={client.model} image_dir={image_dir}"
+    )
+
+    write_lock = threading.Lock()
+
+    tasks = list(items)
+    results: list[dict[str, Any] | None] = []
+
+    def _result_status(result: dict[str, Any] | None) -> str:
+        if result is None:
+            return "skipped"
+        if result.get("_cached"):
+            return "cached"
+        return "done"
+
+    max_workers = max(1, args.concurrency)
+    with ThreadPoolExecutor(max_workers=max_workers) as pool:
+        futures = {
+            pool.submit(
+                eval_one,
+                item,
+                image_dir=image_dir,
+                items_dir=items_dir,
+                client=client,
+                error_alpha=ERROR_ALPHA,
+                force_rerun=args.force_rerun,
+                write_lock=write_lock,
+            ): item
+            for item in tasks
+        }
+        total = len(futures)
+        progress = tqdm(total=total, desc="bizgeneval eval", dynamic_ncols=True) if tqdm else None
+        try:
+            for done, future in enumerate(as_completed(futures), start=1):
+                item = futures[future]
+                try:
+                    result = future.result()
+                except Exception as exc:
+                    _record_error(int(item["_prompt_id"]), exc)
+                    result = None
+                results.append(result)
+                status = _result_status(result)
+                if progress is not None:
+                    progress.update(1)
+                    progress.set_postfix_str(f"prompt_id={item['_prompt_id']} {status}")
+                else:
+                    print(f"[{done}/{total}] prompt_id={item['_prompt_id']} {status}")
+        finally:
+            if progress is not None:
+                progress.close()
+
+    valid_results = [result for result in results if result]
+    if not valid_results:
+        raise RuntimeError("BizGenEval produced no valid evaluation results.")
+
+    final_records = [
+        {
+            "prompt_id": int(result["prompt_id"]),
+            "dataset_id": result["dataset_id"],
+            "domain": result["domain"],
+            "dimension": result["dimension"],
+            "accuracy": float(result["accuracy"]),
+            "error_score": float(result["error_score"]),
+            "easy_accuracy": float(result["easy_accuracy"]),
+            "easy_error_score": float(result["easy_error_score"]),
+            "hard_accuracy": float(result["hard_accuracy"]),
+            "hard_error_score": float(result["hard_error_score"]),
+            "n_questions": int(result["n_questions"]),
+        }
+        for result in sorted(valid_results, key=lambda item: int(item["prompt_id"]))
+    ]
+
+    by_domain = _aggregate(final_records, "domain")
+    by_dimension = _aggregate(final_records, "dimension")
+    overall_accuracy = mean([record["accuracy"] for record in final_records]) if final_records else 0.0
+    overall_error_score = mean([record["error_score"] for record in final_records]) if final_records else 0.0
+    easy_accuracy = mean([record["easy_accuracy"] for record in final_records]) if final_records else 0.0
+    easy_error_score = mean([record["easy_error_score"] for record in final_records]) if final_records else 0.0
+    hard_accuracy = mean([record["hard_accuracy"] for record in final_records]) if final_records else 0.0
+    hard_error_score = mean([record["hard_error_score"] for record in final_records]) if final_records else 0.0
+
+    summary = {
+        "benchmark": "bizgeneval",
+        "data_path": str(Path(args.data_path).resolve()),
+        "eval_provider": "judge_client",
+        "judge_model": client.model,
+        "error_alpha": ERROR_ALPHA,
+        "easy_hard_error_alpha": ERROR_ALPHA * 2,
+        "items": len(final_records),
+        "overall_accuracy": overall_accuracy,
+        "overall_error_score": overall_error_score,
+        "easy_accuracy": easy_accuracy,
+        "easy_error_score": easy_error_score,
+        "hard_accuracy": hard_accuracy,
+        "hard_error_score": hard_error_score,
+        "by_domain": by_domain,
+        "by_dimension": by_dimension,
+        "records": final_records,
+    }
+
+    summary_path = output_dir / "bizgeneval_summary.json"
+    with summary_path.open("w", encoding="utf-8") as f:
+        json.dump(summary, f, ensure_ascii=False, indent=2)
+
+    print(
+        f"[bizgeneval] overall_error_score={overall_error_score:.4f} "
+        f"overall_accuracy={overall_accuracy:.4f} "
+        f"easy_error_score={easy_error_score:.4f} "
+        f"hard_error_score={hard_error_score:.4f}"
+    )
+    print(f"[bizgeneval] summary saved -> {summary_path}")
+
+
+if __name__ == "__main__":
+    main()
--- a/SenseNova-U1/evaluation/gen/bizgeneval/eval_prompt.py
+++ b/SenseNova-U1/evaluation/gen/bizgeneval/eval_prompt.py
+# Reference:
+#   Evaluation prompt adapted from BizGenEval:
+#   BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
+#   https://arxiv.org/abs/2603.25732
+
+ATTRIBUTE_PROMPT_SYSTEM_V2 = """
+You are an expert visual attribute evaluator.
+
+Your task is to determine whether each given description is true or false based strictly on the provided image.
+
+Evaluation Rules:
+
+1. Base your judgment ONLY on visible evidence in the image.
+2. Do NOT rely on assumptions, common sense, or inferred intent.
+3. If the attribute is clearly satisfied, return True.
+4. If the attribute is clearly violated, return False.
+5. If the attribute cannot be determined with certainty from the image, return False.
+6. Be strict about:
+   - Exact colors (approximate matches are ok for similar colors, e.g., dark gray vs black, light blue vs blue, but not for distinct colors like red vs green)
+   - Exact counts
+   - Relative sizes and proportions
+   - Shape types
+   - Line styles (solid, dashed, dotted)
+   - Font types (if distinguishable)
+
+# ⚠️ Output Format (Strict JSON Only)
+
+Your output must be valid JSON and follow this structure exactly:
+
+{
+  "1": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise visual evidence explanation>"
+  },
+  "2": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise visual evidence explanation>"
+  }
+}
+
+Requirements:
+- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
+- Do not include extra commentary or speculation.
+- Output valid JSON only. No extra fields. Do not output anything outside JSON.
+"""
+
+
+ATTRIBUTE_USER_TEMPLATE_V2 = """
+Evaluate the following descriptions based on the image:
+
+{checklist}
+"""
+
+LAYOUT_PROMPT_SYSTEM_V1 = """
+You are an expert layout evaluator.
+
+Your task is to determine whether each given description is true or false based strictly on the provided image.
+
+Evaluation Rules:
+
+1. Base your judgment ONLY on visible spatial evidence in the image.
+2. Do NOT rely on assumptions, common sense, or inferred intent.
+3. If the layout relationship is clearly satisfied, return True.
+4. If the layout relationship is clearly violated, return False.
+5. Be strict about:
+   - Relative position (above, below, left, right)
+   - Arrangement direction (horizontal, vertical, grid)
+   - Section hierarchy (header at top, footer at bottom, sidebar on left)
+   - Alignment (left-aligned, centered, right-aligned)
+   - Grouping and containment (elements inside a container)
+   - Discrete structural counts (two columns, three stacked cards)
+
+# ⚠️ Output Format (Strict JSON Only)
+
+Your output must be valid JSON and follow this structure exactly:
+
+{
+  "1": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise spatial evidence explanation>"
+  },
+  "2": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise spatial evidence explanation>"
+  }
+}
+
+Requirements:
+- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
+- Do not include extra commentary or speculation.
+- Output valid JSON only. No extra fields. Do not output anything outside JSON.
+"""
+
+
+LAYOUT_USER_TEMPLATE = """
+Evaluate the following layout descriptions based on the image:
+
+{checklist}
+"""
+
+
+TEXT_EVALUATION_PROMPT_SYSTEM_V1 = """
+You are an expert character-level text evaluator.
+
+Your task is to determine whether each given description is true or false based strictly on the provided image and its textual content.
+
+Evaluation Rules:
+
+1. Base your judgment ONLY on **visible text in the image**, including all letters, numbers, symbols, punctuation, and whitespace.
+2. Do NOT rely on assumptions, common sense, or inferred intent.
+3. For each description:
+   - If the text in the specified block exactly matches the description, return True.
+   - If there is any mismatch (even a single character, symbol, number, or space), return False.
+4. Be strict about:
+   - Exact character match (case-sensitive, punctuation-sensitive, spacing-sensitive)
+   - Formulas and scientific symbols (Greek letters, superscripts/subscripts, operators)
+   - Numbers and table values
+   - Entire text block content (paragraph, list, table row/column, formula)
+   - Absolute position and context (as specified in the description)
+
+# ⚠️ Output Format (Strict JSON Only)
+
+Your output must be valid JSON and follow this structure exactly:
+
+{
+  "1": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise text-based evidence explanation>"
+  },
+  "2": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise text-based evidence explanation>"
+  }
+}
+
+Requirements:
+- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
+- Do not include extra commentary or speculation.
+- Output valid JSON only. No extra fields. Do not output anything outside JSON.
+"""
+
+
+TEXT_EVALUATION_PROMPT_SYSTEM_V2 = """
+You are an expert character-level text evaluator.
+
+Your task is to determine whether each given description is true or false based strictly on the provided image and its textual content.
+
+Evaluation Rules:
+
+1. Base your judgment ONLY on **visible text in the image**, including all letters, numbers, symbols, punctuation, and whitespace.
+2. Do NOT rely on assumptions, common sense, or inferred intent.
+3. For each description:
+   - If the text in the specified block exactly matches the description, return True.
+   - If there is any mismatch within the core sentence or word content (even a single character, symbol, or number inside a word or sentence), return False.
+   - Minor formatting differences at the boundaries (e.g., leading bullet points such as "-" or "•", and a trailing period ".", "?", "!") should be ignored and still considered True.
+4. Be strict about:
+   - Exact character match (case-sensitive, punctuation-sensitive, spacing-sensitive)
+   - Formulas and scientific symbols (Greek letters, superscripts/subscripts, operators)
+   - Numbers and table values
+   - Entire text block content (paragraph, list, table row/column, formula)
+   - Absolute position and context (as specified in the description)
+
+# ⚠️ Output Format (Strict JSON Only)
+
+Your output must be valid JSON and follow this structure exactly:
+
+{
+  "1": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise text-based evidence explanation>"
+  },
+  "2": {
+    "result": True/False,
+    "raw_description": "<original question>",
+    "reason": "<concise text-based evidence explanation>"
+  }
+}
+
+Requirements:
+- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
+- Do not include extra commentary or speculation.
+- Output valid JSON only. No extra fields. Do not output anything outside JSON.
+"""
+
+TEXT_EVALUATION_USER_TEMPLATE = """
+Evaluate the following text descriptions based on the image content and absolute block positions:
+
+{checklist}
+"""
+
+KNOWLEDGE_PROMPT_SYSTEM_V1 = """
+You are an expert factual-and-reasoning evaluator for chart/diagram/poster/webpage/slides images.
+
+Your task is to determine whether each given Yes/No checklist question is true or false based on the provided image.
+
+Evaluation Rules:
+
+1. Judge using visible image evidence plus standard domain knowledge (math, physics, chemistry, history, engineering, etc.).
+2. For each question:
+   - Return True only if the statement is clearly correct.
+   - Return False if it is incorrect, inconsistent, implausible, or not verifiable from the image.
+3. Be strict about:
+   - Numeric correctness (arithmetic, units, ranges, proportions)
+   - Equation correctness (balance, signs, symbols, consistency with text/chart)
+   - Cross-panel/internal consistency (chart vs table vs annotation vs diagram)
+   - Historical/scientific plausibility
+4. If content is missing/ambiguous/illegible, return False.
+5. Do not give partial credit.
+
+# Output Format (Strict JSON Only)
+
+Your output must be valid JSON and follow this structure exactly:
+
+{
+  "1": {
+    "result": True/False,
+    "raw_question": "<original question>",
+    "reason": "<concise evidence-based explanation>"
+  },
+  "2": {
+    "result": True/False,
+    "raw_question": "<original question>",
+    "reason": "<concise evidence-based explanation>"
+  }
+}
+
+Requirements:
+- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
+- Do not include extra commentary or speculation.
+- Output valid JSON only. No extra fields. Do not output anything outside JSON.
+"""
+
+
+KNOWLEDGE_USER_TEMPLATE_V1 = """
+Evaluate the following Yes/No knowledge/reasoning questions based on the image:
+
+{checklist}
+"""
+
+
+CHART_USER_TEMPLATE = """
+Evaluate the following chart statements based on the image:
+
+{checklist}
+
+Strict output coverage requirement:
+- There are exactly {expected_count} statements above.
+- Return a JSON object containing ALL keys from 1 to {expected_count} (no missing indices).
+- Required keys: {required_keys}
+"""
+
+
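+# Maps an eval tag to its (system prompt, user template) pair. "text" uses the
+# V2 prompt, which ignores leading bullets and trailing punctuation; "text1"
+# keeps the stricter V1 prompt. CHART_USER_TEMPLATE is not registered here.
+EVAL_GENERATION_PROMPTS = {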
+    "attribute": (ATTRIBUTE_PROMPT_SYSTEM_V2, ATTRIBUTE_USER_TEMPLATE_V2),
+    "layout": (LAYOUT_PROMPT_SYSTEM_V1, LAYOUT_USER_TEMPLATE),
+    "text": (TEXT_EVALUATION_PROMPT_SYSTEM_V2, TEXT_EVALUATION_USER_TEMPLATE),
+    "knowledge": (KNOWLEDGE_PROMPT_SYSTEM_V1, KNOWLEDGE_USER_TEMPLATE_V1),
+    "text1": (TEXT_EVALUATION_PROMPT_SYSTEM_V1, TEXT_EVALUATION_USER_TEMPLATE),
+}
--- a/SenseNova-U1/evaluation/gen/bizgeneval/gen_images_bizgeneval.py
+++ b/SenseNova-U1/evaluation/gen/bizgeneval/gen_images_bizgeneval.py
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from typing import Any
+
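+# When run directly as a script, put the repo root and its src/ directory on
+# sys.path so that `examples` and `sensenova_u1` can be imported.
+if __package__ in {None, ""}: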
+    repo_root = Path(__file__).resolve().parents[3]
+    sys.path.insert(0, str(repo_root))
+    sys.path.insert(0, str(repo_root / "src"))
+
+import torch
+
+import sensenova_u1
+from examples.t2i.inference import SenseNovaU1T2I
+
+DEFAULT_DATA_PATH = Path(__file__).resolve().parent / "data" / "test.jsonl"
+DEFAULT_ASPECT_RATIO = "1:1"
+RATIO_LONG_SIDE = 2048
+DEFAULT_ATTN_BACKEND = "auto"
+
+
+def _parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Generate BizGenEval images with SenseNova-U1.")
+    parser.add_argument("--model-path", required=True, help="Local checkpoint path or HF model id.")
+    parser.add_argument("--output-dir", required=True, help="Directory to save generated images.")
+    parser.add_argument("--data-path", default=str(DEFAULT_DATA_PATH), help="BizGenEval JSONL path.")
+    parser.add_argument("--device", default="cuda")
+    parser.add_argument("--dtype", default="bfloat16", choices=["bfloat16", "float16", "float32"])
+    parser.add_argument("--cfg-scale", type=float, default=4.0)
+    parser.add_argument("--cfg-norm", default="none", choices=["none", "global", "channel", "cfg_zero_star"])
+    parser.add_argument("--timestep-shift", type=float, default=3.0)
+    parser.add_argument("--num-steps", type=int, default=50)
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument("--attn-backend", default=DEFAULT_ATTN_BACKEND, choices=["auto", "flash", "sdpa"])
+    return parser.parse_args()
+
+
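+def _load_items(*, data_path: Path) -> list[dict[str, Any]]:
+    """Load the BizGenEval JSONL file, tagging each item with `_prompt_id`.
+
+    `_prompt_id` is the zero-based line index in the file; blank lines are
+    skipped but still counted, so ids may have gaps.
+    """
+    items = []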
+    with data_path.open(encoding="utf-8") as f:
+        for prompt_id, line in enumerate(f):
+            if not line.strip():
+                continue
+            item = json.loads(line)
+            item["_prompt_id"] = prompt_id
+            items.append(item)
+    return items
+
+
+def _resolve_dtype(name: str) -> torch.dtype:
+    return {
+        "bfloat16": torch.bfloat16,
+        "float16": torch.float16,
+        "float32": torch.float32,
+    }[name]
+
+
+def _round_by_factor(number: float, factor: int) -> int:
+    """Round `number` to the nearest multiple of `factor`."""
+    return round(number / factor) * factor
+
+
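+def _parse_ratio(value: object) -> tuple[int, int] | None:
+    """Parse an aspect ratio given as "WxH", "W:H", or a [W, H] sequence.
+
+    Returns None for anything unparseable or non-positive.
+    """
+    try: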
+        if isinstance(value, str):
+            lower = value.lower()
+            if "x" in lower:
+                rw, rh = [int(x) for x in lower.split("x", 1)]
+            elif ":" in value:
+                rw, rh = [int(x) for x in value.split(":", 1)]
+            else:
+                return None
+        elif isinstance(value, (list, tuple)) and len(value) >= 2:
+            rw, rh = int(value[0]), int(value[1])
+        else:
+            return None
+    except Exception:
+        return None
+    if rw <= 0 or rh <= 0:
+        return None
+    return rw, rh
+
+
+def _dims_from_ratio_long_side(
+    ratio: tuple[int, int],
+    long_side: int,
+    factor: int = 32,
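+) -> tuple[int, int]:
+    """Return (width, height) whose longer side is `long_side`.
+
+    Both sides are rounded to a multiple of `factor` and are at least
+    `factor`. For example, ratio (16, 9) with long_side=2048 gives (2048, 1152).
+    """
+    rw, rh = ratio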
+    if rw >= rh:
+        width = max(factor, _round_by_factor(long_side, factor))
+        height = max(factor, _round_by_factor(long_side * rh / rw, factor))
+    else:
+        height = max(factor, _round_by_factor(long_side, factor))
+        width = max(factor, _round_by_factor(long_side * rw / rh, factor))
+    return int(width), int(height)
+
+
+def _resolve_image_size(item: dict[str, Any], *, default_aspect_ratio: str, ratio_long_side: int) -> tuple[int, int]:
+    ratio = _parse_ratio(item.get("aspect_ratio")) or _parse_ratio(default_aspect_ratio)
+    if ratio is None:
+        raise ValueError(
+            f"Failed to resolve aspect ratio for prompt_id={item.get('_prompt_id')}: "
+            f"aspect_ratio={item.get('aspect_ratio')!r}, default_aspect_ratio={default_aspect_ratio!r}"
+        )
+    return _dims_from_ratio_long_side(ratio, ratio_long_side)
+
+
+def main() -> None:
+    args = _parse_args()
+    data_path = Path(args.data_path).resolve()
+    output_dir = Path(args.output_dir).resolve()
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    items = _load_items(data_path=data_path)
+    print(f"[bizgeneval] loaded {len(items)} items from {data_path}")
+
+    sensenova_u1.set_attn_backend(args.attn_backend)
+    print(f"[bizgeneval] attn_backend={args.attn_backend!r} effective={sensenova_u1.effective_attn_backend()!r}")
+
+    engine = SenseNovaU1T2I(
+        model_path=args.model_path,
+        device=args.device,
+        dtype=_resolve_dtype(args.dtype),
+    )
+
+    generated = 0
+    skipped = 0
+
+    for item in items:
+        prompt_id = int(item["_prompt_id"])
+        width, height = _resolve_image_size(
+            item,
+            default_aspect_ratio=DEFAULT_ASPECT_RATIO,
+            ratio_long_side=RATIO_LONG_SIDE,
+        )
+        out_path = output_dir / f"{prompt_id:04d}.png"
+        # Resume support: skip prompts whose image was already generated.
+        if out_path.exists():
+            skipped += 1
+            continue
+
+        images = engine.generate(
+            item["prompt"],
+            image_size=(width, height),
+            cfg_scale=args.cfg_scale,
+            cfg_norm=args.cfg_norm,
+            timestep_shift=args.timestep_shift,
+            num_steps=args.num_steps,
+            batch_size=1,
+            seed=args.seed,
+        )
+        images[0].save(out_path)
+        generated += 1
+        print(
+            f"[saved] prompt_id={prompt_id} "
+            f"size={width}x{height} domain={item.get('domain')} "
+            f"dimension={item.get('dimension')} -> {out_path}"
+        )
+
+    print(f"[bizgeneval] done: items={len(items)} generated={generated} skipped={skipped} output_dir={output_dir}")
+
+
+if __name__ == "__main__":
+    main()