# Evaluation
Benchmark reproduction scripts and guides for SenseNova-U1.
## Sections
- [Visual Understanding](docs/understanding.md) — reproduction scripts for visual understanding benchmarks
- [Image Generation](docs/image_generation.md) — reproduction scripts for image generation benchmarks
- [Interleaved Generation](docs/interleaved.md) — reproduction scripts for interleaved generation benchmarks
# Image Generation Evaluation
Reproduction scripts for SenseNova-U1 on image generation benchmarks. Each
benchmark lives in its own subfolder under [`evaluation/gen/`](../gen/) and
ships with a generation script, an evaluation script, and a shell launcher
wiring them together:
```
evaluation/gen/
├── bizgeneval/    # BizGenEval — business / infographic prompts
│   ├── gen_images_bizgeneval.py
│   ├── eval_images_bizgeneval.py
│   ├── run_bizgeneval.sh
│   └── data/test.jsonl
├── igenbench/     # IGenBench — general-purpose T2I benchmark
│   ├── gen_images_igenbench.py
│   ├── eval_images_igenbench.py
│   ├── run_igenbench.sh
│   └── data/*.json
├── longtext/      # LongText — long-text rendering (en / zh)
│   ├── gen_images_longtext.py
│   ├── eval_images_longtext.py
│   ├── run_longtext.sh
│   └── data/{text_prompts.jsonl,text_prompts_zh.jsonl}
├── cvtg/          # CVTG-2K — complex visual text generation
│   ├── eval_cvtg.py
│   ├── unified_metrics_eval.py
│   ├── sa_0_4_vit_l_14_linear.pth
│   ├── run_cvtgeval.sh
│   └── data/{CVTG,CVTG-Style}/{2..5}{,_combined}.json
└── tiif/          # TIIF-Bench — text-image instruction following
    ├── eval_tiif.py
    ├── run_tiifeval.sh
    ├── eval/{eval_with_vlm_mp,summary_results,summary_dimension_results}.py
    └── data/{testmini,test}{_prompts,_eval_prompts}/*.jsonl
```
Every benchmark follows the same two-stage flow: **generate images**, then
**evaluate them** (usually against an OpenAI-compatible judge model). The
shell launchers chain both stages, so the typical entry point is just:
```bash
bash evaluation/gen/<bench>/run_<bench>.sh
```
Edit the variables at the top of each launcher (model path, API key / base,
judge model, output dirs) before running.
## BizGenEval
Infographic / business-style prompts. Images are judged by an
OpenAI-compatible VLM (Gemini 3 Pro by default).
End-to-end:
```bash
bash evaluation/gen/bizgeneval/run_bizgeneval.sh
```
Or run the two stages manually:
```bash
# 1) Generate
python evaluation/gen/bizgeneval/gen_images_bizgeneval.py \
--model-path sensenova/SenseNova-U1-8B-MoT-SFT \
--output-dir outputs/sensenova/bizgeneval \
--cfg-scale 4.0 --cfg-norm none --timestep-shift 3.0 --num-steps 50
# 2) Judge
python evaluation/gen/bizgeneval/eval_images_bizgeneval.py \
--image-dir outputs/sensenova/bizgeneval \
--output-dir outputs/sensenova/bizgeneval_eval \
--api-base http://your-api-base/v1 \
--api-key sk-... \
--judge-model gemini-3-pro-preview \
--concurrency 8
```
Prompts are loaded from [`bizgeneval/data/test.jsonl`](../gen/bizgeneval/data/test.jsonl).
The summary (per-item scores + aggregate) is written under `--output-dir`.
## IGenBench
General-purpose T2I benchmark with direct image-question judging.
Prepare the IGenBench metadata from
[`Brookseeworld/IGenBench-Dataset`](https://huggingface.co/datasets/Brookseeworld/IGenBench-Dataset/tree/main)
and place the per-item JSON files under
[`igenbench/data/`](../gen/igenbench/data/). The scripts read those JSON files
directly for both generation prompts and evaluation questions.
```bash
bash evaluation/gen/igenbench/run_igenbench.sh
```
Manual:
```bash
python evaluation/gen/igenbench/gen_images_igenbench.py \
--model-path sensenova/SenseNova-U1-8B-MoT-SFT \
--output-dir outputs/sensenova/igenbench \
--cfg-scale 4.0 --cfg-norm none --timestep-shift 3.0 --num-steps 50
python evaluation/gen/igenbench/eval_images_igenbench.py \
--image-dir outputs/sensenova/igenbench \
--output-dir outputs/sensenova/igenbench_eval \
--api-base http://your-api-base/v1 \
--api-key sk-... \
--judge-model gemini-3-pro-preview \
--concurrency 128
```
Set `--gen-model-name` to tag the judgments with a custom identifier (useful
when comparing multiple generators under the same `--output-dir`).
## LongText
Long-text rendering benchmark, run separately for English (`--lang en`) and
Chinese (`--lang zh`). The launcher executes both passes back to back:
```bash
bash evaluation/gen/longtext/run_longtext.sh
```
Manual (single language):
```bash
python evaluation/gen/longtext/gen_images_longtext.py \
--model-path sensenova/SenseNova-U1-8B-MoT-SFT \
--output-dir outputs/longtext/en \
--lang en \
--cfg-scale 4.0 --cfg-norm none --timestep-shift 3.0 --num-steps 50
python evaluation/gen/longtext/eval_images_longtext.py \
--image-dir outputs/longtext/en \
--output-dir outputs/longtext/en_eval \
--mode en
```
Evaluation runs OCR + text-match locally, so no judge API is required.
Prompts live in [`longtext/data/`](../gen/longtext/data/) (`text_prompts.jsonl`
for `en`, `text_prompts_zh.jsonl` for `zh`).
## CVTG-2K
Complex visual text generation at 2K resolution, evaluated with the
in-tree [`unified_metrics_eval.py`](../gen/cvtg/unified_metrics_eval.py)
script (PaddleOCR-based word accuracy + unified visual-text metrics).
Generation runs as a single Python process with the model sharded across
visible GPUs via HuggingFace `device_map`.
```bash
bash evaluation/gen/cvtg/run_cvtgeval.sh
```
Prepare the CVTG-2K data from
[`dnkdnk/CVTG-2K`](https://huggingface.co/datasets/dnkdnk/CVTG-2K)
and place it under [`cvtg/data/`](../gen/cvtg/data/). The LAION
aesthetic-predictor head
[`sa_0_4_vit_l_14_linear.pth`](../gen/cvtg/sa_0_4_vit_l_14_linear.pth)
sits next to the eval script.
Common overrides (set as env vars before the launcher):
| Variable | Default | Description |
| :------- | :------ | :---------- |
| `MODEL_PATH` | `sensenova/SenseNova-U1-8B-MoT-SFT` | Local checkpoint path or HF model id |
| `BENCHMARK_ROOT` | `evaluation/gen/cvtg/data` | CVTG-2K dataset root |
| `OUTPUT_DIR` | `<repo>/outputs/sensenova/cvtg` | Generated-image + results dir |
| `PADDLEOCR_SOURCE_DIR` | — | Pre-downloaded PaddleOCR cache (copied to `$HOME/.paddleocr` if missing) |
| `IMAGE_SIZE` / `CFG_SCALE` / `TIMESTEP_SHIFT` / `NUM_STEPS` | `2048` / `7.0` / `1.0` / `50` | Sampling config |
| `SAVE_SIZE` | unset (= `IMAGE_SIZE`) | Downsample with LANCZOS to this resolution before writing PNGs. Set to `1024` to use the "generate at 2048, evaluate at 1024" protocol. |
| `CVTG_SUBSETS` / `CVTG_AREAS` | `CVTG,CVTG-Style` / `2,3,4,5` | Which splits to run |
| `CUDA_VISIBLE_DEVICES` | `0,1,2,3,4,5,6,7` | GPUs available for model sharding |
| `DEVICE_MAP` / `MAX_MEMORY_PER_GPU_GB` | `auto` / `70` | HF `device_map` strategy and per-GPU memory cap |
| `RUN_GENERATION` / `RUN_EVAL` | `1` / `1` | Stage toggles |
Example — generation only:
```bash
RUN_EVAL=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
bash evaluation/gen/cvtg/run_cvtgeval.sh
```
Generated images land under `$OUTPUT_DIR/<subset>/<area>/<key>.png`, and the
aggregated metrics are written to `$OUTPUT_DIR/CVTG_results.json`. Re-runs
skip samples whose output PNG already exists, so an interrupted run can be
resumed by simply re-invoking the launcher.
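The resume behaviour described above can be pictured with a short sketch. It assumes only the documented layout `$OUTPUT_DIR/<subset>/<area>/<key>.png`; the function name `pending_samples` and its arguments are hypothetical and are not taken from `eval_cvtg.py`.

```python
from pathlib import Path


def pending_samples(output_dir, subset, area, keys):
    """Return the keys whose PNG is not yet present under output_dir.

    Hypothetical helper: mirrors the documented path layout
    <output_dir>/<subset>/<area>/<key>.png, nothing more.
    """
    target_dir = Path(output_dir) / subset / str(area)
    return [key for key in keys if not (target_dir / f"{key}.png").exists()]
```

A check like this is also a quick way to see how much of an interrupted run is left, for example `pending_samples("outputs/sensenova/cvtg", "CVTG", 2, keys)` with `keys` read from the dataset JSON.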
## TIIF-Bench
Text-image instruction following benchmark, evaluated with a GPT-4o-class
judge via the in-tree
[`eval/eval_with_vlm_mp.py`](../gen/tiif/eval/eval_with_vlm_mp.py).
```bash
API_KEY=sk-... \
bash evaluation/gen/tiif/run_tiifeval.sh
```
Prepare the TIIF-Bench data from
[`A113N-W3I/TIIF-Bench`](https://github.com/A113N-W3I/TIIF-Bench)
and place the prompts under
[`tiif/data/`](../gen/tiif/data/). The three eval helper scripts live under
[`tiif/eval/`](../gen/tiif/eval/).
Required / common overrides:
| Variable | Default | Description |
| :------- | :------ | :---------- |
| `MODEL_PATH` | `sensenova/SenseNova-U1-8B-MoT-SFT` | Local checkpoint path or HF model id |
| `OUTPUT_DIR` | `<repo>/outputs/sensenova/tiif` | Generated-image + results dir |
| `TIIFBENCH_SPLIT` | `testmini` | Which split to run (`testmini` / `test`) |
| `TIIFBENCH_EVAL_MODEL` | `gpt-4o` | Judge model |
| `API_KEY` (+ optional `TIIFBENCH_AZURE_ENDPOINT` / `TIIFBENCH_API_VERSION`) | — | Judge API credentials |
| `IMAGE_SIZE` / `CFG_SCALE` / `CFG_NORM` / `TIMESTEP_SHIFT` / `NUM_STEPS` | `1024` / `4.0` / `global` / `3.0` / `50` | Sampling config |
| `SAVE_SIZE` | unset (= `IMAGE_SIZE`) | Downsample with LANCZOS to this resolution before writing PNGs. Set to `1024` (with `IMAGE_SIZE=2048`) to use the "generate at 2048, evaluate at 1024" protocol. |
| `GPUS` / `CUDA_VISIBLE_DEVICES` | `8` / `0..7` | GPU layout (generation uses `torchrun`) |
| `NUM_NODES` / `NODE_RANK` | `1` / `0` | Multi-node sharding (eval runs only on node 0) |
| `RUN_GENERATION` / `RUN_EVAL` | `1` / `1` | Stage toggles |
Example — single-node generation + eval against an Azure OpenAI endpoint:
```bash
API_KEY=sk-... \
TIIFBENCH_AZURE_ENDPOINT=https://your-endpoint.openai.azure.com \
MODEL_PATH=/path/to/checkpoint \
bash evaluation/gen/tiif/run_tiifeval.sh
```
Per-question judgments are written to `$OUTPUT_DIR/tiifbench-<split>_results/eval_json/`,
with a dimension-level summary in `result_summary_dimension.txt` next to it.
## Tips
- **Sampling config.** Defaults mirror the values used in the SenseNova-U1
tech report. CVTG-2K in particular expects 2048-pixel outputs — lower
resolutions will not be comparable.
- **Judge APIs.** All API-based evaluators accept any OpenAI-compatible
endpoint — point them at SenseNova, Gemini (OpenAI-compat), Azure OpenAI,
or a local vLLM / sglang server as needed.
- **Multi-GPU.** `run_tiifeval.sh` uses DDP (`torchrun --nproc_per_node`)
with one full model replica per GPU. `run_cvtgeval.sh` instead shards a
single model across GPUs via HF `device_map="auto"` — preferred when one
GPU cannot hold the whole model. To scale further, run multiple
invocations with disjoint output directories and merge the results.
- **Re-evaluation.** `eval_images_bizgeneval.py` / `eval_images_igenbench.py`
skip items whose judgments already exist in `--output-dir`. Pass
`--force-rerun` to ignore the cache.
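For the judge-API tip, it can help to see what a request to an OpenAI-compatible endpoint typically looks like. The sketch below builds a chat-completions body with one inline image using the standard `image_url` content part; the function name `build_judge_request` and the example question are hypothetical, and the prompts the in-tree evaluators actually send are not shown here.

```python
import base64
import json


def build_judge_request(model, question, image_bytes, mime="image/png"):
    """Build an OpenAI-style chat-completions body with one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real PNG file read from disk.
body = build_judge_request("gemini-3-pro-preview", "Is the chart title legible?", b"\x89PNG")
print(json.dumps(body)[:80])
```

Any server that accepts this body at `<api-base>/chat/completions` should be usable as a judge, subject to the model actually supporting image input.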
# Interleaved Generation Evaluation
Reproduction guide for SenseNova-U1 on interleaved generation benchmarks.
The benchmark scripts live under `evaluation/interleave/`:
- `BabyVision/` — API-based multimodal understanding with answer extraction and judge scoring
- `OpenING/` — local-model interleaved generation with GPT-based judging
- `Unimmmu/` — local-model interleaved generation with external score computation
- `Realunify/` — local-model interleaved generation for GEU and UEG
## 1. Overview
```
┌───────────────────────┐
│ evaluation/interleave │
└───────────┬───────────┘
            ├── BabyVision ── API inference   ── extract / judge ── aggregate score
            ├── OpenING    ── local inference ── GPT judge       ── summarize
            ├── Unimmmu    ── local inference ── external scorer
            └── RealUnify  ── local inference ── rule / judge scoring
```
1. `BabyVision` sends requests to one or more `/generate` endpoints and writes JSONL predictions.
2. `OpenING`, `Unimmmu`, and `RealUnify` load the model locally through `transformers`.
3. Some benchmarks have a separate judge or score-aggregation step after inference.
4. Most scripts support resume-friendly reruns through existing outputs, explicit `--resume`, or shard merging.
All commands below assume:
```bash
cd evaluation/interleave
```
## 2. Benchmark Matrix
| Benchmark | Inference backend | Evaluation backend | Primary outputs |
| --- | --- | --- | --- |
| `BabyVision` | HTTP `/generate` API | extraction + LLM judge | `babyvision_<model>.jsonl`, `*_eval.jsonl` |
| `OpenING` | local model | GPT judge + CSV summary | per-sample JSON, generated images, judge JSON |
| `Unimmmu` | local model | external scorer | `unimmmu_results.jsonl`, generated images |
| `RealUnify (GEU)` | local model | rule-based scorer | `realunify_results.jsonl`, score JSON |
| `RealUnify (UEG)` | local model | user-provided judge wrapper | `ueg_results.json[l]` |
For local-model benchmarks, pass the real dataset path explicitly instead of relying on placeholder defaults such as `<DATA_ROOT>/...`.
## 3. BabyVision
`BabyVision` is the API-backed benchmark in this suite. The typical flow is inference first, then extraction and judge scoring, then score aggregation.
Reference inference command:
```bash
python3 BabyVision/infer_babyvision.py \
--model-name local-model \
--data-path /path/to/meta_data.jsonl \
--image-root /path/to/babyvision_images \
--output-dir ./output/babyvision_understand \
--generate-urls http://127.0.0.1:8000/generate \
--workers 32 \
--max-retries 3 \
--backend-max-retries 20 \
--request-timeout 600 \
--max-new-tokens 32768 \
--no-do-sample \
--temperature 0 \
--top-p 0.95 \
--repetition-penalty 1.05 \
--min-pixels 262144 \
--max-pixels 4194304
```
| Argument | Meaning |
| --- | --- |
| `--data-path` | Path to `meta_data.jsonl`. |
| `--image-root` | Root directory used to resolve sample image paths. |
| `--generate-urls` | One or more `/generate` endpoints, comma-separated. |
| `--workers` | Concurrent request workers. |
| `--max-retries`, `--backend-max-retries` | Retry budget on the sample side and backend-request side. |
| `--request-timeout` | Per-request timeout in seconds. |
| `--min-pixels`, `--max-pixels` | Image preprocessing bounds. |
Inference writes `babyvision_<model_name>.jsonl`. Completed `taskId`s are skipped automatically on rerun.
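The skip-on-rerun behaviour can be sketched as follows. This is an illustration under assumptions, not the code in `infer_babyvision.py`: it assumes each output line is a JSON object carrying a `taskId` field, and the helper name `load_completed_task_ids` is hypothetical.

```python
import json


def load_completed_task_ids(jsonl_path):
    """Collect taskId values from an existing predictions file, if any."""
    done = set()
    try:
        with open(jsonl_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    # Tolerate a truncated last line left by an interrupted run.
                    continue
                if isinstance(record, dict) and "taskId" in record:
                    done.add(record["taskId"])
    except FileNotFoundError:
        pass  # First run: nothing has been completed yet.
    return done
```

Samples whose `taskId` is in the returned set would then be filtered out before requests are issued.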
Reference evaluation command:
```bash
python3 BabyVision/eval_babyvision.py \
--input ./output/babyvision_understand/babyvision_local-model.jsonl \
--output ./output/babyvision_understand/babyvision_local-model_eval.jsonl \
--endpoint https://your-judge-endpoint \
--api-key your_api_key \
--model gpt-4.1 \
--force \
--workers 16 \
--retries 3
```
`eval_babyvision.py` performs answer extraction and judge scoring. `--endpoint` and `--api-key` can also come from environment variables. Use `--force` to recompute existing records, or `--judge-only` to score records that already have `extracted_answer`.
Reference score command:
```bash
python3 BabyVision/compute_score.py \
./output/babyvision_understand/babyvision_local-model_eval.jsonl
```
The score script reports overall accuracy plus per-`type` and per-`subtype` results.
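As a rough picture of that aggregation, the sketch below groups evaluated records by a field and computes accuracy per group. The `type` and `subtype` field names come from the description above; the boolean `correct` field and both function names are assumptions made for illustration and may not match `compute_score.py`.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of records whose (assumed) `correct` field is truthy."""
    if not records:
        return 0.0
    return sum(bool(r.get("correct")) for r in records) / len(records)


def accuracy_by(records, field):
    """Accuracy per distinct value of `field` (e.g. "type" or "subtype")."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(field, "unknown")].append(record)
    return {value: overall_accuracy(group) for value, group in groups.items()}
```

Calling `accuracy_by(records, "type")` and `accuracy_by(records, "subtype")` on the parsed `*_eval.jsonl` lines would give the two breakdowns.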
## 4. OpenING
`OpenING` runs local-model interleaved generation and then scores the outputs with a GPT judge.
Reference single-GPU inference command:
```bash
python3 OpenING/infer_opening.py \
--mode opening \
--model_path /path/to/model \
--save_dir ./output/opening_interleave/opening_output \
--meta-path /path/to/OpenING-benchmark \
--data-file-name test_data.jsonl \
--think_mode think \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--timestep_shift 3.0 \
--cfg_interval 0 1.0 \
--num_steps 50 \
--max_new_tokens 4096 \
--max_generation_pixels 4194304 \
--oom_retry_max_pixels 1048576 \
--image_width 1920 \
--image_height 1088 \
--opening_step_prompt_style can_be \
--retry_short_outputs 0 \
--seed 42
```
Reference 8-shard single-node command:
```bash
mkdir -p logs
for LOCAL_RANK in 0 1 2 3 4 5 6 7; do
echo "Starting shard ${LOCAL_RANK} on GPU ${LOCAL_RANK}"
CUDA_VISIBLE_DEVICES=${LOCAL_RANK} python3 OpenING/infer_opening.py \
--mode opening \
--model_path /path/to/model \
--save_dir ./output/opening_interleave/opening_output \
--meta-path /path/to/OpenING-benchmark \
--data-file-name test_data.jsonl \
--think_mode think \
--num_shards 8 \
--shard_index ${LOCAL_RANK} \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--timestep_shift 3.0 \
--cfg_interval 0 1.0 \
--num_steps 50 \
--max_new_tokens 4096 \
--max_generation_pixels 4194304 \
--oom_retry_max_pixels 1048576 \
--image_width 1920 \
--image_height 1088 \
--opening_step_prompt_style can_be \
--retry_short_outputs 0 \
--seed 42 \
> logs/opening_shard${LOCAL_RANK}.log 2>&1 &
done
wait
```
| Argument | Meaning |
| --- | --- |
| `--model_path` | Local model path. |
| `--meta-path` | Dataset root. |
| `--data-file-name` | Dataset JSONL file under the benchmark root. |
| `--save_dir` | Output directory for per-sample JSON and images. |
| `--think_mode` | `think`, `no_think`, or both. |
| `--cfg_interval`, `--max_generation_pixels`, `--oom_retry_max_pixels` | Main generation and OOM-retry controls. |
| `--num_shards`, `--shard_index` | Manual sharding for multi-process runs. |
Each sample is saved as `<save_dir>/<total_uid>.json`, and generated images use names such as `<save_dir>/<total_uid>-o-0.jpg`.
Reference judge command:
```bash
export OPENING_JUDGE_BASE_URL=http://127.0.0.1:8000
export OPENING_JUDGE_API_KEY=your_api_key
python3 OpenING/eval_opening.py \
--mode output_dir \
--opening_root /path/to/OpenING \
--output_dir ./output/opening_interleave/opening_output \
--output_file /path/to/OpenING/gpt-score_results_opening_output.json \
--workers 4 \
--save_every 10
```
`eval_opening.py` supports both a single model-output directory and a parent directory containing multiple outputs. Existing judge results are reused by default; `--retry_invalid_scores` retries only malformed score records.
Reference summary command:
```bash
python3 OpenING/summarize_GPT_scores.py \
--input_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_opening_output.json \
--output_csv /path/to/OpenING/Interleaved_Arena/model_score_summaries.csv \
--filtered_json /path/to/OpenING/Interleaved_Arena/gpt-score_results_filtered.json
```
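This step converts raw judge results into a comparison-friendly CSV and can optionally emit a filtered JSON with invalid scores removed. `--input_json` must point at the file the judge step wrote via `--output_file`; the two reference commands use different parent directories (`/path/to/OpenING/` and `/path/to/OpenING/Interleaved_Arena/`), so adjust one of them to match.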
## 5. Unimmmu
`Unimmmu` supports both understanding-only and interleaved generation paths, but the interleaved mode is the one covered here.
Reference inference command:
```bash
python3 Unimmmu/inference_unimmmu.py \
--model_path /path/to/model \
--data_path /path/to/unimmmu_direct.jsonl \
--output_dir ./output/unimmmu_interleave \
--inference_mode interleave \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--num_steps 50
```
Reference multi-GPU command:
```bash
torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \
--model_path /path/to/model \
--data_path /path/to/unimmmu_direct.jsonl \
--output_dir ./output/unimmmu_interleave \
--inference_mode interleave \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--num_steps 50
```
Reference `device_map` command:
```bash
python3 Unimmmu/inference_unimmmu.py \
--model_path /path/to/model \
--data_path /path/to/unimmmu_direct.jsonl \
--output_dir ./output/unimmmu_interleave \
--inference_mode interleave \
--device_map auto \
--max_memory_per_gpu_gb 60 \
--cfg_scale 4.0 \
--num_steps 50
```
Reference shard merge command:
```bash
python3 Unimmmu/merge_shards.py \
--data_path /path/to/unimmmu_direct.jsonl \
--shard_dir ./output/unimmmu_interleave/shards \
--output_file ./output/unimmmu_interleave/unimmmu_results.jsonl
```
| Argument | Meaning |
| --- | --- |
| `--inference_mode` | Use `interleave` for the benchmark covered here. |
| `--data_path` | Benchmark JSONL path. |
| `--output_dir` | Root directory for JSONL results and generated images. |
| `--resume` | Skip completed `hash_uid`s. |
| `--num_shards`, `--shard_rank` | Manual data sharding. |
| `--device_map auto` | Single-process multi-GPU loading via Hugging Face. |
The main output file is `unimmmu_results.jsonl`. Interleaved images are written under `<output_dir>/images/<task>/`. In the current implementation, resume is applied before shard selection, so rerunning one shard is most reliable after deleting that shard's outputs and rerunning without `--resume`.
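The interaction between resume and sharding is easier to see with a toy example. The sketch assumes round-robin sharding, which may not be how `inference_unimmmu.py` partitions the data, and all function names are hypothetical; the point is only the order of the two filters.

```python
def shard(items, num_shards, shard_rank):
    """Round-robin partition (assumed scheme, for illustration only)."""
    return items[shard_rank::num_shards]


def resume_then_shard(items, done, num_shards, shard_rank):
    """Order described above: drop completed items first, then shard."""
    remaining = [item for item in items if item not in done]
    return shard(remaining, num_shards, shard_rank)


def shard_then_resume(items, done, num_shards, shard_rank):
    """Alternative order: shard first, then drop completed items."""
    return [item for item in shard(items, num_shards, shard_rank) if item not in done]


items = ["u0", "u1", "u2", "u3", "u4", "u5"]
# Shard 0 finished (u0, u2, u4); shard 1 crashed after u1.
done = {"u0", "u2", "u4", "u1"}
print(resume_then_shard(items, done, 2, 1))  # shard 1 no longer sees u3
print(shard_then_resume(items, done, 2, 1))  # shard 1 sees both missing items
```

With resume applied first, the surviving items are re-partitioned, so a rerun of shard 1 alone can miss samples that originally belonged to it. Deleting the shard's outputs and rerunning without `--resume` sidesteps this.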
Reference score command:
```bash
python3 Unimmmu/calculate_score.py \
--input_file ./output/unimmmu_interleave/unimmmu_results.jsonl \
--output_dir ./output/unimmmu_interleave/scores \
--benchmark_path /path/to/image_text_agent
```
`calculate_score.py` delegates the actual scoring logic to the external benchmark repository pointed to by `--benchmark_path`.
## 6. RealUnify (GEU)
The GEU script supports both `step` and `interleave` modes. The interleaved path is the main one for SenseNova-U1 benchmarking.
Reference inference command:
```bash
python3 Realunify/inference_realunify.py \
--model_path /path/to/model \
--data_path /path/to/GEU_step_processed.jsonl \
--output_dir ./output/realunify_interleave \
--inference_mode interleave \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--num_steps 50
```
Reference multi-GPU command:
```bash
torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \
--model_path /path/to/model \
--data_path /path/to/GEU_step_processed.jsonl \
--output_dir ./output/realunify_interleave \
--inference_mode interleave \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--num_steps 50
```
Reference `device_map` command:
```bash
python3 Realunify/inference_realunify.py \
--model_path /path/to/model \
--data_path /path/to/GEU_step_processed.jsonl \
--output_dir ./output/realunify_interleave \
--inference_mode interleave \
--device_map auto \
--max_memory_per_gpu_gb 60 \
--cfg_scale 4.0 \
--num_steps 50
```
Reference shard merge command:
```bash
python3 Realunify/merge_shards.py \
--data_path /path/to/GEU_step_processed.jsonl \
--shard_dir ./output/realunify_interleave/shards \
--output_file ./output/realunify_interleave/realunify_results.jsonl
```
The main result file is `realunify_results.jsonl`. In `step` mode, `generated_image` stores `[input_image, edited_image]`; in `interleave` mode, the generated sequence is stored under `generated_images`. As with `Unimmmu`, resume currently happens before manual shard selection.
If you want a fixed output image size, pass `--target_image_size 1024`.
Reference score command:
```bash
python3 Realunify/calculate_score.py \
--input_file ./output/realunify_interleave/realunify_results.jsonl \
--output_file ./output/realunify_interleave/realunify_scores.json
```
The scorer first tries to extract the answer from `<answer>...</answer>` and otherwise falls back to the first `A/B/C/D` letter found in `model_response`.
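A minimal sketch of that two-step extraction is shown below. It follows the description literally and is not copied from `calculate_score.py`; the function name `extract_choice`, the case-insensitive tag match, and the whitespace stripping are assumptions.

```python
import re

# Assumed patterns: tag content may span lines; choices are capital A-D.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)
CHOICE_LETTER = re.compile(r"[ABCD]")


def extract_choice(model_response):
    """Prefer the <answer> tag; otherwise take the first A/B/C/D letter."""
    tagged = ANSWER_TAG.search(model_response)
    if tagged:
        return tagged.group(1).strip()
    fallback = CHOICE_LETTER.search(model_response)
    return fallback.group(0) if fallback else None
```

Note that a fallback of this kind also matches capital letters inside ordinary words (the `A` in `Answer`, for instance), so responses without an `<answer>` tag are worth spot-checking.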
## 7. RealUnify (UEG)
The UEG script exposes `understand_t2i`, `interleave`, and `t2i` inference modes.
Reference `understand_t2i` command:
```bash
python3 Realunify/inference_realunify_ueg.py \
--model_path /path/to/model \
--data_path /path/to/UEG_step.json \
--output_dir ./output/ueg_understand_t2i \
--inference_mode understand_t2i \
--cfg_scale 4.0 \
--num_steps 50
```
Reference `interleave` command:
```bash
python3 Realunify/inference_realunify_ueg.py \
--model_path /path/to/model \
--data_path /path/to/UEG_step.json \
--output_dir ./output/ueg_interleave \
--inference_mode interleave \
--cfg_scale 4.0 \
--timestep_shift 3.0 \
--num_steps 50
```
Reference `t2i` command:
```bash
python3 Realunify/inference_realunify_ueg.py \
--model_path /path/to/model \
--data_path /path/to/UEG_step.json \
--output_dir ./output/ueg_t2i \
--inference_mode t2i \
--cfg_scale 4.0 \
--num_steps 50
```
Unlike the GEU script, this one does not expose manual `--num_shards` or `--shard_rank` flags. Multi-process splitting relies on the distributed rank provided by `torchrun`. The output preserves generated image paths together with the follow-up `question_list`.
Reference score command:
```bash
python3 Realunify/calculate_score_ueg.py \
--input_file ./output/ueg_interleave/ueg_results.json
```
`calculate_score_ueg.py` is only a scaffold in the current repository. It expects a user-provided `GeminiAPI` judge wrapper and otherwise raises `NotImplementedError`.
## 8. Running the Evaluation
A typical local-model evaluation flow looks like this:
```bash
MODEL_PATH=/path/to/hf_model
torchrun --nproc_per_node=2 --master_port=29501 Realunify/inference_realunify.py \
--model_path ${MODEL_PATH} \
--data_path /path/to/GEU_step_processed.jsonl \
--output_dir ./output/realunify_interleave \
--inference_mode interleave \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--num_steps 50
python3 Realunify/inference_realunify_ueg.py \
--model_path ${MODEL_PATH} \
--data_path /path/to/UEG_step.json \
--output_dir ./output/ueg_understand_t2i \
--inference_mode understand_t2i \
--cfg_scale 4.0 \
--num_steps 50
torchrun --nproc_per_node=2 --master_port=29503 Unimmmu/inference_unimmmu.py \
--model_path ${MODEL_PATH} \
--data_path /path/to/unimmmu_direct.jsonl \
--output_dir ./output/unimmmu_interleave \
--inference_mode interleave \
--cfg_scale 4.0 \
--img_cfg_scale 1.0 \
--num_steps 50
```
`RealUnify (GEU)`, `RealUnify (UEG)`, and `Unimmmu` are independent and can run in parallel. `BabyVision` and `OpenING` each have their own inference-plus-evaluation pipeline as described above.
## 9. Troubleshooting
- Dataset file not found: check `--data_path`, or for `OpenING`, verify `--meta-path` and `--data-file-name`.
- A path still contains `<DATA_ROOT>`: the script is using a placeholder default; pass the real path explicitly.
- Samples are unexpectedly skipped: outputs already exist; review `--resume`, `--overwrite`, or shard outputs.
- Rerunning a single shard gives odd behavior: for `Unimmmu` and `RealUnify (GEU)`, delete that shard's outputs first and rerun without `--resume`.
- UEG scoring fails immediately: `calculate_score_ueg.py` needs a user-supplied `GeminiAPI` wrapper.
# Visual Understanding Evaluation
Reproduction guide for SenseNova-U1 on visual understanding benchmarks.
The pipeline is built on top of [EvalScope](https://github.com/OpenSenseNova/evalscope/tree/neo) (Native backend). EvalScope calls the model through an OpenAI-compatible HTTP endpoint and, for open-ended benchmarks, scores the predictions with an LLM judge.
Reference config and launcher live under `evaluation/understanding/`:
- `evaluation/understanding/config.yaml` — evaluation configuration
- `evaluation/understanding/es.py` — single-entry launcher
## 1. Overview
```
┌──────────────┐     OpenAI-compatible     ┌─────────────┐
│    es.py     │ ───── HTTP requests ────▶ │  Model API  │
│ (EvalScope)  │                           │ (lightllm)  │
└──────┬───────┘ ◀──── generations ─────── └─────────────┘
       results/ (predictions, judge scores, aggregated metrics)
```
1. Deploy SenseNova-U1 behind an OpenAI-compatible endpoint (the reference setup uses lightllm).
2. Fill in `config.yaml` with the endpoint, model name, datasets, and generation parameters.
3. Run `python es.py` — it calls `evalscope.run.run_task(task_cfg="config.yaml")`, which loops over the datasets, issues requests in parallel, and writes predictions plus scores under `results/`.
## 2. Launcher
`evaluation/understanding/es.py` is deliberately tiny:
```python
from evalscope.run import run_task
run_task(task_cfg="config.yaml")
```
Everything else is driven by `config.yaml`.
## 3. Benchmarks
The reference run evaluates on:
- `mmmu_pro`
- `mmlu_pro`
- `mm_bench`
- `ai2d`
- `math_vista`
- `ifeval`
Add or remove items under `datasets:` to extend the evaluation.
## 4. Main Generation Parameters
These are the parameters the model is sampled with. They live under `generation_config:` in `config.yaml` and are forwarded to the OpenAI-compatible API.
| Parameter | Value | Meaning |
| --- | --- | --- |
| `stream` | `false` | Return the full response in one shot; simpler to score and log. |
| `temperature` | `0.6` | Sampling temperature — recommended setting for thinking-enabled models. |
| `top_p` | `0.95` | Nucleus sampling cutoff; used together with `top_k`. |
| `max_tokens` | `32768` | Upper bound on generated tokens per sample. Large because `<think>…</think>` traces can be long. |
| `timeout` | `300` | Per-request timeout in seconds. |
| `extra_body.top_k` | `20` | Restrict sampling to the top-20 tokens at each step. |
| `extra_body.repetition_penalty` | `1.05` | Mild penalty to suppress loops in long reasoning traces. |
| `extra_body.chat_template_kwargs.enable_thinking` | `true` | Let the chat template emit a `<think>…</think>` section before the final answer. |
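To make the `extra_body.*` rows concrete: in OpenAI-style clients, fields under `extra_body` are merged into the top level of the JSON request, which is how non-standard options such as `top_k` reach the server. The sketch below imitates that merge; it is not EvalScope code, `build_request_body` is a hypothetical name, and treating `timeout` as a client-side setting that stays out of the body is an assumption.

```python
def build_request_body(model, messages, generation_config):
    """Flatten a generation_config mapping into a chat-completions body."""
    config = dict(generation_config)       # shallow copy; caller's dict untouched
    extra = config.pop("extra_body", {})   # server-specific extensions
    config.pop("timeout", None)            # assumed client-side, not sent in JSON
    return {"model": model, "messages": messages, **config, **extra}


generation_config = {
    "stream": False,
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 32768,
    "timeout": 300,
    "extra_body": {
        "top_k": 20,
        "repetition_penalty": 1.05,
        "chat_template_kwargs": {"enable_thinking": True},
    },
}
body = build_request_body(
    "SenseNova-U1", [{"role": "user", "content": "hi"}], generation_config
)
print(sorted(body))
```

If a serving stack ignores one of these keys, the run still completes but samples with different settings than the table suggests, so checking the server log for the received parameters is worthwhile.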
Post-processing on the prediction:
- `dataset_args.remove_until: </think>` — everything up to and including the closing `</think>` tag is stripped before grading, so only the final answer is scored.
- `ignore_errors: true` — transient single-sample API failures do not abort the whole run.
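The effect of `remove_until` can be approximated with a few lines of string handling. This is a sketch of the behaviour as described, not EvalScope's implementation; cutting at the first occurrence of the marker and trimming leading whitespace are assumptions.

```python
def remove_until(text, marker="</think>"):
    """Drop everything up to and including `marker`; keep text unchanged if absent."""
    _, found, tail = text.partition(marker)
    return tail.lstrip() if found else text
```

For example, `remove_until("<think>compare options</think>\nB")` leaves just `B` for the grader, while a response without a thinking trace passes through untouched.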
## 5. Judge Model
Open-ended benchmarks are scored by an LLM judge.
| Field | Value |
| --- | --- |
| `judge_worker_num` | `64` (parallel judge calls) |
| `judge_model_args.model_id` | `gpt-4o-mini-2024-07-18` |
| `judge_model_args.api_key` | *(fill in)* |
| `judge_model_args.api_url` | *(fill in — OpenAI-compatible judge endpoint)* |
| `judge_model_args.generation_config.max_tokens` | `4096` |
| `judge_model_args.generation_config.timeout` | `300` |
The reference `config.yaml` leaves the judge `api_key` / `api_url` blank; fill them before running judge-dependent tasks.
## 6. Runtime Settings
| Field | Value | Meaning |
| --- | --- | --- |
| `eval_backend` | `Native` | EvalScope native backend. |
| `eval_type` | `openai_api` | Drive the model through an OpenAI-compatible endpoint. |
| `eval_batch_size` | `64` | In-flight concurrent requests sent to the model server. |
| `api_url` | `http://<host>:8000/v1/` | OpenAI-compatible serving endpoint (lightllm in the reference setup). |
| `model` | `SenseNova-U1` | Model name as exposed by the serving endpoint. |
| `use_cache` | `results/` | Reuse previously generated answers — supports resume / retry. |
| `work_dir` | `results/` | Output root for predictions, judgments, and scores. |
| `no_timestamp` | `true` | Write into a stable directory (plays well with `use_cache`). |
## 7. Reference `config.yaml`
```yaml
eval_backend: Native
eval_type: openai_api
eval_batch_size: 64
api_url: http://<host>:8000/v1/ # lightllm deployment
model: SenseNova-U1
datasets:
- mmmu_pro
- mmlu_pro
- mm_bench
- ai2d
- math_vista
- ifeval
dataset_args:
remove_until: </think>
ignore_errors: true
generation_config:
stream: false
temperature: 0.6
timeout: 300
max_tokens: 32768
top_p: 0.95
extra_body:
top_k: 20
repetition_penalty: 1.05
chat_template_kwargs:
enable_thinking: true
judge_worker_num: 64
judge_model_args:
api_key: ""
api_url: ""
model_id: gpt-4o-mini-2024-07-18
generation_config:
max_tokens: 4096
timeout: 300
use_cache: results/
work_dir: results/
no_timestamp: true
```
## 8. Running the Evaluation
1. Deploy SenseNova-U1 on an OpenAI-compatible endpoint and confirm connectivity:
```bash
curl -sSf -m 5 "$api_url"
```
2. Edit `evaluation/understanding/config.yaml`: set `api_url`, `model`, and the judge `api_key` / `api_url` if needed.
3. Launch:
```bash
cd evaluation/understanding
python es.py
```
Predictions, judge outputs, and final scores are written under `results/`. Because `use_cache: results/` and `no_timestamp: true` are set, rerunning the command skips already-answered samples, so interrupting and resuming is safe.
# EASI Benchmarking — Full Guide
End-to-end setup to run visual-understanding benchmarks (VLMEvalKit, EASI, lmms-eval) against SenseNova-U1 by exposing the model as an **OpenAI-compatible HTTP endpoint** via the LightLLM inference server, then pointing the benchmark toolkit at `/v1/chat/completions`.
This is the comprehensive reference. For the quickstart + file layout, see [`README.md`](README.md).
This guide covers a **native (no Docker) install** on a Linux host with NVIDIA GPUs. The upstream-supported path is the Docker image `lightx2v/lightllm_lightx2v:20260407` documented in [`docs/deployment_CN.md`](../../docs/deployment_CN.md) — use that when Docker/nvidia-container-toolkit is available. This native recipe is the fallback for sandboxed environments (containers, chroot, clusters without privileged pod access).
## Why LightLLM (not `transformers`)
`examples/*/inference.py` scripts use the `transformers` backend, which is fine for one-off inference but not servable:
- `NEOChatModel` is registered only with `AutoModel` / `AutoConfig`, not with `AutoModelForImageTextToText` or an `AutoProcessor`. `transformers serve` (4.57+) and `text-generation-inference` both dispatch on those mappings — they will not discover SenseNova-U1.
- Inference uses a custom `model.chat(tokenizer, pixel_values, question, gen_cfg, grid_hw=...)` signature (`src/sensenova_u1/models/neo_unify/modeling_neo_chat.py:1732`), not the standard `processor(images, text) → model.generate()` flow that serving stacks expect.
- vLLM / SGLang have no built-in NEO-Unify model class; porting is a weeks-long task.
LightLLM has native `neo_chat` + `neo_chat_moe` model implementations (`lightllm/models/neo_chat/`, `lightllm/models/neo_chat_moe/`) and exposes an OpenAI-compatible `/v1/chat/completions`, which is exactly what benchmark toolkits consume.
**Skipped in this guide: LightX2V.** LightX2V is only imported when `--enable_multimodal_x2i` is passed (see `lightllm/server/x2i_server/manager.py`). Visual understanding benchmarks do not generate images, so we omit that flag and avoid LightX2V's `torch<=2.8.0` pin (which conflicts with LightLLM's `torch==2.9.1` requirement). Image-generation benchmarks — when those scripts land — will need the full LightLLM + LightX2V stack via the Docker image.
---
## 1) Host prerequisites
| Item | Required |
| :--- | :--- |
| OS | Linux (x86_64) |
| NVIDIA driver | ≥ 550.x (recommended 550.90.07+) |
| GPU | Hopper / Ampere class with compute capability ≥ 8.0 (`sm_80` or newer). Verified on H100 80GB |
| Python | 3.10 (matches LightLLM's Dockerfile) |
| `uv` | `uv >= 0.9`. Install: <https://docs.astral.sh/uv/getting-started/installation/> |
| System lib | `libnuma1`, `libnuma-dev` (required by `sgl-kernel`) |
Install the system lib once (needs sudo):
```bash
sudo apt-get install -y libnuma1 libnuma-dev
```
The CUDA runtime ships inside the `torch==2.9.1+cu128` wheel, so you do NOT need a matching CUDA toolkit on the host as long as the host driver can run CUDA 12.8 binaries (550.x and later work through CUDA 12.x minor-version compatibility).
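The driver floor from the table can be checked by parsing what `nvidia-smi` reports. The sketch below is illustrative only: `parse_driver_version`, `driver_meets_minimum`, and `query_driver_version` are hypothetical helper names (not part of `setup.sh`), and the 550 floor is taken from the prerequisites table.

```python
import subprocess


def parse_driver_version(text: str) -> tuple[int, ...]:
    """Turn a driver string such as '550.90.07' into (550, 90, 7)."""
    first_line = text.strip().splitlines()[0]
    return tuple(int(part) for part in first_line.strip().split("."))


def driver_meets_minimum(text: str, minimum: tuple[int, ...] = (550,)) -> bool:
    """Compare the parsed version against the minimum from the table."""
    return parse_driver_version(text) >= minimum


def query_driver_version() -> str:
    """Ask nvidia-smi for the driver version (only works on an NVIDIA host)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
```

On a GPU host, `driver_meets_minimum(query_driver_version())` combines the two; on a multi-GPU machine `nvidia-smi` prints one line per GPU and the sketch reads only the first.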
---
## 2) One-shot install
`evaluation/easi/scripts/setup.sh` installs the full pipeline in one idempotent run:
| Phase | What |
| :--- | :--- |
| 1 | Host prereq checks (`uv`, `libnuma`, driver) |
| 2 | Recursive submodule init (LightLLM, EASI, EASI/VLMEvalKit, EASI/lmms-eval) |
| 3 | `.venv-lightllm/` Python 3.10 venv at repo root |
| 4-6 | Pinned LightLLM deps + vllm + editable LightLLM + transitive fixes (pandas) |
| 7 | Apply local patches from `evaluation/easi/lightllm-stack/patches/` |
| 8 | Verify LightLLM imports + api_server CLI |
| 9 | **EASI client venv** — delegates to `EASI/scripts/setup.sh` (Py 3.11, VLMEvalKit + lmms-eval + flash-attn) |
| 10 | **Endpoint registration** — sitecustomize.py injector into the EASI venv |
```bash
bash evaluation/easi/scripts/setup.sh
```
Flags:
- `--skip-lightllm` — skip phases 1, 3-7 (DON'T install LightLLM). Use when you have a SenseNova-U1 OpenAI-compatible endpoint already (docker elsewhere, remote infra, etc.). Edit `config/sensenova_models.py` to point `api_base` at your endpoint
- `--skip-easi` — skip phase 9 (flash-attn build is slow; useful for fast reruns when only touching the LightLLM side)
- `--skip-register` — skip phase 10 (no auto endpoint wiring)
Re-running is safe — each step checks whether it already ran. If you just want to understand what happens under the hood, sections 2a–2f below spell out phases 2-8.
### 2a. LightLLM submodule
LightLLM is pinned as a git submodule at `evaluation/easi/lightllm-stack/LightLLM`, tracking branch `neo_plus_clean` (the branch that contains NEO-Unify model support). `evaluation/easi/lightllm-stack/patches/` holds any local fixes applied on top.
```bash
git submodule update --init evaluation/easi/lightllm-stack/LightLLM
```
To bump to a newer LightLLM commit later:
```bash
cd evaluation/easi/lightllm-stack/LightLLM
git fetch origin neo_plus_clean && git checkout origin/neo_plus_clean
cd -
git add evaluation/easi/lightllm-stack/LightLLM # records the new submodule SHA
```
> `LightX2V` can be cloned alongside for image-generation workloads later. Unused for VQA-only serving; not included as a submodule.
### 2b. Create the serving venv
Keep this venv separate from the main `.venv` used by `examples/*/inference.py` — LightLLM pins `torch==2.9.1` while the main env uses torch 2.8.
```bash
cd /path/to/SenseNova-U1
uv venv -p 3.10 .venv-lightllm
source .venv-lightllm/bin/activate
uv pip install --upgrade pip
```
`.venv-lightllm/` is already in `.gitignore`.
### 2c. Strip non-installable pins from `requirements.txt`
The upstream `LightLLM/requirements.txt` includes two packages that fail in a clean environment:
- **`nixl==0.8.0`** — not published to PyPI; built from source inside the Dockerfile against a custom UCX build. Only used by `--run_mode nixl_prefill/nixl_decode` (PD-disaggregation over RDMA) which we are not running.
- **`cchardet==2.1.7`** — archived upstream, build fails on modern `setuptools`. Optional character-detection helper; `chardet` is pulled in transitively as a fallback.
```bash
grep -v "^nixl\|^cchardet" evaluation/easi/lightllm-stack/LightLLM/requirements.txt > /tmp/lightllm-req.txt
```
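For hosts where a scripted filter is more convenient than `grep`, the same step can be sketched in Python. `strip_pins` is a hypothetical helper name; it mirrors the `grep -v` pattern above by dropping lines that start with `nixl` or `cchardet`.

```python
def strip_pins(requirements: str, blocked: tuple[str, ...] = ("nixl", "cchardet")) -> str:
    """Drop every line that starts with one of the blocked package names."""
    kept = [
        line
        for line in requirements.splitlines()
        if not line.startswith(blocked)
    ]
    return "\n".join(kept) + "\n"
```

Because the match is anchored at the start of the line, an unrelated package such as `chardet` is kept.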
### 2d. Install
```bash
# pinned deps (torch 2.9.1+cu128, flashinfer, sgl-kernel, xformers, triton, ...)
uv pip install -r /tmp/lightllm-req.txt
# vllm is a hard runtime dep (used for shared utilities)
uv pip install --no-cache-dir vllm==0.16.0
# LightLLM itself, editable
uv pip install --no-cache-dir -e evaluation/easi/lightllm-stack/LightLLM
# transitive dep missing from upstream requirements
uv pip install pandas
```
### 2e. Apply local patches
See [§6 Known patches](#6-known-patches). `setup.sh` applies these automatically; if you installed by hand:
```bash
cd evaluation/easi/lightllm-stack/LightLLM
for p in ../patches/*.patch; do git apply "$p"; done
```
### 2f. Verify
```bash
python -c "
import torch; print('torch', torch.__version__, 'cuda', torch.cuda.is_available())
import flashinfer, sgl_kernel, xformers, vllm, lightllm
print('flashinfer', 'sgl-kernel', 'xformers ok', 'vllm', vllm.__version__)
print('lightllm ok')
"
python -m lightllm.server.api_server --help | head -5
```
Expected:
```
torch 2.9.1+cu128 cuda True
flashinfer sgl-kernel xformers ok vllm 0.16.0
lightllm ok
usage: api_server.py [-h] ...
```
### Features we skip (and when you'd re-enable them)
| Feature | Re-enable when | How |
| :--- | :--- | :--- |
| **FlashMLA** | Running DeepSeek MLA-style attention (SenseNova-U1 uses Qwen3 GQA, not MLA) | Follow `evaluation/easi/lightllm-stack/LightLLM/docker/Dockerfile:49-53` |
| **DeepEP + NVSHMEM** | Multi-node MoE with InfiniBand GPUs | `Dockerfile:78-100`; needs root for gdrcopy |
| **NIXL + custom UCX** | PD-disaggregated serving (`--run_mode nixl_*`) | `Dockerfile:102-138`; needs root, RDMA stack |
| **LightMem** | Disk KV-cache offload | `Dockerfile:60-64` |
| **LightX2V (image gen)** | Running image-generation benchmarks | Install `evaluation/easi/lightllm-stack/LightX2V` in a separate venv with `torch<=2.8.0`, or use the upstream Docker image |
---
## 3) Launch the server
Helper script: `evaluation/easi/scripts/serve.sh`. It auto-activates `.venv-lightllm`, auto-downloads model weights on first run if they are missing, and picks a per-model default port so that different models can run concurrently without clashing.
### Model → port mapping
| `MODEL` value | HF repo | Port | `--reasoning_parser` default | Advertised `model` name |
| :--- | :--- | :---: | :--- | :--- |
| `8b-mot` *(default)* | `sensenova/SenseNova-U1-8B-MoT` | 8000 | `qwen3` (strips `<think>…</think>`) | `sensenova-u1-8b-mot` |
### Defaults
```bash
# 8b-mot, GPUs 0-1, tp=2, port 8000
bash evaluation/easi/scripts/serve.sh
```
### Max throughput on 8× H100 (single model)
```bash
MODEL=8b-mot GPUS=0,1,2,3,4,5,6,7 TP=8 bash evaluation/easi/scripts/serve.sh
```
### Env vars (full list)
| Var | Default | Notes |
| :--- | :--- | :--- |
| `MODEL` | `8b-mot` | Currently only `8b-mot` is accepted. Ignored if `MODEL_DIR` is set |
| `MODEL_DIR` | `./models/SenseNova-U1-Mini-<Beta\|SFT>` | Absolute path overrides |
| `GPUS` | `0,1` | Comma-separated `CUDA_VISIBLE_DEVICES` |
| `TP` | `2` | Tensor-parallel degree; must equal `GPUS` count |
| `HOST` | `0.0.0.0` | |
| `PORT` | per-model (8000 / 8001) | Overrides the default port from the table above |
| `MAX_LEN` | `32768` | `--max_req_total_len` |
| `MEM_FRAC` | `0.85` | `--mem_fraction` — fraction of GPU mem for KV cache |
| `MODEL_NAME` | per-model | Advertised via `/v1/models`; benchmark client `model` field must match |
| `REASONING` | per-model | `--reasoning_parser`. `qwen3` for beta, disabled for sft. Set to empty string to disable on beta |
| `NO_AUTO_DL` | unset | Set to `1` to skip auto-download when model dir is missing (error out instead) |
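The `TP` row is the constraint most often violated when editing these variables by hand. A minimal sketch of that check for the single-instance case follows; `check_tp_matches_gpus` is a hypothetical name, and `serve.sh` may implement its own validation differently.

```python
def check_tp_matches_gpus(gpus: str, tp: int) -> list[str]:
    """Return the GPU id list, or raise if TP does not equal the GPU count."""
    ids = [g.strip() for g in gpus.split(",") if g.strip()]
    if tp != len(ids):
        raise ValueError(f"TP={tp} but GPUS lists {len(ids)} device(s): {ids}")
    return ids
```

For example, `check_tp_matches_gpus("0,1", 2)` passes, while `GPUS=0,1` with `TP=4` raises.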
### First-launch warmup
Triton / CUDA kernels compile on first request; the first `/v1/chat/completions` call can take **several minutes** to return. Subsequent calls are cached and fast. Health-check with:
```bash
# after startup log shows "Uvicorn running on http://0.0.0.0:8000"
curl -s http://localhost:8000/v1/models | head
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"sensenova-u1-8b-mot","messages":[{"role":"user","content":"hi"}]}' | head -c 500
```
For a proper multimodal smoke test (image + text), use [`examples/serving/client.py`](../../examples/serving/client.py):
```bash
source .venv/bin/activate # the main sensenova_u1 venv has the `requests` client deps
python examples/serving/client.py \
--mode vqa \
--prompt "Describe this image." \
--image_path examples/vqa/data/images/menu.jpg \
--url http://localhost:8000/v1
```
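For readers writing their own client, the request body for an image + text query follows the OpenAI chat-completions convention of a `content` list with an `image_url` part. The sketch below only builds the payload; `build_vqa_payload` is a hypothetical helper, and whether `examples/serving/client.py` assembles its request the same way is an assumption.

```python
import base64
import json


def build_vqa_payload(model: str, prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat request with one inline image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def to_request_body(payload: dict) -> bytes:
    """Serialize the payload for an HTTP POST to /v1/chat/completions."""
    return json.dumps(payload).encode("utf-8")
```

The `model` argument must be the name the server advertises (`sensenova-u1-8b-mot` in the table above).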
---
### Standalone weight download (if you want to pre-fetch)
`evaluation/easi/scripts/serve.sh` calls this for you automatically. To run it independently:
```bash
source .venv-lightllm/bin/activate # hf CLI lives in this venv
bash evaluation/easi/scripts/download_weights.sh 8b-mot
```
Weights land at `./models/SenseNova-U1-<Mini|Flash>-Beta/`. `./models/` is gitignored. Set `export HF_TOKEN=hf_...` if the HF repo is gated.
---
## 4) Point VLMEvalKit / EASI at the endpoint
The server speaks OpenAI `/v1/chat/completions`. Any OpenAI-compatible client works unchanged. For VLMEvalKit the `GPT4V` wrapper (`vlmeval/api/gpt.py`) is the right client class.
EASI is checked out as a submodule at `evaluation/easi/EASI` (tracking `EvolvingLMMs-Lab/EASI@main`). Its VLMEvalKit fork is at `evaluation/easi/EASI/VLMEvalKit` — nested submodule.
VLMEvalKit's `run.py --config <json>` flag is mutually exclusive with `--data` / `--model`, and EASI's `run_easi_eval.py` uses the `--data` / `--model` form. The model entry must therefore exist in `vlmeval.config.supported_VLM` at import time.
### Tracked-file wire-up (what `setup.sh` does)
Two files you care about:
| Path | Role |
| :--- | :--- |
| `evaluation/easi/config/sensenova_models.py` | **Editable source of truth** for endpoint URLs, ports, `max_tokens`, temperature, retry, etc. Tweak here, commit, done |
| `evaluation/easi/patches/easi_sensenova_config.patch` | 7-line patch that appends an import hook to `VLMEvalKit/vlmeval/config.py` |
`setup.sh` Phase 10:
1. Copies `config/sensenova_models.py` to `EASI/VLMEvalKit/vlmeval/sensenova_models.py`
2. Applies `patches/easi_sensenova_config.patch` to `EASI/VLMEvalKit/vlmeval/config.py` — adds:
```python
try:
from .sensenova_models import entries as _sensenova_u1_entries
supported_VLM.update(_sensenova_u1_entries)
except Exception as _e:
import sys; print(f"[sensenova-u1-config] failed: {_e}", file=sys.stderr)
```
Both steps are idempotent: the patch is reverse-checked before it is applied, and the copy always overwrites (cheap). Tweak the editable module, re-run `setup.sh`, done.
### Tweak workflow
```bash
# edit endpoint/port/max_tokens/temperature:
$EDITOR evaluation/easi/config/sensenova_models.py
# propagate to VLMEvalKit:
bash evaluation/easi/scripts/setup.sh --skip-easi # (--skip-easi skips the slow EASI venv check)
```
Verify:
```bash
source evaluation/easi/EASI/.venv/bin/activate
python -c 'from vlmeval.config import supported_VLM; print([k for k in supported_VLM if "SenseNova-U1-" in k])'
# ['SenseNova-U1-8B-MoT-Local']
```
### Why not edit `config.py` directly?
The VLMEvalKit submodule is pinned to a specific upstream SHA. Any direct edit becomes dirty submodule state that doesn't survive `git submodule update`. The tracked-file + patch pattern is git-friendly: edits live in the parent SenseNova-U1 repo and re-apply cleanly after submodule bumps.
### Configuring a custom OpenAI-compatible endpoint
`config/sensenova_models.py` is a plain Python module containing an `entries: dict[str, partial[GPT4V]]` top-level variable. Each entry maps a **model key** (what you pass to `run_easi_eval.py --model …`) to a partially-applied `GPT4V` client bound to an endpoint.
Template:
```python
from functools import partial
from vlmeval.api.gpt import GPT4V # type: ignore[import-not-found]
entries = {
"<YourModelName>": partial(
GPT4V,
model="<advertised-model-name>", # must match the server's `model` field
api_base="http://<host>:<port>/v1/chat/completions",
key="<api-key-or-dummy>",
temperature=0, # 0 = greedy (deterministic)
max_tokens=8192, # higher for thinking models
retry=10, # per-request retries on 5xx / timeout
verbose=False,
),
}
```
### `GPT4V` kwargs reference
| Kwarg | Type | Notes |
| :--- | :--- | :--- |
| `model` | str | Value echoed to the server in the request `model` field. Must match `--model_name` on the server side (LightLLM: `MODEL_NAME` env var; vLLM: `--served-model-name`, etc.) |
| `api_base` | str | Full path to the chat-completions endpoint, including `/v1/chat/completions`. Works for any OpenAI-compatible server (LightLLM, vLLM, SGLang, TGI, OpenRouter, Anthropic-via-openai-shim, etc.) |
| `key` | str | Bearer token. Use `"dummy"` for auth-less local servers |
| `temperature` | float | `0` for deterministic benchmarking; set > 0 if a benchmark needs sampling |
| `max_tokens` | int | Generation cap. Thinking models (e.g. SenseNova-U1-8B-MoT) need ≥ 8192 so they don't truncate mid-`<think>` |
| `top_p` | float | Nucleus sampling cutoff; default 1.0 (no trim) |
| `retry` | int | HTTP-level retries on 5xx / timeout. 10 is generous |
| `wait` | float | Seconds between retries; defaults to exponential backoff |
| `verbose` | bool | Per-request logging; leave False for benchmark runs |
| `img_detail` | `"low"` / `"high"` / `"auto"` | Passed through to `image_url.detail` — only matters for servers that honor it (GPT-4o). LightLLM ignores |
| `timeout` | int | Per-request timeout in seconds. Defaults to 60 — bump to 180+ for thinking models on slow hardware |
| `system_prompt` | str | Prepended as the system message. Leave unset unless a benchmark demands a specific persona |
Full kwarg list: `evaluation/easi/EASI/VLMEvalKit/vlmeval/api/gpt.py`.
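Two of the rules in the table (the full `api_base` path and the `max_tokens` floor for thinking models) are easy to get wrong silently. A small lint over the `entries` values can catch them before a run. This is a sketch under assumptions: `check_entry` is a hypothetical helper, and `FakeClient` stands in for `GPT4V` so the snippet runs without VLMEvalKit installed.

```python
from functools import partial


def check_entry(name: str, entry: partial) -> list[str]:
    """Return a list of warnings for one `entries` value (empty if none)."""
    kw = entry.keywords
    problems = []
    if not kw.get("api_base", "").endswith("/v1/chat/completions"):
        problems.append(f"{name}: api_base should end with /v1/chat/completions")
    if not kw.get("model"):
        problems.append(f"{name}: model is empty")
    if kw.get("max_tokens", 0) < 8192:
        problems.append(f"{name}: max_tokens below 8192 may truncate thinking output")
    return problems


class FakeClient:
    """Stand-in for GPT4V; it only records the keyword arguments it receives."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs
```

The `max_tokens` warning is advisory: an entry with thinking disabled can legitimately use a smaller cap.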
### Examples
**Local LightLLM (default)** — what `setup.sh` ships out of the box:
```python
"SenseNova-U1-8B-MoT-Local": partial(
GPT4V,
model="sensenova-u1-8b-mot",
api_base="http://localhost:8000/v1/chat/completions",
key="dummy", temperature=0, max_tokens=32768, retry=10, verbose=False,
),
```
**Remote endpoint (infra team or production)**:
```python
"SenseNova-U1-8B-MoT-Prod": partial(
GPT4V,
model="sensenova-u1-8b-mot",
api_base="https://sensenova-u1.internal.example.com/v1/chat/completions",
key="sk-your-real-token",
temperature=0, max_tokens=32768, retry=5, verbose=False,
),
```
**Endpoint that needs `enable_thinking` toggled off** (subclass pattern):
```python
from functools import partial
from vlmeval.api.gpt import GPT4V
class _SenseNovaNoThinking(GPT4V):
def generate_inner(self, inputs, **kwargs):
kwargs["chat_template_kwargs"] = {"enable_thinking": False}
return super().generate_inner(inputs, **kwargs)
entries = {
"SenseNova-U1-8B-MoT-Local-NoThink": partial(
_SenseNovaNoThinking,
model="sensenova-u1-8b-mot",
api_base="http://localhost:8000/v1/chat/completions",
key="dummy", temperature=0, max_tokens=2048, retry=10, verbose=False,
),
}
```
After any edit: `bash evaluation/easi/scripts/setup.sh --skip-lightllm --skip-easi` to propagate into `VLMEvalKit/vlmeval/sensenova_models.py`, then your new model key is available via `run_easi_eval.py --model <YourModelName>`.
### Running benchmarks
```bash
source evaluation/easi/EASI/.venv/bin/activate
cd evaluation/easi/EASI
```
**Single benchmark**:
```bash
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--output-dir eval_results_sensenova-u1-8b-mot_viewspatial \
--api-nproc 16 \
--benchmarks viewspatial
```
**Full EASI-8 suite** (omit `--benchmarks`):
```bash
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--output-dir eval_results_sensenova-u1-8b-mot \
--api-nproc 16
```
**Multiple benchmarks**:
```bash
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--benchmarks viewspatial,blink,3dsrbench \
--api-nproc 16 \
--output-dir eval_results_sensenova-u1-8b-mot
```
Benchmark keys (EASI-8): `vsi_bench, mmsi_bench, mindcube_tiny, viewspatial, site_image, site_video, blink, 3dsrbench, embspatial`. Alias `site` expands to `site_image + site_video`.
Useful flags: `--no-judge` (trust exact_matching), `--rerun` (skip resume), `--verbose` (per-sample output), `--include-extra` (add non-EASI-8 benches), `--submit` (push to leaderboard — needs `HF_TOKEN`).
`--api-nproc` = concurrent HTTP requests to LightLLM. 16-32 on 8× H100 with `tp=8`. Lower if you see timeouts / 500s.
### Plain VLMEvalKit
Same `GPT4V` wrapper pattern — just add the entry to whichever config file the standalone VLMEvalKit uses, then run `python run.py --model <key> --data <bench>`.
### Thinking-mode toggle
If a benchmark requires disabling (or forcing) the `<think>…</think>` reasoning block, subclass `GPT4V` and inject `chat_template_kwargs` into the payload. Pattern already used for Qwen3.5-VL at `evaluation/easi/EASI/VLMEvalKit/vlmeval/config.py:127-137`:
```python
class _SenseNovaNoThinking(GPT4V):
def generate_inner(self, inputs, **kwargs):
kwargs["chat_template_kwargs"] = {"enable_thinking": False}
return super().generate_inner(inputs, **kwargs)
```
### `max_tokens` and thinking mode
When thinking is on (the default), the `<think>...</think>` block alone can run to thousands of tokens. If generation hits `max_tokens` before `</think>`, the reasoning parser returns an **empty `content`**: all the text is trapped in `reasoning_content` and the final answer is never emitted. In VLMEvalKit this shows up as blank model outputs or parser failures.
- For thinking-mode benchmarks: set `max_tokens >= 8192` in the `GPT4V` partial, or higher for multi-hop reasoning benches. The template above uses `max_tokens=8192`; the shipped `SenseNova-U1-8B-MoT-Local` entry uses `32768`.
- For benchmarks that don't need reasoning: disable thinking via the `_SenseNovaNoThinking` subclass pattern and drop `max_tokens` back to `2048` to save latency.
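When debugging blank outputs, it can help to test raw responses for this failure mode. The sketch below assumes the server follows the OpenAI response shape and reports `finish_reason == "length"` when the cap is hit; `truncated_mid_think` is a hypothetical helper name.

```python
def truncated_mid_think(response: dict) -> bool:
    """True when generation hit the token cap before any answer text was emitted."""
    choice = response["choices"][0]
    message = choice.get("message", {})
    hit_cap = choice.get("finish_reason") == "length"
    empty_answer = not (message.get("content") or "").strip()
    has_reasoning = bool((message.get("reasoning_content") or "").strip())
    return hit_cap and empty_answer and has_reasoning
```

A `True` result suggests raising `max_tokens` or disabling thinking for that benchmark.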
---
## 5) Troubleshooting
### `libnuma.so.1: cannot open shared object file`
`sgl_kernel` dynamic dep. Install system lib:
```bash
sudo apt-get install -y libnuma1 libnuma-dev
```
### `nixl` build fails / `cchardet` build fails
Strip them from the requirements file as shown in §2c. Neither is needed for single-node serving. `setup.sh` already does this.
### `ModuleNotFoundError: No module named 'pandas'`
Transitive dep of `lightllm/models/neo_chat_moe/vision_process.py` not declared in upstream `requirements.txt`. Install manually: `uv pip install pandas`. `setup.sh` handles this automatically.
### `jinja2.exceptions.UndefinedError: 'list object' has no attribute 'startswith'`
HF chat template called on OpenAI-style multimodal content list. Fixed by the `build_prompt_flatten_content.patch` in §6. If you freshly cloned LightLLM and skipped `setup.sh`, apply manually:
```bash
cd evaluation/easi/lightllm-stack/LightLLM
git apply ../patches/build_prompt_flatten_content.patch
```
### `iptables: Permission denied` (during `docker run`)
You are inside an unprivileged container (Kubernetes pod, LXC, chroot). Docker-in-Docker is blocked. Use this native recipe instead — no Docker needed.
### Server launches but first request hangs several minutes
Expected. Triton/CUDA kernel compilation on first invocation. Subsequent calls are cached. Confirm by tailing the server log for `compiling` / `compiled` messages.
### `CUDA out of memory`
Options, in order of impact:
- Reduce `MEM_FRAC` (e.g. `0.7` leaves more headroom).
- Lower `MAX_LEN` (`--max_req_total_len`); the KV cache scales with it.
- Increase `TP` and give more GPUs.
- On VLMEvalKit side, lower `--api-nproc` to cap concurrent active requests.
### Model loads but `/v1/chat/completions` 404s
Check the model served: `curl http://localhost:8000/v1/models`. The `model` field in the request body must match the value listed there (by default `sensenova-u1-8b-mot`; set via the `MODEL_NAME` env var).
### torch version conflicts when activating both venvs
They're deliberately separate. `.venv` uses torch 2.8 (for SenseNova-U1 transformers inference); `.venv-lightllm` uses torch 2.9.1 (LightLLM's pin). Activate only one at a time — never source both into the same shell.
---
## 6) Known patches
Local fixes for LightLLM bugs/gaps we've hit. Tracked under `evaluation/easi/lightllm-stack/patches/`. `setup.sh` applies these automatically and skips if already applied.
| Patch | Fixes | Target file |
| :--- | :--- | :--- |
| `build_prompt_flatten_content.patch` | OpenAI multimodal `content` lists crash the HF chat template (`'list object' has no attribute 'startswith'`). Adds `_flatten_multimodal_content` step that rewrites list-form content into a string with `<image>`/`<audio>` placeholders before `apply_chat_template`; `NeoChatTokenizer.encode` later expands them to `<img>...</img>` with injected image-token IDs. | `lightllm/server/build_prompt.py` |
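To make the flattening step concrete, here is a minimal sketch of the idea. It is not the code in the patch: the function name, the newline separator, and the `audio_url` part type are assumptions chosen for illustration.

```python
def flatten_multimodal_content(content):
    """Rewrite list-form content into a single string with media placeholders."""
    if isinstance(content, str):
        return content  # already a plain string, nothing to flatten
    parts = []
    for item in content:
        kind = item.get("type")
        if kind == "text":
            parts.append(item.get("text", ""))
        elif kind == "image_url":
            parts.append("<image>")
        elif kind == "audio_url":
            parts.append("<audio>")
    return "\n".join(parts)
```

After this rewrite the chat template only ever sees strings, which avoids the `startswith` error on list objects.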
### Recovering after `git submodule update` in `evaluation/easi/lightllm-stack/LightLLM/`
Git submodule updates reset the working tree to the pinned SHA, discarding any applied patches. Re-apply:
```bash
bash evaluation/easi/scripts/setup.sh # idempotent; re-applies any drifted patches
# or, manually:
cd evaluation/easi/lightllm-stack/LightLLM
for p in ../patches/*.patch; do
git apply --reverse --check "$p" 2>/dev/null || git apply "$p"
done
```
If upstream file moves cause `git apply --check` to fail, the patch needs regenerating against the new file — inspect the conflict, reapply the logic manually, and regenerate with `git diff <file> > ../patches/<name>.patch`.
### Contributing fixes upstream
These patches are candidates for upstream PRs to <https://github.com/ModelTC/LightLLM>. The multimodal flatten fix is model-agnostic and should benefit every `/v1/chat/completions` multimodal client.
---
## 7) Recap — shortest path
```bash
# one-time host prereq (needs sudo)
sudo apt-get install -y libnuma1 libnuma-dev
# full install: LightLLM stack + EASI client venv + endpoint registration
bash evaluation/easi/scripts/setup.sh
# launch (auto-downloads weights on first run)
bash evaluation/easi/scripts/serve.sh # 8b-mot → port 8000
# benchmark (second shell, after server up)
source evaluation/easi/EASI/.venv/bin/activate
cd evaluation/easi/EASI
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--benchmarks blink --api-nproc 16
```
# evaluation/easi/ — SenseNova-U1 visual understanding benchmarking
Self-contained subpackage for running benchmark harnesses (VLMEvalKit, EASI, lmms-eval) against a locally-served SenseNova-U1 model.
## Layout
| Path | What |
| :--- | :--- |
| `EASI/` | git submodule → `EvolvingLMMs-Lab/EASI@main`. Contains `VLMEvalKit` + `lmms-eval` as its own nested submodules |
| `lightllm-stack/LightLLM/` | git submodule → `ModelTC/LightLLM@neo_plus_clean` |
| `lightllm-stack/patches/` | local fixes applied on top of LightLLM |
| `config/sensenova_models.py` | **editable** VLMEvalKit model entries — edit to tweak endpoint URLs, ports, `max_tokens`, etc. |
| `patches/easi_sensenova_config.patch` | 7-line hook added to `VLMEvalKit/vlmeval/config.py` that imports the above |
| `scripts/setup.sh` | one-shot install: submodules, deps, venv, patches, VLMEvalKit wire-up, verify |
| `scripts/serve.sh` | launch LightLLM server. `DP=1` (default) → single instance. `DP>1` → N replicas + LB on the canonical port |
| `scripts/lb.py` | least-in-flight HTTP LB used by `serve.sh` when `DP>1` |
| `scripts/serve_lb.sh` | deprecated shim that forwards to `serve.sh` |
| `scripts/download_weights.sh` | standalone HF weight fetcher |
Weights land in `<repo_root>/models/` (gitignored). Serving venv `<repo_root>/.venv-lightllm/` (gitignored, separate from the main repo `.venv`).
## Quickstart
```bash
# one-time host lib (needs sudo)
sudo apt-get install -y libnuma1 libnuma-dev
# install EVERYTHING — LightLLM stack + EASI client + endpoint registration.
# Idempotent. First run takes several minutes (builds flash-attn for the EASI venv).
bash evaluation/easi/scripts/setup.sh
# launch server — auto-downloads weights on first run
MODEL=8b-mot GPUS=0,1 TP=2 bash evaluation/easi/scripts/serve.sh # → localhost:8000
# run a benchmark from a second shell
source evaluation/easi/EASI/.venv/bin/activate
cd evaluation/easi/EASI
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--benchmarks blink \
--api-nproc 16
```
Setup modes:
```bash
bash evaluation/easi/scripts/setup.sh # full install
bash evaluation/easi/scripts/setup.sh --skip-lightllm # bring your own endpoint
bash evaluation/easi/scripts/setup.sh --skip-easi # LightLLM side only
bash evaluation/easi/scripts/setup.sh --skip-register # no auto VLMEvalKit wiring
```
### Bring-your-own endpoint
If you already have a SenseNova-U1 OpenAI-compatible endpoint (docker container on another host, infra team API, production deployment, etc.), skip the LightLLM install entirely:
#### 1. Install only EASI + the VLMEvalKit wiring
```bash
bash evaluation/easi/scripts/setup.sh --skip-lightllm
```
Skips: host prereq checks, `.venv-lightllm` creation, LightLLM dep install, LightLLM patches, api_server CLI verification. LightLLM submodule is NOT initialized. Only the EASI submodule (+ its nested VLMEvalKit / lmms-eval) gets pulled.
#### 2. Point `config/sensenova_models.py` at your endpoint
Edit the `entries` dict — change `api_base` to wherever your endpoint actually lives. Rename the dict key if you want a clearer label in `run_easi_eval.py --model …`. Full schema + `GPT4V` kwargs documented in [`GUIDE.md`](GUIDE.md) — see the "Configuring a custom OpenAI-compatible endpoint" section.
Quick example — single remote endpoint:
```python
# evaluation/easi/config/sensenova_models.py
from functools import partial
from vlmeval.api.gpt import GPT4V # type: ignore[import-not-found]
entries = {
"SenseNova-U1-8B-MoT-Prod": partial(
GPT4V,
model="sensenova-u1-8b-mot",
api_base="https://your.host.example.com/v1/chat/completions",
key="sk-your-real-token-or-dummy",
temperature=0,
max_tokens=32768,
retry=10,
verbose=False,
),
}
```
#### 3. Propagate the edit into VLMEvalKit
```bash
bash evaluation/easi/scripts/setup.sh --skip-lightllm --skip-easi
```
This re-copies `sensenova_models.py` into `EASI/VLMEvalKit/vlmeval/`. Fast — no reinstalls.
#### 4. Run benchmarks
Same as the local-server path — just use whatever key you set in step 2:
```bash
source evaluation/easi/EASI/.venv/bin/activate
cd evaluation/easi/EASI
# single benchmark
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Prod \
--output-dir eval_results_prod_viewspatial \
--api-nproc 16 \
--benchmarks viewspatial
# full EASI-8 suite
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Prod \
--output-dir eval_results_prod \
--api-nproc 16
# multiple specific benchmarks
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Prod \
--benchmarks viewspatial,blink,3dsrbench \
--api-nproc 16 \
--output-dir eval_results_prod
```
Tune `--api-nproc` based on your endpoint's capacity. Remote endpoints with rate limits: start low (4-8) and ramp up. Strong production backends behind an LB: 32-64.
#### Notes
- `scripts/serve.sh` won't work under `--skip-lightllm` — that's the point. Your endpoint lives elsewhere.
- If the endpoint exposes multiple SenseNova-U1 variants (or you have several endpoints to benchmark side-by-side), add multiple entries to `sensenova_models.py` — each gets its own model key.
- The `GPT4V` wrapper handles HTTP retries (on 5xx / timeout) automatically and encodes images as base64 data URIs; no client-side prep is needed.
## Model → port map
| `MODEL` arg | HF repo | Server port | Reasoning parser |
| :--- | :--- | :---: | :--- |
| `8b-mot` (default) | `sensenova/SenseNova-U1-8B-MoT` | 8000 | `qwen3` (strips `<think>`) |
Override the port with `PORT=<n>`.
## Multi-replica (DP) serving behind a load balancer
`serve.sh` auto-switches to multi-replica mode when `DP > 1`: launches N tp-sharded LightLLM replicas on backend ports + a Python load balancer on the canonical port. **Same port as `DP=1`** — VLMEvalKit config never needs to change when scaling up/down.
```bash
# 4 replicas × tp=2 on 8 GPUs — higher throughput for many short requests
DP=4 TP=2 bash evaluation/easi/scripts/serve.sh
# 2 replicas × tp=4
DP=2 TP=4 bash evaluation/easi/scripts/serve.sh
```
Only one serve.sh process per model at a time — both `DP=1` and `DP>1` bind the same canonical port.
Port layout:
| MODEL | LB (client-facing) | Backends |
| :--- | :---: | :--- |
| `8b-mot` | 8000 | 8100, 8110, 8120, 8130 (step 10) |
Override with `LB_PORT=...` or `BACKEND_BASE_PORT=...`.
Direct hits to a backend port (e.g. `http://localhost:<backend_port>/v1/models`) still work, which is useful for debugging one specific replica.
Sanity-check guardrails (fail fast, no partial launches):
- `DP * TP <= # visible GPUs` (from `nvidia-smi`) unless `GPUS=...` overrides
- If `GPUS=...` provided, must contain exactly `DP * TP` entries
- `LB_PORT` must not collide with any backend port in `[BACKEND_BASE_PORT, BACKEND_BASE_PORT + 10*(DP-1)]`
- `MODEL` must be `8b-mot`
- `DP`, `TP`, `LB_PORT`, `BACKEND_BASE_PORT` must be integers ≥ 1
- **Pre-flight port probe**: every port (LB + all backends) must be free. Stale processes from a previous run are detected before any replica is launched, with `ss -lntp` / `lsof` output naming the owner when possible
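The arithmetic behind the port layout and the first few guardrails can be sketched as follows. `backend_ports` and `check_layout` are hypothetical names; the step of 10 and the example ports come from the table above, and the pre-flight port probe is deliberately left out.

```python
def backend_ports(base: int, dp: int, step: int = 10) -> list[int]:
    """Backend ports for `dp` replicas: base, base + step, base + 2 * step, ..."""
    return [base + step * i for i in range(dp)]


def check_layout(lb_port: int, base: int, dp: int, tp: int, visible_gpus: int) -> list[int]:
    """Validate a DP x TP layout and return the backend ports, or raise ValueError."""
    for name, value in (("DP", dp), ("TP", tp), ("LB_PORT", lb_port), ("BACKEND_BASE_PORT", base)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be an integer >= 1, got {value!r}")
    if dp * tp > visible_gpus:
        raise ValueError(f"DP*TP = {dp * tp} exceeds {visible_gpus} visible GPU(s)")
    ports = backend_ports(base, dp)
    if lb_port in ports:
        raise ValueError(f"LB_PORT {lb_port} collides with a backend port")
    return ports
```

With `DP=4`, `TP=2`, and a base of 8100, the result matches the backend column in the port layout table.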
Balancing: least in-flight. Streaming passthrough. Per-request timeout 30 min (override `LB_REQUEST_TIMEOUT`).
Health: each replica probed every 10s via `GET /v1/models`. Unhealthy backends skipped but not evicted; rejoin when probes pass. Monitor at `GET http://localhost:<LB_PORT>/_lb/status`.
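The selection rule (least in-flight, skipping unhealthy replicas without removing them) can be illustrated in a few lines. This is a sketch of the policy, not the code in `scripts/lb.py`; the `Backend` dataclass and `pick_backend` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    url: str
    in_flight: int = 0
    healthy: bool = True


def pick_backend(backends: list[Backend]) -> Backend:
    """Choose the healthy backend with the fewest requests in flight."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends")
    return min(candidates, key=lambda b: b.in_flight)
```

An unhealthy backend stays in the list, so it becomes eligible again as soon as its `healthy` flag flips back.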
Per-replica logs land at `evaluation/easi/logs/lightllm-<MODEL>-<port>.log` (gitignored). Override with `LOG_DIR=...`. Ctrl-C / SIGTERM on the `serve.sh` shell cascade-kills all replicas + LB.
### Process hygiene
serve.sh won't leak zombie GPU procs:
- Each replica + LB is launched via `setsid` into its own process group. Cleanup hits `kill -TERM -$pgid` → 10 s grace → `kill -KILL -$pgid` → the whole tree (router, tp workers, visual server, zmq, detokenizer) goes down.
- Trap covers `EXIT INT TERM HUP`, not just signals — catches unexpected errors too.
- Belt + suspenders: `pkill -P $$` (our direct children) + `pkill -f "lightllm.server.api_server.*$MODEL_DIR"` (escaped orphans tied to this model) run after the grace period.
- PID file at `$LOG_DIR/serve.<MODEL>.pids` records pgids. On next `serve.sh` launch, any stale entries get TERM+KILL automatically before the new run starts.
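The process-group technique in the first bullet (own session, TERM to the group, grace period, then KILL) looks like this in Python on a POSIX host. It is an illustration of the pattern, not a port of `serve.sh`; `launch_in_own_group` and `stop_group` are hypothetical names.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(cmd: list[str]) -> subprocess.Popen:
    """Start cmd as the leader of a new session, and therefore a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc: subprocess.Popen, grace: float = 10.0) -> int:
    """Send TERM to the whole group, wait up to `grace` seconds, then escalate to KILL."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Demo: a sleeping child stands in for a LightLLM replica.
    child = launch_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_group(child, grace=5.0))
```

Signalling the group rather than the single PID is what takes down helper processes that the replica spawned itself.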
If you still end up with zombies (container crashed / SIGKILL'd), run:
```bash
pkill -KILL -f "lightllm.server.api_server.*SenseNova-U1"
```
For GPU memory held by processes in another PID namespace (the container was torn down and recreated), only a host-side `kill` or a pod restart can recover it.
### Debugging / verbose logging
`serve.sh` accepts:
| Env | Effect |
| :--- | :--- |
| `DETAIL_LOG=1` | Adds `--detail_log` to LightLLM — logs per-request timing, prompt text, token IDs, and (for multimodal) per-image ViT inference timings |
| `LIGHTLLM_LOG_LEVEL=debug` | Drops LightLLM's root logger to DEBUG. Everything that's wrapped in `logger.debug(...)` now prints (lots of internal signal: KV cache state, router scheduling, detokenization). Default is `info` |
```bash
# verbose single instance
DETAIL_LOG=1 LIGHTLLM_LOG_LEVEL=debug \
MODEL=8b-mot GPUS=0,1 TP=2 bash evaluation/easi/scripts/serve.sh
# verbose multi-replica (env flows to every replica)
DETAIL_LOG=1 LIGHTLLM_LOG_LEVEL=debug \
DP=4 TP=2 MODEL=8b-mot bash evaluation/easi/scripts/serve.sh
# tail one replica's log (multi-replica mode; backend ports start at 8100, step 10)
tail -f evaluation/easi/logs/lightllm-8b-mot-8100.log
```
Use these settings when a benchmark reports 100% API failures or suspicious per-sample outputs; the full HTTP/tokenization trace then surfaces on the server side.
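
When debugging a multi-replica run, the JSON returned by `/_lb/status` (keys `backends`, `healthy`, `inflight`) is a quick first check before reading replica logs. The sketch below only summarizes an already-fetched payload; the helper name and the sample values are assumptions for illustration.

```python
def summarize_lb_status(status: dict) -> dict:
    """Condense an /_lb/status payload into the fields worth eyeballing."""
    backends = status.get("backends", [])
    healthy = status.get("healthy", {})
    inflight = status.get("inflight", {})
    down = [b for b in backends if not healthy.get(b, False)]
    return {
        "total": len(backends),
        "down": down,
        "inflight_total": sum(inflight.get(b, 0) for b in backends),
    }


# Hypothetical payload for DP=2 with one replica failing its probes.
sample = {
    "backends": ["http://localhost:8100", "http://localhost:8110"],
    "healthy": {"http://localhost:8100": True, "http://localhost:8110": False},
    "inflight": {"http://localhost:8100": 12, "http://localhost:8110": 0},
}
summary = summarize_lb_status(sample)
print(summary)
```

A non-empty `down` list points at the replica whose log file to tail first.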
## Running benchmarks
Activate the EASI client venv and work from `evaluation/easi/EASI`:
```bash
source evaluation/easi/EASI/.venv/bin/activate
cd evaluation/easi/EASI
```
### Single benchmark
```bash
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--output-dir eval_results_sensenova-u1-8b-mot_viewspatial \
--api-nproc 16 \
--benchmarks viewspatial
```
### Full EASI-8 suite (omit `--benchmarks`)
```bash
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--output-dir eval_results_sensenova-u1-8b-mot \
--api-nproc 16
```
### Multiple benchmarks at once
```bash
python scripts/submissions/run_easi_eval.py \
--model SenseNova-U1-8B-MoT-Local \
--benchmarks viewspatial,blink,3dsrbench \
--api-nproc 16 \
--output-dir eval_results_sensenova-u1-8b-mot
```
### Benchmark keys (EASI-8 core)
`vsi_bench`, `mmsi_bench`, `mindcube_tiny`, `viewspatial`, `site_image`, `site_video`, `blink`, `3dsrbench`, `embspatial`. The list has nine keys because `site_image` and `site_video` together count as one benchmark (`sitebench`).
Aliases: `site` = `site_image + site_video`. Group name: `sitebench`. Extra (opt-in via `--include-extra`): `mmsi_video_bench`, `omnispatial_(manual_cot)`, `spar_bench`, `vsi_debiased`.
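
To make the alias rule concrete, here is a minimal sketch of how a comma-separated `--benchmarks` value could be expanded. The real resolution happens inside `run_easi_eval.py` and may differ; `ALIASES` and `expand_benchmarks` are hypothetical names used only for this illustration.

```python
# Only the alias documented above; other groupings are not modelled here.
ALIASES = {"site": ["site_image", "site_video"]}


def expand_benchmarks(spec: str) -> list:
    """Expand aliases in a comma-separated key list, dropping duplicates."""
    expanded = []
    for key in (part.strip() for part in spec.split(",")):
        if not key:
            continue
        for name in ALIASES.get(key, [key]):
            if name not in expanded:
                expanded.append(name)
    return expanded


keys = expand_benchmarks("viewspatial,site,site_image")
print(keys)
```

Order is preserved and repeated keys collapse, so passing both an alias and one of its members does not schedule a benchmark twice.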
### Useful flags
| Flag | Purpose |
| :--- | :--- |
| `--api-nproc N` | Concurrent HTTP requests to the LightLLM server. 16-32 works on 8× H100 with `tp=8`; lower it if you see timeouts or HTTP 500s |
| `--no-judge` | Skip LLM-judge re-eval, trust `exact_matching` scores. Faster; less accurate for free-form answers |
| `--rerun` | Force re-evaluation, bypass resume |
| `--verbose` | Print per-sample model responses |
| `--include-extra` | Also run the extra (non-EASI-8) benchmarks |
| `--submit` | Push results to EASI leaderboard (requires `HF_TOKEN`) |
| `--nproc N` | torchrun DP (for local-model backends only, ignored for API endpoints) |
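
For scripted sweeps it can help to assemble the command line programmatically. The sketch below builds an argument list from the flags in the table above; `build_eval_cmd` is a hypothetical helper, and it only covers a subset of the flags.

```python
import shlex


def build_eval_cmd(model, output_dir, benchmarks=None, api_nproc=16,
                   no_judge=False, rerun=False):
    """Return the argv list for one run_easi_eval.py invocation."""
    cmd = [
        "python", "scripts/submissions/run_easi_eval.py",
        "--model", model,
        "--output-dir", output_dir,
        "--api-nproc", str(api_nproc),
    ]
    if benchmarks:
        # Omitting --benchmarks runs the full EASI-8 suite.
        cmd += ["--benchmarks", ",".join(benchmarks)]
    if no_judge:
        cmd.append("--no-judge")
    if rerun:
        cmd.append("--rerun")
    return cmd


cmd = build_eval_cmd(
    "SenseNova-U1-8B-MoT-Local",
    "eval_results_sensenova-u1-8b-mot",
    benchmarks=["viewspatial", "blink"],
    no_judge=True,
)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting concerns.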
## Tweaking VLMEvalKit endpoint config
Endpoints live in `config/sensenova_models.py` — a committed, in-repo Python module with the `partial(GPT4V, ...)` entries. Edit it, re-run `setup.sh` (idempotent, a few seconds if `--skip-easi`), and `supported_VLM` picks up the change on next interpreter start.
```bash
$EDITOR evaluation/easi/config/sensenova_models.py # change max_tokens / temperature / URLs
bash evaluation/easi/scripts/setup.sh --skip-easi # propagate
```
Full schema of the `entries` dict and `GPT4V` kwarg reference (including `img_detail`, `timeout`, `system_prompt`, thinking-mode subclass pattern, remote-endpoint examples): [`GUIDE.md` §4](GUIDE.md#configuring-a-custom-openai-compatible-endpoint).
The patch at `patches/easi_sensenova_config.patch` only adds a short hook to `VLMEvalKit/vlmeval/config.py` that runs `from .sensenova_models import entries` and `supported_VLM.update(entries)` inside a `try`/`except`, so you shouldn't need to touch it.
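
The `partial(GPT4V, ...)` pattern can be tried without VLMEvalKit installed. In the sketch below `StubGPT4V` is a stand-in class (an assumption, not the real `vlmeval.api.gpt.GPT4V`), and the second entry name is hypothetical; the keyword arguments are the ones used in `config/sensenova_models.py`.

```python
from functools import partial


class StubGPT4V:
    """Stand-in for GPT4V so the sketch runs outside VLMEvalKit."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


# Settings shared by every variant that targets the local server.
base = dict(
    model="sensenova-u1-8b-mot",
    api_base="http://localhost:8000/v1/chat/completions",
    key="dummy",
    temperature=0,
    retry=10,
    verbose=False,
)

entries = {
    "SenseNova-U1-8B-MoT-Local": partial(StubGPT4V, max_tokens=8192, **base),
    # Hypothetical extra variant with a smaller token budget.
    "SenseNova-U1-8B-MoT-Local-Short": partial(StubGPT4V, max_tokens=2048, **base),
}

# Nothing is constructed until the entry is called.
model = entries["SenseNova-U1-8B-MoT-Local-Short"]()
print(model.kwargs["max_tokens"])
```

Because each entry is a `partial`, registering many variants is cheap: the wrapper object is only instantiated when a run selects that model name.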
## Full guide
See [`GUIDE.md`](GUIDE.md) — covers host prereqs, dependency filtering, patch workflow, VLMEvalKit wiring, thinking-mode handling, troubleshooting.
## Image generation benchmarks
Not covered here. Requires the full LightLLM + LightX2V stack (via the upstream Docker image `lightx2v/lightllm_lightx2v:20260407` — see [`../../docs/deployment_CN.md`](../../docs/deployment_CN.md)). LightX2V pins `torch<=2.8.0` which conflicts with LightLLM's `torch==2.9.1`, so image-gen serving lives in a separate environment.
"""Local SenseNova-U1 LightLLM endpoints for VLMEvalKit.

Edit this file to tweak endpoint URLs, ports, max_tokens, temperature, retry,
or to add new variants. Changes take effect the next time VLMEvalKit imports
`vlmeval.config` (typically on the next invocation of `run_easi_eval.py`).

Setup mechanism
---------------
`evaluation/easi/scripts/setup.sh` copies this file into the EASI submodule's
VLMEvalKit as `vlmeval/sensenova_models.py`, then applies a small patch to
`vlmeval/config.py` that does:

    from .sensenova_models import entries as _sensenova_u1_entries
    supported_VLM.update(_sensenova_u1_entries)

So any edits here just need a re-run of setup.sh (idempotent) to propagate.

Port assignments MUST match `evaluation/easi/scripts/serve.sh`:
    8b-mot -> 8000  (thinking/reasoning variant)
"""

from functools import partial

# This import only resolves once setup.sh has copied this file into
# evaluation/easi/EASI/VLMEvalKit/vlmeval/. Linter warnings in-tree are expected.
from vlmeval.api.gpt import GPT4V  # type: ignore[import-not-found]

entries = {
    "SenseNova-U1-8B-MoT-Local": partial(
        GPT4V,
        model="sensenova-u1-8b-mot",
        api_base="http://localhost:8000/v1/chat/completions",
        key="dummy",
        temperature=0,
        max_tokens=8192,  # thinking mode needs headroom
        retry=10,
        verbose=False,
    ),
}
diff --git a/lightllm/server/build_prompt.py b/lightllm/server/build_prompt.py
index 21c1cde6..843f2832 100644
--- a/lightllm/server/build_prompt.py
+++ b/lightllm/server/build_prompt.py
@@ -63,11 +63,37 @@ def _normalize_tool_call_arguments(messages: list) -> None:
pass
+def _flatten_multimodal_content(messages: list) -> None:
+    # OpenAI-style chat requests may send content as a list of parts
+    # ({"type": "text", ...}, {"type": "image_url", ...}). Most HF chat
+    # templates (Qwen, LLaMA, ...) call string methods on `content` directly
+    # and crash on lists. Flatten into a plain string with placeholder tokens
+    # so the template can render; the image data itself is already extracted
+    # upstream via `_get_images_and_audios` and fed through MultimodalParams.
+    for msg in messages:
+        content = msg.get("content")
+        if not isinstance(content, list):
+            continue
+        parts: list = []
+        for item in content:
+            if not isinstance(item, dict):
+                continue
+            t = item.get("type")
+            if t == "text":
+                parts.append(item.get("text", "") or "")
+            elif t == "image_url":
+                parts.append("<image>")
+            elif t == "audio_url":
+                parts.append("<audio>")
+        msg["content"] = "\n".join(p for p in parts if p != "")
+
+
 async def build_prompt(request, tools) -> str:
     global tokenizer
     # pydantic格式转成dict, 否则,当根据tokenizer_config.json拼template时,Jinja判断无法识别
     messages = [m.model_dump(by_alias=True, exclude_none=True) for m in request.messages]
     _normalize_tool_call_arguments(messages)
+    _flatten_multimodal_content(messages)
     kwargs = {"conversation": messages}
     if request.character_settings:
diff --git a/vlmeval/config.py b/vlmeval/config.py
index e6ae86c..6f35947 100644
--- a/vlmeval/config.py
+++ b/vlmeval/config.py
@@ -2170,3 +2170,12 @@ model_groups.extend([bagel_series, spatial_related_models, sensenova_si_series])
 for grp in model_groups:
     supported_VLM.update(grp)
+
+# --- SenseNova-U1 local LightLLM endpoints (injected by SenseNova-U1 repo) ---
+# Edit: evaluation/easi/config/sensenova_models.py (then re-run setup.sh)
+try:
+    from .sensenova_models import entries as _sensenova_u1_entries
+    supported_VLM.update(_sensenova_u1_entries)
+except Exception as _sensenova_u1_err:  # pragma: no cover
+    import sys as _sys
+    print(f"[sensenova-u1-config] failed to load local entries: {_sensenova_u1_err}", file=_sys.stderr)
#!/usr/bin/env bash
# Download SenseNova-U1 weights from HuggingFace into <repo_root>/models/.
#
# Usage:
# bash evaluation/easi/scripts/download_weights.sh 8b-mot # sensenova/SenseNova-U1-8B-MoT (reasoning)
#
# Requires: .venv-lightllm activated (has huggingface_hub installed).
# First-time use: `uv pip install huggingface_hub` if `hf` command not found.
# Optional: export HF_TOKEN=... if the repo gates downloads.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
VENV_DIR="${REPO_ROOT}/.venv-lightllm"
MODELS_DIR="${REPO_ROOT}/models"
mkdir -p "${MODELS_DIR}"
# Auto-activate .venv-lightllm if it exists and we're not already in it.
# Avoids picking up `hf` from an arbitrary venv that may lack hf_transfer etc.
if [ -d "${VENV_DIR}" ] && { [ -z "${VIRTUAL_ENV:-}" ] || [ "${VIRTUAL_ENV}" != "${VENV_DIR}" ]; }; then
# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
fi
# If HF_HUB_ENABLE_HF_TRANSFER=1 is set but hf_transfer isn't importable, fall
# back to plain HTTP downloads rather than crashing mid-file.
if [ "${HF_HUB_ENABLE_HF_TRANSFER:-0}" = "1" ]; then
if ! python -c "import hf_transfer" >/dev/null 2>&1; then
echo "[warn] HF_HUB_ENABLE_HF_TRANSFER=1 but hf_transfer not installed — disabling for this run" >&2
unset HF_HUB_ENABLE_HF_TRANSFER
fi
fi
# Prefer the new `hf` CLI (huggingface_hub >= 0.34). Fall back to huggingface-cli.
if command -v hf >/dev/null 2>&1; then
DL="hf download"
elif command -v huggingface-cli >/dev/null 2>&1; then
DL="huggingface-cli download"
else
echo "[error] neither 'hf' nor 'huggingface-cli' found. Activate .venv-lightllm and run: uv pip install huggingface_hub" >&2
exit 1
fi
download() {
local repo_id="$1"
local local_dir="${MODELS_DIR}/$(basename "${repo_id}")"
echo "[download] ${repo_id} -> ${local_dir}"
${DL} "${repo_id}" --local-dir "${local_dir}"
}
target="${1:-8b-mot}"
case "${target}" in
8b-mot) download "sensenova/SenseNova-U1-8B-MoT" ;;
*)
echo "[error] unknown target: ${target}. Use: 8b-mot" >&2
exit 1
;;
esac
echo "[done] weights at ${MODELS_DIR}"
"""Least-in-flight HTTP load balancer for multiple LightLLM server replicas.

Purpose: front N tp-sharded LightLLM instances behind a single OpenAI-compatible
endpoint, so VLMEvalKit / EASI can treat them as one higher-throughput server
without any client-side changes.

Backend selection: least in-flight requests; ties go to the first backend in
BACKENDS order.
Health checks: periodic GET /v1/models against each backend; unhealthy backends
are skipped (but never permanently removed; they rejoin when they recover).
Streaming: passthrough via a streamed httpx response (`client.send(...,
stream=True)`) + StreamingResponse, so long SSE works.

Configuration (env):
    BACKENDS            Comma-separated backend base URLs, e.g.
                        "http://localhost:8000,http://localhost:8010"
                        (REQUIRED)
    LB_HOST             Bind address (default: 0.0.0.0)
    LB_PORT             Listen port (default: 9000)
    LB_REQUEST_TIMEOUT  Seconds per proxied request (default: 1800 = 30 min)
    LB_HEALTH_INTERVAL  Seconds between health probes (default: 10)
    LB_STARTUP_TIMEOUT  Max wait for first healthy (default: 600)

Run via `serve.sh` (DP>1) or directly:
    BACKENDS=http://localhost:8000,http://localhost:8010 python lb.py
"""
from __future__ import annotations
import asyncio
import os
import sys
import time
from contextlib import asynccontextmanager
import httpx
import uvicorn
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
BACKENDS: list[str] = [b.strip().rstrip("/") for b in os.environ.get("BACKENDS", "").split(",") if b.strip()]
if not BACKENDS:
    print("[lb] ERROR: BACKENDS env var required (comma-separated base URLs)", file=sys.stderr)
    sys.exit(1)
LB_HOST = os.environ.get("LB_HOST", "0.0.0.0")
LB_PORT = int(os.environ.get("LB_PORT", "9000"))
REQUEST_TIMEOUT = float(os.environ.get("LB_REQUEST_TIMEOUT", "1800"))
HEALTH_INTERVAL = float(os.environ.get("LB_HEALTH_INTERVAL", "10"))
STARTUP_TIMEOUT = float(os.environ.get("LB_STARTUP_TIMEOUT", "600"))
# Per-backend state: in-flight count + health flag.
inflight: dict[str, int] = {b: 0 for b in BACKENDS}
healthy: dict[str, bool] = {b: False for b in BACKENDS}


async def _probe_once(client: httpx.AsyncClient, backend: str) -> bool:
    """Return True if the backend looks like a LightLLM server.

    Pitfall: any random HTTP listener on the same port would answer, so we
    can't just check "did the connection succeed". Instead, hit /v1/models
    and require the response to look like LightLLM: we key off the
    `Server: hypercorn-h11` header which LightLLM emits (hypercorn is its
    ASGI server) and upstream OpenAI-compat servers generally don't use
    hypercorn.

    Falls back to accepting any non-HTML response when the Server header
    check is ambiguous (for example, stripped by a reverse proxy).
    """
    try:
        # /v1/models currently 500s on LightLLM due to an upstream pydantic
        # bug in ModelCard (owned_by=None). That's actually fine for our
        # probe: we just need a response.
        resp = await client.get(f"{backend}/v1/models", timeout=5.0)
    except Exception:
        # Connection refused, timeouts, and any other transport error all
        # count as "not healthy".
        return False

    # Require the Server header to mention hypercorn; this distinguishes a
    # real LightLLM from a stray process on the same port.
    server_hdr = resp.headers.get("server", "").lower()
    if "hypercorn" in server_hdr:
        return True

    # Fallback: some reverse proxies strip Server headers. Accept any
    # response, whatever its status code, unless it is HTML (for example a
    # page served by a static file server).
    ctype = resp.headers.get("content-type", "").lower()
    if "html" in ctype:
        return False
    return True


async def health_loop(client: httpx.AsyncClient):
    """Periodically probe each backend; flip the healthy dict."""
    while True:
        for b in BACKENDS:
            ok = await _probe_once(client, b)
            if ok != healthy[b]:
                print(f"[lb] backend {b}: {'UP' if ok else 'DOWN'}", flush=True)
            healthy[b] = ok
        await asyncio.sleep(HEALTH_INTERVAL)


async def wait_for_first_healthy(client: httpx.AsyncClient):
    """Block until at least one backend passes a probe, or STARTUP_TIMEOUT."""
    deadline = time.monotonic() + STARTUP_TIMEOUT
    print(f"[lb] waiting for at least one healthy backend (timeout {STARTUP_TIMEOUT:.0f}s)...", flush=True)
    while time.monotonic() < deadline:
        for b in BACKENDS:
            if await _probe_once(client, b):
                healthy[b] = True
                print(f"[lb] first healthy backend: {b}", flush=True)
                return
        await asyncio.sleep(2.0)
    print(
        f"[lb] WARNING: no backend became healthy within {STARTUP_TIMEOUT:.0f}s; serving anyway",
        file=sys.stderr,
        flush=True,
    )


def pick_backend() -> str | None:
    """Least in-flight among healthy backends; ties broken by first-seen order."""
    candidates = [b for b in BACKENDS if healthy[b]]
    if not candidates:
        # All unhealthy: fall back to any backend (will error back to client).
        candidates = BACKENDS
    return min(candidates, key=lambda b: inflight[b])


@asynccontextmanager
async def lifespan(app: FastAPI):
    # One shared client used for both proxying and health checks; large pool
    # so we can overlap many vlmevalkit workers.
    limits = httpx.Limits(max_keepalive_connections=128, max_connections=256)
    client = httpx.AsyncClient(timeout=httpx.Timeout(REQUEST_TIMEOUT), limits=limits)
    app.state.client = client
    await wait_for_first_healthy(client)
    health_task = asyncio.create_task(health_loop(client))
    try:
        yield
    finally:
        health_task.cancel()
        await client.aclose()


app = FastAPI(lifespan=lifespan)


@app.get("/_lb/status")
async def lb_status():
    return {
        "backends": BACKENDS,
        "healthy": healthy,
        "inflight": inflight,
    }


@app.api_route("/{full_path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS", "HEAD"])
async def proxy(full_path: str, request: Request):
    backend = pick_backend()
    if backend is None:
        return JSONResponse({"error": "no backends configured"}, status_code=503)

    url = f"{backend}/{full_path}"
    if request.url.query:
        url = f"{url}?{request.url.query}"

    # Strip hop-by-hop + host headers; pass the rest through.
    drop = {"host", "content-length", "connection", "transfer-encoding"}
    fwd_headers = {k: v for k, v in request.headers.items() if k.lower() not in drop}
    body = await request.body()
    client: httpx.AsyncClient = request.app.state.client

    inflight[backend] += 1
    try:
        upstream_req = client.build_request(
            request.method,
            url,
            content=body if body else None,
            headers=fwd_headers,
        )
        upstream_resp = await client.send(upstream_req, stream=True)
    except Exception as e:
        inflight[backend] -= 1
        return JSONResponse({"error": f"backend {backend} unreachable: {e}"}, status_code=502)

    async def body_iter():
        try:
            async for chunk in upstream_resp.aiter_raw():
                yield chunk
        finally:
            # Runs when the stream ends or the client disconnects mid-stream.
            await upstream_resp.aclose()
            inflight[backend] -= 1

    # Filter response headers (don't let chunked/transfer-encoding pass through verbatim).
    out_headers = {k: v for k, v in upstream_resp.headers.items() if k.lower() not in drop}
    return StreamingResponse(
        body_iter(),
        status_code=upstream_resp.status_code,
        headers=out_headers,
        media_type=upstream_resp.headers.get("content-type"),
    )


if __name__ == "__main__":
    print(f"[lb] backends: {BACKENDS}", flush=True)
    print(f"[lb] listening on http://{LB_HOST}:{LB_PORT} (status: /_lb/status)", flush=True)
    uvicorn.run(app, host=LB_HOST, port=LB_PORT, log_level="info", access_log=False)
#!/usr/bin/env bash
# Launch LightLLM OpenAI-compatible server(s) for SenseNova-U1.
#
# Two modes, picked automatically by DP:
#
# DP=1 (default) — single LightLLM instance bound directly to the canonical
# per-model port (8b-mot → 8000).
# No load balancer, no extra process.
#
# DP>1 — DP tp-sharded replicas on backend ports (8100+, step 10)
# + a Python load balancer on the canonical port.
# Clients always hit the same port regardless of DP, so
# VLMEvalKit config.py never needs to change.
#
# Total GPUs used = DP × TP, assigned contiguously from GPU 0 unless GPUS set.
#
# Usage:
# bash evaluation/easi/scripts/serve.sh # DP=1, 8b-mot, GPUs 0-1, port 8000
# TP=8 GPUS=0,1,2,3,4,5,6,7 bash evaluation/easi/scripts/serve.sh # single big instance
# DP=4 TP=2 bash evaluation/easi/scripts/serve.sh # 4 replicas × tp=2, LB on 8000
#
# Env vars:
# MODEL 8b-mot (default: 8b-mot)
# MODEL_DIR explicit path, overrides MODEL (default: ./models/SenseNova-U1-8B-MoT)
# DP # of replicas (default: 1)
# TP tensor parallel degree / replica (default: 2)
# GPUS CSV of CUDA_VISIBLE_DEVICES ids (default: 0,1,...,DP*TP-1)
# HOST bind address (default: 0.0.0.0)
# LB_PORT canonical client-facing port (default: 8000)
# (alias: PORT — honored for backcompat when DP=1)
# BACKEND_BASE_PORT first backend port when DP>1 (default: 8100, step 10)
# MAX_LEN --max_req_total_len (default: 32768)
# MEM_FRAC --mem_fraction (default: 0.85)
# MODEL_NAME advertised model name (default: per-model)
# REASONING --reasoning_parser (default: qwen3)
# qwen3: strips <think>...</think> into reasoning_content
# qwen3-thinking: force reasoning even w/o <think> tag
# "" (empty): disable parser (raw content)
# NO_AUTO_DL 1 = skip weight auto-download (default: unset)
# DETAIL_LOG 1 = --detail_log (per-request DEBUG: timing, prompt, tokens, ViT)
# LIGHTLLM_LOG_LEVEL debug|info|warning|error (default: info)
# LOG_DIR dir for replica log files (DP>1) (default: evaluation/easi/logs/)
#
# Guardrails:
# - DP*TP > #visible GPUs → fail fast (or set GPUS=... explicitly)
# - GPUS count must equal DP*TP
# - Any port (LB + backends) already in use → fail fast with owner PID when discoverable
# - DP=1 vs DP>1 are mutually exclusive on the canonical port — don't run both
# concurrently for the same MODEL
#
# Notes:
# - Ctrl-C / SIGTERM cascade-kills all replicas + LB when DP>1.
# - Per-replica logs at $LOG_DIR/lightllm-<MODEL>-<port>.log (DP>1 only).
# - LB passes through health probes + streaming; least in-flight balancing.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
EASI_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
REPO_ROOT="$(cd "${EASI_ROOT}/../.." && pwd)"
cd "${REPO_ROOT}"
VENV_DIR="${REPO_ROOT}/.venv-lightllm"
# ---------------------------------------------------------------------------
# 1) Resolve model defaults
# ---------------------------------------------------------------------------
MODEL="${MODEL:-8b-mot}"
case "${MODEL}" in
8b-mot)
DEFAULT_DIR="${REPO_ROOT}/models/SenseNova-U1-8B-MoT"
DEFAULT_LB_PORT=8000
DEFAULT_BACKEND_BASE=8100
DEFAULT_MODEL_NAME="sensenova-u1-8b-mot"
DEFAULT_REASONING="qwen3"
;;
*)
echo "[error] MODEL must be '8b-mot' (got: ${MODEL})" >&2
exit 1
;;
esac
MODEL_DIR="${MODEL_DIR:-${DEFAULT_DIR}}"
DP="${DP:-1}"
TP="${TP:-2}"
HOST="${HOST:-0.0.0.0}"
# PORT is kept as an alias for LB_PORT for backcompat. LB_PORT wins if both set.
LB_PORT="${LB_PORT:-${PORT:-${DEFAULT_LB_PORT}}}"
BACKEND_BASE_PORT="${BACKEND_BASE_PORT:-${DEFAULT_BACKEND_BASE}}"
MAX_LEN="${MAX_LEN:-32768}"
MEM_FRAC="${MEM_FRAC:-0.85}"
MODEL_NAME="${MODEL_NAME:-${DEFAULT_MODEL_NAME}}"
REASONING="${REASONING-${DEFAULT_REASONING}}" # unset → default; "" → user disable
DETAIL_LOG="${DETAIL_LOG:-0}"
LIGHTLLM_LOG_LEVEL="${LIGHTLLM_LOG_LEVEL:-info}"
LOG_DIR="${LOG_DIR:-${EASI_ROOT}/logs}"
# ---------------------------------------------------------------------------
# 2) Integer validation
# ---------------------------------------------------------------------------
case "${DP}${TP}${LB_PORT}${BACKEND_BASE_PORT}" in *[!0-9]*)
echo "[error] DP, TP, LB_PORT, BACKEND_BASE_PORT must all be integers" >&2
exit 1
;;
esac
if [ "${DP}" -lt 1 ] || [ "${TP}" -lt 1 ]; then
echo "[error] DP and TP must be >= 1 (got DP=${DP} TP=${TP})" >&2
exit 1
fi
# When DP>1, LB_PORT must not overlap the backend range.
if [ "${DP}" -gt 1 ]; then
backend_end=$((BACKEND_BASE_PORT + 10 * (DP - 1)))
if [ "${LB_PORT}" -ge "${BACKEND_BASE_PORT}" ] && [ "${LB_PORT}" -le "${backend_end}" ] \
&& [ $(( (LB_PORT - BACKEND_BASE_PORT) % 10 )) -eq 0 ]; then
echo "[error] LB_PORT=${LB_PORT} collides with backend port range ${BACKEND_BASE_PORT}..${backend_end} (step 10)" >&2
echo " move LB_PORT or BACKEND_BASE_PORT so they don't overlap" >&2
exit 1
fi
fi
# ---------------------------------------------------------------------------
# 3) GPU sanity
# ---------------------------------------------------------------------------
NEED=$(( DP * TP ))
if [ -n "${GPUS:-}" ]; then
gpu_count="$(echo "${GPUS}" | awk -F, '{print NF}')"
if [ "${gpu_count}" != "${NEED}" ]; then
echo "[error] GPUS has ${gpu_count} entries but DP*TP=${NEED}" >&2
echo " provide exactly ${NEED} GPU IDs, or unset GPUS to auto-assign" >&2
exit 1
fi
IFS=',' read -r -a GPU_ARR <<< "${GPUS}"
else
if ! command -v nvidia-smi >/dev/null 2>&1; then
echo "[error] nvidia-smi not found — can't auto-detect GPUs. Set GPUS=..." >&2
exit 1
fi
avail="$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l | tr -d ' ')"
if [ "${avail}" -lt "${NEED}" ]; then
echo "[error] DP*TP = ${NEED} GPUs required but only ${avail} visible to nvidia-smi" >&2
echo " reduce DP or TP, or set GPUS=... to a subset of available GPUs" >&2
exit 1
fi
GPU_ARR=()
for i in $(seq 0 $((NEED - 1))); do GPU_ARR+=("${i}"); done
fi
# ---------------------------------------------------------------------------
# 4) Venv activation
# ---------------------------------------------------------------------------
if [ ! -d "${VENV_DIR}" ]; then
echo "[error] venv not found at ${VENV_DIR}" >&2
echo " run: bash evaluation/easi/scripts/setup.sh" >&2
exit 1
fi
if [ -z "${VIRTUAL_ENV:-}" ] || [ "${VIRTUAL_ENV}" != "${VENV_DIR}" ]; then
echo "[serve] activating ${VENV_DIR}"
# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
fi
# ---------------------------------------------------------------------------
# 5) Weight check (once, before any replica forks)
# ---------------------------------------------------------------------------
if [ ! -f "${MODEL_DIR}/config.json" ]; then
if [ "${NO_AUTO_DL:-0}" = "1" ]; then
echo "[error] ${MODEL_DIR}/config.json missing" >&2
echo " set NO_AUTO_DL=0 or run: bash evaluation/easi/scripts/download_weights.sh ${MODEL}" >&2
exit 1
fi
echo "[serve] config.json missing at ${MODEL_DIR} — downloading ${MODEL}..."
bash "${SCRIPT_DIR}/download_weights.sh" "${MODEL}"
if [ ! -f "${MODEL_DIR}/config.json" ]; then
echo "[error] download appears to have failed — still no ${MODEL_DIR}/config.json" >&2
echo " check HF_TOKEN, network, and org membership for SenseNova" >&2
exit 1
fi
fi
# ---------------------------------------------------------------------------
# 6) Pre-flight port probe
# ---------------------------------------------------------------------------
# Catches stale servers on any port we plan to bind. Avoids the
# "replica crash-binds, but zombie on port fakes LB health" failure mode.
port_in_use() {
local port="$1"
if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
exec 3<&-
return 0
fi
return 1
}
ports_to_check=()
if [ "${DP}" -eq 1 ]; then
# Single-instance mode: bind the canonical port directly.
ports_to_check+=("${LB_PORT}")
else
# Multi-replica mode: LB on canonical + each backend.
ports_to_check+=("${LB_PORT}")
for i in $(seq 0 $((DP - 1))); do
ports_to_check+=($((BACKEND_BASE_PORT + i * 10)))
done
fi
busy=()
for p in "${ports_to_check[@]}"; do
port_in_use "${p}" && busy+=("${p}")
done
if [ "${#busy[@]}" -gt 0 ]; then
echo "[error] port(s) already in use: ${busy[*]}" >&2
for p in "${busy[@]}"; do
echo " port ${p}:" >&2
ss -lntp 2>/dev/null | awk -v pat=":${p} " '$4 ~ pat {print " " $0}' >&2 \
|| lsof -i":${p}" 2>/dev/null | sed 's/^/ /' >&2 \
|| echo " (couldn't identify owner; try: sudo lsof -i:${p} or sudo ss -lntp | grep :${p})" >&2
done
echo "[error] stop the conflicting process(es) first, or override LB_PORT / BACKEND_BASE_PORT" >&2
exit 1
fi
# ---------------------------------------------------------------------------
# 7) Helper: launch one LightLLM instance (blocking unless backgrounded by caller)
# ---------------------------------------------------------------------------
launch_lightllm() {
local gpus="$1" port="$2"
local extra=()
[ -n "${REASONING}" ] && extra+=(--reasoning_parser "${REASONING}")
[ "${DETAIL_LOG}" = "1" ] && extra+=(--detail_log)
env CUDA_VISIBLE_DEVICES="${gpus}" \
LIGHTLLM_LOG_LEVEL="${LIGHTLLM_LOG_LEVEL}" \
python -m lightllm.server.api_server \
--model_dir "${MODEL_DIR}" \
--model_name "${MODEL_NAME}" \
--model_owner "sensenova" \
--host "${HOST}" \
--port "${port}" \
--tp "${TP}" \
--max_req_total_len "${MAX_LEN}" \
--mem_fraction "${MEM_FRAC}" \
--trust_remote_code \
"${extra[@]}"
}
# ---------------------------------------------------------------------------
# 8) Dispatch: single-instance or multi-replica + LB
# ---------------------------------------------------------------------------
if [ "${DP}" -eq 1 ]; then
# --- single instance, direct on LB_PORT ---
mkdir -p "${LOG_DIR}"
PIDFILE="${LOG_DIR}/serve.${MODEL}.pids"
set -m
# Self-heal stale PIDs from prior run (same as multi-replica path).
if [ -f "${PIDFILE}" ]; then
stale="$(cat "${PIDFILE}" 2>/dev/null | tr '\n' ' ')"
for pg in ${stale}; do
[ -z "${pg}" ] && continue
kill -TERM -"${pg}" 2>/dev/null || true
done
sleep 1
for pg in ${stale}; do
[ -z "${pg}" ] && continue
kill -KILL -"${pg}" 2>/dev/null || true
done
pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
rm -f "${PIDFILE}"
fi
echo "[serve] model=${MODEL_NAME} dir=${MODEL_DIR}"
echo "[serve] GPUS=${GPUS:-$(IFS=,; echo "${GPU_ARR[*]}")} TP=${TP} port=${LB_PORT} reasoning=${REASONING:-off} detail_log=${DETAIL_LOG} log_level=${LIGHTLLM_LOG_LEVEL}"
_gpus_csv="$(IFS=,; echo "${GPU_ARR[*]}")"
( launch_lightllm "${_gpus_csv}" "${LB_PORT}" ) &
LIGHTLLM_PID=$!
echo "${LIGHTLLM_PID}" > "${PIDFILE}"
_cleaning_up=0
cleanup() {
[ "${_cleaning_up}" = "1" ] && return
_cleaning_up=1
echo ""
echo "[serve] shutting down..."
kill -TERM -"${LIGHTLLM_PID}" 2>/dev/null || kill -TERM "${LIGHTLLM_PID}" 2>/dev/null || true
local waited=0
while [ "${waited}" -lt 10 ]; do
kill -0 "${LIGHTLLM_PID}" 2>/dev/null || break
sleep 1; waited=$((waited + 1))
done
kill -KILL -"${LIGHTLLM_PID}" 2>/dev/null || true
kill -KILL "${LIGHTLLM_PID}" 2>/dev/null || true
pkill -KILL -P $$ 2>/dev/null || true
pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
rm -f "${PIDFILE}"
echo "[serve] cleanup done"
}
trap cleanup EXIT INT TERM HUP
wait "${LIGHTLLM_PID}"
exit $?
fi
# --- multi-replica + LB ---
mkdir -p "${LOG_DIR}"
PIDFILE="${LOG_DIR}/serve.${MODEL}.pids"
# Enable job control so each backgrounded subshell gets its own process group
# (pgid == subshell pid). We kill the whole group on cleanup — catches every
# LightLLM worker / router / zmq / visual server / detokenizer child.
set -m
# ---------------------------------------------------------------------------
# Self-heal: if a previous run left a PID file, try to clean its processes
# ---------------------------------------------------------------------------
if [ -f "${PIDFILE}" ]; then
echo "[serve] stale PID file found: ${PIDFILE}"
stale_pgs="$(cat "${PIDFILE}" 2>/dev/null | tr '\n' ' ')"
for pg in ${stale_pgs}; do
[ -z "${pg}" ] && continue
# Negative PID = signal entire process group
kill -TERM -"${pg}" 2>/dev/null || true
done
sleep 1
for pg in ${stale_pgs}; do
[ -z "${pg}" ] && continue
kill -KILL -"${pg}" 2>/dev/null || true
done
# Belt + suspenders: kill any lightllm server still tied to this MODEL_DIR
pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
rm -f "${PIDFILE}"
echo "[serve] stale processes cleaned; continuing..."
fi
echo "[serve] multi-replica mode: DP=${DP} TP=${TP} total_gpus=${NEED}"
echo "[serve] GPU assignment: ${GPU_ARR[*]}"
echo "[serve] backend ports (step 10): $(for i in $(seq 0 $((DP-1))); do echo -n "$((BACKEND_BASE_PORT + i*10)) "; done)"
echo "[serve] LB listens on :${LB_PORT} (canonical port for ${MODEL})"
echo "[serve] replica logs: ${LOG_DIR}/"
echo "[serve] reasoning=${REASONING:-off} detail_log=${DETAIL_LOG} log_level=${LIGHTLLM_LOG_LEVEL}"
REPLICA_PIDS=() # pid of each backgrounded subshell (== pgid with job control)
BACKENDS=()
for i in $(seq 0 $((DP - 1))); do
port=$((BACKEND_BASE_PORT + i * 10))
start=$((i * TP))
end=$((start + TP - 1))
gpus=""
for j in $(seq "${start}" "${end}"); do
gpus="${gpus}${gpus:+,}${GPU_ARR[$j]}"
done
log="${LOG_DIR}/lightllm-${MODEL}-${port}.log"
echo "[serve] launching replica $((i+1))/${DP}: GPUS=${gpus} PORT=${port} log=${log}"
# With `set -m` above, each backgrounded subshell gets its own pgid (== $!).
# LightLLM's fork-spawned children inherit that pgid, so `kill -- -$pid` on
# cleanup hits the whole tree (router, tp workers, visual server, zmq, ...).
( launch_lightllm "${gpus}" "${port}" ) >"${log}" 2>&1 &
REPLICA_PIDS+=("$!")
BACKENDS+=("http://localhost:${port}")
done
BACKENDS_CSV="$(IFS=,; echo "${BACKENDS[*]}")"
# Persist PIDs/PGIDs for self-heal on next run.
{ for pid in "${REPLICA_PIDS[@]}"; do echo "${pid}"; done; } > "${PIDFILE}"
# ---------------------------------------------------------------------------
# Hardened cleanup: SIGTERM process groups → wait → SIGKILL → pattern sweep
# ---------------------------------------------------------------------------
_cleaning_up=0
cleanup() {
[ "${_cleaning_up}" = "1" ] && return
_cleaning_up=1
echo ""
echo "[serve] shutting down..."
local all_pids=("${REPLICA_PIDS[@]}" "${LB_PID:-}")
# 1) SIGTERM the entire process group of each replica + LB.
for pid in "${all_pids[@]}"; do
[ -z "${pid}" ] && continue
kill -TERM -"${pid}" 2>/dev/null \
|| kill -TERM "${pid}" 2>/dev/null || true
done
# 2) Poll up to 10s for graceful exit.
local waited=0
while [ "${waited}" -lt 10 ]; do
local alive=0
for pid in "${all_pids[@]}"; do
[ -z "${pid}" ] && continue
kill -0 "${pid}" 2>/dev/null && { alive=1; break; }
done
[ "${alive}" = "0" ] && break
sleep 1
waited=$((waited + 1))
done
# 3) SIGKILL any survivors (groups, then individual pids).
for pid in "${all_pids[@]}"; do
[ -z "${pid}" ] && continue
kill -KILL -"${pid}" 2>/dev/null || true
kill -KILL "${pid}" 2>/dev/null || true
done
# 4) Belt + suspenders: sweep any remaining children of this shell and any
# lightllm procs tied to our MODEL_DIR (catches orphans that escaped pg).
pkill -KILL -P $$ 2>/dev/null || true
pkill -KILL -f "lightllm\.server\.api_server.*${MODEL_DIR}" 2>/dev/null || true
pkill -KILL -f "${SCRIPT_DIR}/lb.py" 2>/dev/null || true
rm -f "${PIDFILE}"
echo "[serve] cleanup done"
}
# EXIT covers normal exit + unexpected errors (set -e); INT/TERM/HUP covers signals.
trap cleanup EXIT INT TERM HUP
echo "[serve] starting LB (backends: ${BACKENDS_CSV})"
( env BACKENDS="${BACKENDS_CSV}" LB_PORT="${LB_PORT}" python "${SCRIPT_DIR}/lb.py" ) &
LB_PID=$!
echo "${LB_PID}" >> "${PIDFILE}"
wait "${LB_PID}"
# EXIT trap fires cleanup
#!/usr/bin/env bash
# Deprecated: merged into serve.sh. serve.sh now handles both DP=1 (direct)
# and DP>1 (multi-replica + LB) based on the DP env var.
#
# This shim forwards to serve.sh with a warning. Update your commands:
#
# Old: DP=4 TP=2 bash evaluation/easi/scripts/serve_lb.sh
# New: DP=4 TP=2 bash evaluation/easi/scripts/serve.sh
set -euo pipefail
echo "[serve_lb] DEPRECATED: use serve.sh instead — it now supports DP>1 natively." >&2
echo "[serve_lb] forwarding to serve.sh with the same env..." >&2
exec bash "$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/serve.sh" "$@"
#!/usr/bin/env bash
# Set up everything needed to run SenseNova-U1 visual understanding benchmarks:
# LightLLM serving stack + EASI benchmark client + VLMEvalKit endpoint
# registration.
#
# Idempotent: safe to re-run. Each step checks whether it already ran.
#
# What it does (full run):
# 1. Sanity-checks host prereqs (uv, libnuma, nvidia driver). [lightllm]
# 2. Initializes submodules (LightLLM, EASI, EASI/VLMEvalKit, EASI/lmms-eval).
# 3. Creates .venv-lightllm/ Python 3.10 venv at the repo root. [lightllm]
# 4. Installs pinned LightLLM deps (strips unbuildable nixl + cchardet). [lightllm]
# 5. Installs vllm 0.16.0 + LightLLM editable + pandas. [lightllm]
# 6. Applies local patches from evaluation/easi/lightllm-stack/patches/. [lightllm]
# 7. Verifies LightLLM imports + api_server CLI. [lightllm]
# 8. Installs EASI benchmark client venv (delegates to EASI's own setup.sh,
# which creates evaluation/easi/EASI/.venv with Python 3.11 and installs
# both VLMEvalKit and lmms-eval backends + flash-attn).
# 9. Registers SenseNova-U1 endpoints in VLMEvalKit:
# - copies evaluation/easi/config/sensenova_models.py into
# evaluation/easi/EASI/VLMEvalKit/vlmeval/sensenova_models.py
# - applies evaluation/easi/patches/easi_sensenova_config.patch (7-line hook).
# Edit config/sensenova_models.py then re-run this script to propagate.
# Point `api_base` at your own endpoint (localhost, docker host, remote infra
# team endpoint, etc.) — nothing in EASI itself assumes the server is local.
#
# Flags:
# --skip-lightllm skip steps 1, 3-7 — DON'T install the LightLLM serving stack
# (use when you already have a SenseNova-U1 OpenAI-compatible
# endpoint elsewhere: docker, remote infra team, etc.)
# --skip-easi skip step 8 (EASI client venv install — slow, builds flash-attn)
# --skip-register skip step 9 (VLMEvalKit endpoint registration)
#
# Usage:
# bash evaluation/easi/scripts/setup.sh # full install
# bash evaluation/easi/scripts/setup.sh --skip-lightllm # bring your own endpoint
# bash evaluation/easi/scripts/setup.sh --skip-easi # lightllm only
# bash evaluation/easi/scripts/setup.sh --skip-register # no config.py patch
set -euo pipefail
SKIP_LIGHTLLM=0
SKIP_EASI=0
SKIP_REGISTER=0
for arg in "$@"; do
case "${arg}" in
--skip-lightllm) SKIP_LIGHTLLM=1 ;;
--skip-easi) SKIP_EASI=1 ;;
--skip-register) SKIP_REGISTER=1 ;;
-h|--help)
sed -n '2,/^set -euo/p' "${BASH_SOURCE[0]}" | sed -e '$d' -e 's/^# \{0,1\}//'
exit 0
;;
*)
echo "[error] unknown flag: ${arg}" >&2
exit 1
;;
esac
done
if [ "${SKIP_LIGHTLLM}" = "1" ] && [ "${SKIP_EASI}" = "1" ] && [ "${SKIP_REGISTER}" = "1" ]; then
echo "[error] --skip-lightllm + --skip-easi + --skip-register leaves nothing to do" >&2
exit 1
fi
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
EASI_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
REPO_ROOT="$(cd "${EASI_ROOT}/../.." && pwd)"
STACK_DIR="${EASI_ROOT}/lightllm-stack"
LIGHTLLM_DIR="${STACK_DIR}/LightLLM"
VENV_DIR="${REPO_ROOT}/.venv-lightllm"
PATCHES_DIR="${STACK_DIR}/patches"
REQ_OUT="/tmp/lightllm-req.txt"
EASI_DIR="${EASI_ROOT}/EASI"
EASI_VENV="${EASI_DIR}/.venv"
EASI_VLMEVAL_DIR="${EASI_DIR}/VLMEvalKit"
EASI_PATCHES_DIR="${EASI_ROOT}/patches"
EASI_CONFIG_DIR="${EASI_ROOT}/config"
log() { echo "[setup] $*"; }
err() { echo "[error] $*" >&2; }
# -------------------------------------------------------------------------
# 1) Host prereqs (LightLLM-side only)
# -------------------------------------------------------------------------
if [ "${SKIP_LIGHTLLM}" = "1" ]; then
log "skipping LightLLM setup (--skip-lightllm)"
log " (assumes you already have a SenseNova-U1 OpenAI-compatible endpoint —"
log " point config/sensenova_models.py at it before running step 9)"
# Still need uv for the EASI venv delegation below, and git to init EASI submodule.
if [ "${SKIP_EASI}" = "0" ] && ! command -v uv >/dev/null 2>&1; then
err "uv not found. Install: https://docs.astral.sh/uv/getting-started/installation/"
exit 1
fi
else
log "checking host prereqs..."
if ! command -v uv >/dev/null 2>&1; then
err "uv not found. Install: https://docs.astral.sh/uv/getting-started/installation/"
exit 1
fi
if ! ldconfig -p 2>/dev/null | grep -q libnuma.so.1; then
err "libnuma.so.1 not found. Install system package:"
err " sudo apt-get install -y libnuma1 libnuma-dev"
exit 1
fi
if ! command -v nvidia-smi >/dev/null 2>&1; then
err "nvidia-smi not found. NVIDIA driver required."
exit 1
fi
driver_ok="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1 | awk -F. '{print ($1 >= 550)}')"
if [ "${driver_ok}" != "1" ]; then
err "NVIDIA driver < 550.x detected. torch 2.9.1+cu128 requires >= 550."
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
exit 1
fi
fi
# -------------------------------------------------------------------------
# 2) Initialize submodules
# -------------------------------------------------------------------------
# With --skip-lightllm we only need EASI (+ its nested VLMEvalKit/lmms-eval).
# Otherwise init everything recursively.
if [ "${SKIP_LIGHTLLM}" = "1" ]; then
log "initializing EASI submodule (skipping LightLLM)..."
(cd "${REPO_ROOT}" && git submodule update --init --recursive evaluation/easi/EASI)
else
log "initializing submodules (recursive)..."
(cd "${REPO_ROOT}" && git submodule update --init --recursive)
if [ ! -f "${LIGHTLLM_DIR}/setup.py" ]; then
err "LightLLM submodule still empty after init: ${LIGHTLLM_DIR}"
err " run: cd ${REPO_ROOT} && git submodule status"
exit 1
fi
fi
# -------------------------------------------------------------------------
# 3) Create LightLLM venv (.venv-lightllm)
# -------------------------------------------------------------------------
if [ "${SKIP_LIGHTLLM}" = "0" ]; then
if [ ! -d "${VENV_DIR}" ]; then
log "creating Python 3.10 venv at ${VENV_DIR}..."
uv venv -p 3.10 "${VENV_DIR}"
else
log "venv already exists at ${VENV_DIR}"
fi
# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
uv pip install --quiet --upgrade pip
fi
# -------------------------------------------------------------------------
# 4-7) LightLLM deps, vllm + editable install, patches, verify
# -------------------------------------------------------------------------
if [ "${SKIP_LIGHTLLM}" = "0" ]; then
log "filtering upstream requirements (dropping nixl, cchardet)..."
grep -v "^nixl\|^cchardet" "${LIGHTLLM_DIR}/requirements.txt" > "${REQ_OUT}"
# Skip reinstall if lightllm and its key dependencies already import cleanly.
# Also verify the editable install points at the current LIGHTLLM_DIR (not a
# stale pre-move path).
lightllm_path="$(python -c "import lightllm, os; print(os.path.dirname(os.path.dirname(lightllm.__file__)))" 2>/dev/null || echo "")"
if python -c "import lightllm, vllm, torch, flashinfer, sgl_kernel, xformers, pandas" >/dev/null 2>&1 \
&& [ "${lightllm_path}" = "${LIGHTLLM_DIR}" ]; then
log "all required packages already installed — skipping pip phase"
else
if [ -n "${lightllm_path}" ] && [ "${lightllm_path}" != "${LIGHTLLM_DIR}" ]; then
log "existing lightllm install points at stale path ${lightllm_path} — reinstalling"
fi
log "installing LightLLM requirements (torch 2.9.1+cu128, flashinfer, sgl-kernel, xformers, ...)"
uv pip install -r "${REQ_OUT}"
log "installing vllm 0.16.0..."
uv pip install --no-cache-dir vllm==0.16.0
log "installing LightLLM (editable)..."
uv pip install --no-cache-dir -e "${LIGHTLLM_DIR}"
log "installing missing transitive deps (pandas)..."
uv pip install --quiet pandas
fi
if [ -d "${PATCHES_DIR}" ] && ls "${PATCHES_DIR}"/*.patch >/dev/null 2>&1; then
log "applying patches in ${PATCHES_DIR}..."
for p in "${PATCHES_DIR}"/*.patch; do
name="$(basename "${p}")"
if (cd "${LIGHTLLM_DIR}" && git apply --reverse --check "${p}" >/dev/null 2>&1); then
log " ${name}: already applied"
continue
fi
if (cd "${LIGHTLLM_DIR}" && git apply --check "${p}" >/dev/null 2>&1); then
(cd "${LIGHTLLM_DIR}" && git apply "${p}")
log " ${name}: applied"
else
err " ${name}: patch does not apply cleanly — file may have drifted upstream"
err " inspect manually: cd ${LIGHTLLM_DIR} && git apply --check ${p}"
exit 1
fi
done
else
log "no patches to apply"
fi
log "verifying imports..."
python - <<'PY'
import torch, flashinfer, sgl_kernel, xformers, vllm, lightllm
print(f" torch {torch.__version__} cuda={torch.cuda.is_available()}")
print(f" vllm {vllm.__version__}")
print(f" flashinfer ok")
print(f" sgl_kernel ok")
print(f" xformers ok")
print(f" lightllm ok")
PY
log "verifying LightLLM CLI..."
python -m lightllm.server.api_server --help | head -3
fi
# -------------------------------------------------------------------------
# 8) Install EASI benchmark client venv (separate from .venv-lightllm)
# -------------------------------------------------------------------------
# Delegates to EASI's own setup.sh, which creates EASI/.venv with Python 3.11
# and installs -e VLMEvalKit -e lmms-eval + pinned deps + flash-attn.
if [ "${SKIP_EASI}" = "1" ]; then
log "skipping EASI client install (--skip-easi)"
elif [ -d "${EASI_VENV}" ] && "${EASI_VENV}/bin/python" -c "import vlmeval" >/dev/null 2>&1; then
log "EASI client venv already has vlmeval — skipping install"
else
log "installing EASI client deps (creates ${EASI_VENV}, may take several minutes)..."
log " (delegating to ${EASI_DIR}/scripts/setup.sh)"
# Deactivate LightLLM venv if active so EASI's setup.sh creates its own cleanly.
deactivate 2>/dev/null || true
(cd "${EASI_DIR}" && bash scripts/setup.sh)
# Re-activate LightLLM venv only if we set it up earlier.
if [ "${SKIP_LIGHTLLM}" = "0" ] && [ -d "${VENV_DIR}" ]; then
# shellcheck disable=SC1091
source "${VENV_DIR}/bin/activate"
fi
fi
# -------------------------------------------------------------------------
# 9) Register local LightLLM endpoints with EASI's VLMEvalKit
# -------------------------------------------------------------------------
# Two-step wire-up, both idempotent:
# (a) copy evaluation/easi/config/sensenova_models.py into VLMEvalKit/vlmeval/
# (b) apply evaluation/easi/patches/easi_sensenova_config.patch to
# VLMEvalKit/vlmeval/config.py so it imports those entries and updates
# supported_VLM at module load.
# Edit the endpoint/port/max_tokens values in config/sensenova_models.py
# then re-run this script to propagate.
if [ "${SKIP_REGISTER}" = "1" ]; then
log "skipping endpoint registration (--skip-register)"
elif [ ! -d "${EASI_VLMEVAL_DIR}" ]; then
log "skipping endpoint registration (VLMEvalKit submodule not initialized)"
else
log "registering SenseNova-U1 endpoints in VLMEvalKit..."
# (a) copy the editable config module
src="${EASI_CONFIG_DIR}/sensenova_models.py"
dst="${EASI_VLMEVAL_DIR}/vlmeval/sensenova_models.py"
if [ ! -f "${src}" ]; then
err " missing ${src} — can't register endpoints"
exit 1
fi
cp -f "${src}" "${dst}"
log " copied sensenova_models.py -> ${dst}"
# (b) apply the config.py hook patch (idempotent)
patch_file="${EASI_PATCHES_DIR}/easi_sensenova_config.patch"
if [ ! -f "${patch_file}" ]; then
err " missing ${patch_file}"
exit 1
fi
if (cd "${EASI_VLMEVAL_DIR}" && git apply --reverse --check "${patch_file}" >/dev/null 2>&1); then
log " easi_sensenova_config.patch: already applied"
elif (cd "${EASI_VLMEVAL_DIR}" && git apply --check "${patch_file}" >/dev/null 2>&1); then
(cd "${EASI_VLMEVAL_DIR}" && git apply "${patch_file}")
log " easi_sensenova_config.patch: applied"
else
err " easi_sensenova_config.patch does not apply cleanly — inspect manually:"
err " cd ${EASI_VLMEVAL_DIR} && git apply --check ${patch_file}"
exit 1
fi
fi
log "done."
log ""
log "next steps:"
if [ "${SKIP_LIGHTLLM}" = "1" ]; then
log " - point config/sensenova_models.py at YOUR endpoint, then propagate:"
log " \$EDITOR evaluation/easi/config/sensenova_models.py"
log " bash evaluation/easi/scripts/setup.sh --skip-lightllm --skip-easi"
log ""
log " - run a benchmark:"
log " source evaluation/easi/EASI/.venv/bin/activate"
log " cd evaluation/easi/EASI"
log " python scripts/submissions/run_easi_eval.py --model SenseNova-U1-8B-MoT-Local --benchmarks blink"
else
log " - launch server (weights auto-downloaded on first call):"
log " bash evaluation/easi/scripts/serve.sh # 8b-mot → port 8000"
log " # or multi-replica DP (same script, DP env flips mode):"
log " DP=4 TP=2 bash evaluation/easi/scripts/serve.sh"
log ""
log " - run a benchmark (from a second shell, after server is up):"
log " source evaluation/easi/EASI/.venv/bin/activate"
log " cd evaluation/easi/EASI"
log " python scripts/submissions/run_easi_eval.py --model SenseNova-U1-8B-MoT-Local --benchmarks blink"
fi
from __future__ import annotations

import argparse
import json
import re
import sys
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from statistics import mean
from typing import Any

if __package__ in {None, ""}:
    repo_root = Path(__file__).resolve().parents[3]
    sys.path.insert(0, str(repo_root))

from evaluation.gen.bizgeneval.eval_prompt import EVAL_GENERATION_PROMPTS as EVAL_PROMPTS
from evaluation.gen.common.judge import JudgeClient

try:
    from tqdm import tqdm
except ImportError:
    tqdm = None

DEFAULT_DATA_PATH = Path(__file__).resolve().parent / "data" / "test.jsonl"
ERROR_ALPHA = 0.1

# Reference:
# BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
# https://arxiv.org/abs/2603.25732


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Evaluate BizGenEval images with Gemini/OpenAI-compatible judge.")
    parser.add_argument("--image-dir", required=True, help="Directory containing generated BizGenEval images.")
    parser.add_argument("--output-dir", required=True, help="Directory to save per-item results and summary.")
    parser.add_argument("--data-path", default=str(DEFAULT_DATA_PATH), help="BizGenEval JSONL path.")
    parser.add_argument("--api-base", required=True)
    parser.add_argument("--api-key", required=True)
    parser.add_argument("--judge-model", required=True)
    parser.add_argument("--timeout", type=int, default=240)
    parser.add_argument("--concurrency", type=int, default=8)
    parser.add_argument("--force-rerun", action="store_true")
    return parser.parse_args()


def _to_bool(value: Any) -> bool | None:
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"true", "1", "yes"}:
            return True
        if text in {"false", "0", "no"}:
            return False
    if isinstance(value, int) and value in {0, 1}:
        return bool(value)
    return None


def _strip_json_fence(text: str) -> str:
    content = (text or "").strip()
    if content.startswith("```json"):
        content = content[7:]
    elif content.startswith("```"):
        content = content[3:]
    if content.endswith("```"):
        content = content[:-3]
    return content.strip()


def _parse_json_safe(text: str) -> dict[str, Any]:
    content = _strip_json_fence(text)
    try:
        return json.loads(content)
    except Exception:
        pass
    match = re.search(r"\{.*\}", content, flags=re.DOTALL)
    if match:
        extracted = match.group(0)
        try:
            return json.loads(extracted)
        except Exception:
            pass
        normalized = re.sub(r"\bTrue\b", "true", extracted)
        normalized = re.sub(r"\bFalse\b", "false", normalized)
        normalized = re.sub(r"\bNone\b", "null", normalized)
        try:
            return json.loads(normalized)
        except Exception:
            pass
    raise ValueError("failed to parse judge JSON")


def _extract_results_only(raw_text: str, n_questions: int) -> dict[str, Any] | None:
    if not isinstance(raw_text, str) or n_questions <= 0:
        return None
    matches = re.findall(
        r'"result"\s*:\s*(true|false|True|False|"true"|"false"|1|0)',
        raw_text,
        flags=re.DOTALL,
    )
    if len(matches) < n_questions:
        return None
    parsed: dict[str, Any] = {}
    for idx, match in enumerate(matches[:n_questions], start=1):
        val = _to_bool(str(match).strip().strip('"'))
        if val is None:
            return None
        parsed[str(idx)] = {"result": val}
    return parsed


def _format_checklist(questions: list[str]) -> str:
    return "\n".join(f"{idx}. {question}" for idx, question in enumerate(questions, start=1))


def _render_prompt(user_template: str, questions: list[str]) -> str:
    kwargs = {"checklist": _format_checklist(questions)}
    if "{expected_count}" in user_template:
        kwargs["expected_count"] = len(questions)
    if "{required_keys}" in user_template:
        kwargs["required_keys"] = ", ".join(str(i) for i in range(1, len(questions) + 1))
    return user_template.format(**kwargs)


def _compute_item_scores(
    item: dict[str, Any],
    meta_info: dict[str, dict[str, Any]],
    error_alpha: float,
    qidxs_key: str | None = None,
) -> dict[str, float | int]:
    questions = item.get("questions", [])
    qidxs = list(item.get(qidxs_key, []) or []) if qidxs_key else list(range(1, len(questions) + 1))
    if not qidxs:
        return {"accuracy": 0.0, "error_score": 0.0, "errors": 0, "n_questions": 0}
    errors = 0
    for qidx in qidxs:
        if meta_info.get(str(qidx), {}).get("result") is not True:
            errors += 1
    n_questions = len(qidxs)
    accuracy = (n_questions - errors) / n_questions
    error_score = max(0.0, 1.0 - error_alpha * errors)
    return {
        "accuracy": accuracy,
        "error_score": error_score,
        "errors": errors,
        "n_questions": n_questions,
    }


def _aggregate(records: list[dict[str, Any]], group_key: str) -> dict[str, dict[str, float | int]]:
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        grouped[str(record[group_key])].append(record)
    summary: dict[str, dict[str, float | int]] = {}
    for key, items in sorted(grouped.items()):
        summary[key] = {
            "count": len(items),
            "accuracy": mean([float(item["accuracy"]) for item in items]) if items else 0.0,
            "error_score": mean([float(item["error_score"]) for item in items]) if items else 0.0,
        }
    return summary


def _load_items(data_path: Path) -> list[dict[str, Any]]:
    items = []
    with data_path.open(encoding="utf-8") as f:
        for prompt_id, line in enumerate(f):
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            item["_prompt_id"] = prompt_id
            items.append(item)
    return items


def _resolve_image_path(image_dir: Path, prompt_id: int) -> Path | None:
    direct = image_dir / f"{prompt_id:04d}.png"
    if direct.exists():
        return direct
    repeat0 = image_dir / f"{prompt_id:04d}_0.png"
    if repeat0.exists():
        return repeat0
    return None


def _is_complete(path: Path) -> bool:
    if not path.exists():
        return False
    try:
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        meta = data.get("meta_info") or {}
        n_questions = int(data.get("n_questions", 0))
        return (
            data.get("accuracy") is not None
            and len(meta) == n_questions
            and all(isinstance(v, dict) and "result" in v for v in meta.values())
        )
    except Exception:
        return False


def _record_error(prompt_id: int, exc: Exception) -> None:
    print(f"[warn] prompt_id={prompt_id} failed: {type(exc).__name__}: {exc}")


def eval_one(
    item: dict[str, Any],
    *,
    image_dir: Path,
    items_dir: Path,
    client: JudgeClient,
    error_alpha: float,
    force_rerun: bool,
    write_lock: threading.Lock,
) -> dict[str, Any] | None:
    prompt_id = int(item["_prompt_id"])
    dataset_id = item.get("id")
    image_path = _resolve_image_path(image_dir, prompt_id)
    if image_path is None:
        print(f"[warn] missing image for prompt_id={prompt_id}")
        return None

    cache_path = items_dir / f"{prompt_id:04d}.json"
    if not force_rerun and _is_complete(cache_path):
        with cache_path.open(encoding="utf-8") as f:
            cached = json.load(f)
        cached["_cached"] = True
        return cached

    eval_tag = str(item.get("eval_tag") or item.get("dimension") or "").strip()
    if eval_tag not in EVAL_PROMPTS:
        print(f"[warn] skip prompt_id={prompt_id}: unsupported eval_tag={eval_tag!r}")
        return None
    questions = list(item.get("questions") or [])
    if not questions:
        print(f"[warn] skip prompt_id={prompt_id}: empty questions")
        return None

    system_prompt, user_template = EVAL_PROMPTS[eval_tag]
    raw_output = client.judge_image_text(
        image_path=image_path,
        system_prompt=system_prompt,
        user_prompt=_render_prompt(user_template, questions),
    ).strip()
    try:
        parsed = _parse_json_safe(raw_output)
    except Exception:
        parsed = _extract_results_only(raw_output, len(questions)) or {}

    meta_info: dict[str, dict[str, Any]] = {}
    for idx, question in enumerate(questions, start=1):
        rec = parsed.get(str(idx)) if isinstance(parsed, dict) else None
        if not isinstance(rec, dict):
            meta_info[str(idx)] = {
                "result": False,
                "raw_description": question,
                "reason": "missing_from_output",
            }
            continue
        val = _to_bool(rec.get("result"))
        meta_info[str(idx)] = {
            "result": bool(val) if isinstance(val, bool) else False,
            "raw_description": question,
            "reason": rec.get("reason", ""),
        }

    overall = _compute_item_scores(item, meta_info, error_alpha)
    easy = _compute_item_scores(item, meta_info, error_alpha * 2, "easy_qidxs")
    hard = _compute_item_scores(item, meta_info, error_alpha * 2, "hard_qidxs")
    result = {
        "prompt_id": prompt_id,
        "dataset_id": dataset_id,
        "domain": item.get("domain", ""),
        "dimension": item.get("dimension", ""),
        "eval_tag": eval_tag,
        "prompt": item.get("prompt", ""),
        "image_path": str(image_path),
        "n_questions": len(questions),
        "accuracy": overall["accuracy"],
        "error_score": overall["error_score"],
        "easy_accuracy": easy["accuracy"],
        "easy_error_score": easy["error_score"],
        "hard_accuracy": hard["accuracy"],
        "hard_error_score": hard["error_score"],
        "meta_info": meta_info,
        "raw_model_response": raw_output,
    }
    with write_lock:
        with cache_path.open("w", encoding="utf-8") as f:
            json.dump(result, f, ensure_ascii=False, indent=2)
    return result


def main() -> None:
    args = _parse_args()
    image_dir = Path(args.image_dir).resolve()
    output_dir = Path(args.output_dir).resolve()
    items_dir = output_dir / "items"
    output_dir.mkdir(parents=True, exist_ok=True)
    items_dir.mkdir(parents=True, exist_ok=True)

    client = JudgeClient(
        api_base=args.api_base,
        api_key=args.api_key,
        model=args.judge_model,
        timeout=args.timeout,
    )
    items = _load_items(Path(args.data_path).resolve())
    print(
        f"[bizgeneval] items={len(items)} concurrency={args.concurrency} "
        f"judge_model={client.model} image_dir={image_dir}"
    )

    write_lock = threading.Lock()
    tasks = list(items)
    results: list[dict[str, Any] | None] = []

    def _result_status(result: dict[str, Any] | None) -> str:
        if result is None:
            return "skipped"
        if result.get("_cached"):
            return "cached"
        return "done"

    max_workers = max(1, args.concurrency)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(
                eval_one,
                item,
                image_dir=image_dir,
                items_dir=items_dir,
                client=client,
                error_alpha=ERROR_ALPHA,
                force_rerun=args.force_rerun,
                write_lock=write_lock,
            ): item
            for item in tasks
        }
        total = len(futures)
        progress = tqdm(total=total, desc="bizgeneval eval", dynamic_ncols=True) if tqdm else None
        try:
            for done, future in enumerate(as_completed(futures), start=1):
                item = futures[future]
                try:
                    result = future.result()
                except Exception as exc:
                    _record_error(int(item["_prompt_id"]), exc)
                    result = None
                results.append(result)
                status = _result_status(result)
                if progress is not None:
                    progress.update(1)
                    progress.set_postfix_str(f"prompt_id={item['_prompt_id']} {status}")
                else:
                    print(f"[{done}/{total}] prompt_id={item['_prompt_id']} {status}")
        finally:
            if progress is not None:
                progress.close()

    valid_results = [result for result in results if result]
    if not valid_results:
        raise RuntimeError("BizGenEval produced no valid evaluation results.")

    final_records = [
        {
            "prompt_id": int(result["prompt_id"]),
            "dataset_id": result["dataset_id"],
            "domain": result["domain"],
            "dimension": result["dimension"],
            "accuracy": float(result["accuracy"]),
            "error_score": float(result["error_score"]),
            "easy_accuracy": float(result["easy_accuracy"]),
            "easy_error_score": float(result["easy_error_score"]),
            "hard_accuracy": float(result["hard_accuracy"]),
            "hard_error_score": float(result["hard_error_score"]),
            "n_questions": int(result["n_questions"]),
        }
        for result in sorted(valid_results, key=lambda item: int(item["prompt_id"]))
    ]
    by_domain = _aggregate(final_records, "domain")
    by_dimension = _aggregate(final_records, "dimension")
    overall_accuracy = mean([record["accuracy"] for record in final_records]) if final_records else 0.0
    overall_error_score = mean([record["error_score"] for record in final_records]) if final_records else 0.0
    easy_accuracy = mean([record["easy_accuracy"] for record in final_records]) if final_records else 0.0
    easy_error_score = mean([record["easy_error_score"] for record in final_records]) if final_records else 0.0
    hard_accuracy = mean([record["hard_accuracy"] for record in final_records]) if final_records else 0.0
    hard_error_score = mean([record["hard_error_score"] for record in final_records]) if final_records else 0.0

    summary = {
        "benchmark": "bizgeneval",
        "data_path": str(Path(args.data_path).resolve()),
        "eval_provider": "judge_client",
        "judge_model": client.model,
        "error_alpha": ERROR_ALPHA,
        "easy_hard_error_alpha": ERROR_ALPHA * 2,
        "items": len(final_records),
        "overall_accuracy": overall_accuracy,
        "overall_error_score": overall_error_score,
        "easy_accuracy": easy_accuracy,
        "easy_error_score": easy_error_score,
        "hard_accuracy": hard_accuracy,
        "hard_error_score": hard_error_score,
        "by_domain": by_domain,
        "by_dimension": by_dimension,
        "records": final_records,
    }
    summary_path = output_dir / "bizgeneval_summary.json"
    with summary_path.open("w", encoding="utf-8") as f:
        json.dump(summary, f, ensure_ascii=False, indent=2)
    print(
        f"[bizgeneval] overall_error_score={overall_error_score:.4f} "
        f"overall_accuracy={overall_accuracy:.4f} "
        f"easy_error_score={easy_error_score:.4f} "
        f"hard_error_score={hard_error_score:.4f}"
    )
    print(f"[bizgeneval] summary saved -> {summary_path}")


if __name__ == "__main__":
    main()
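# Illustrative sketch (not part of the original script): the per-item scoring
# in the evaluation script above reduces to two numbers, an accuracy (fraction
# of checklist questions judged True) and an error score that subtracts a fixed
# penalty per failed question and is clamped at zero. The helper name
# `score_item` and the sample verdicts are hypothetical; the formula and the
# 0.1 / 0.2 penalties mirror ERROR_ALPHA and ERROR_ALPHA * 2 from that script.

```python
def score_item(results: list[bool], error_alpha: float = 0.1) -> dict[str, float]:
    """Score one image from its per-question judge verdicts (hypothetical helper)."""
    if not results:
        # Mirrors the script's behaviour for an empty question subset.
        return {"accuracy": 0.0, "error_score": 0.0, "errors": 0, "n_questions": 0}
    # Anything other than a literal True counts as an error.
    errors = sum(1 for ok in results if ok is not True)
    n_questions = len(results)
    return {
        "accuracy": (n_questions - errors) / n_questions,
        # Linear penalty per error, never below zero.
        "error_score": max(0.0, 1.0 - error_alpha * errors),
        "errors": errors,
        "n_questions": n_questions,
    }


if __name__ == "__main__":
    verdicts = [True, True, False, True, False]  # made-up judge output
    overall = score_item(verdicts, error_alpha=0.1)
    subset = score_item(verdicts, error_alpha=0.2)  # easy/hard subsets use 2x alpha
    print(f"accuracy={overall['accuracy']:.2f} error_score={overall['error_score']:.2f}")
    print(f"subset error_score={subset['error_score']:.2f}")
```

# Because the penalty is per error rather than per fraction, an item with ten
# or more failed questions scores 0.0 at alpha = 0.1 regardless of checklist length.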
# Reference:
# Evaluation prompt adapted from BizGenEval:
# BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
# https://arxiv.org/abs/2603.25732
ATTRIBUTE_PROMPT_SYSTEM_V2 = """
You are an expert visual attribute evaluator.
Your task is to determine whether each given description is true or false based strictly on the provided image.
Evaluation Rules:
1. Base your judgment ONLY on visible evidence in the image.
2. Do NOT rely on assumptions, common sense, or inferred intent.
3. If the attribute is clearly satisfied, return True.
4. If the attribute is clearly violated, return False.
5. If the attribute cannot be determined with certainty from the image, return False.
6. Be strict about:
- Exact colors (approximate matches are ok for similar colors, e.g., dark gray vs black, light blue vs blue, but not for distinct colors like red vs green)
- Exact counts
- Relative sizes and proportions
- Shape types
- Line styles (solid, dashed, dotted)
- Font types (if distinguishable)
# ⚠️ Output Format (Strict JSON Only)
Your output must be valid JSON and follow this structure exactly:
{
"1": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise visual evidence explanation>"
},
"2": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise visual evidence explanation>"
}
}
Requirements:
- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
- Do not include extra commentary or speculation.
- Output valid JSON only. No extra fields. Do not output anything outside JSON.
"""
ATTRIBUTE_USER_TEMPLATE_V2 = """
Evaluate the following descriptions based on the image:
{checklist}
"""
LAYOUT_PROMPT_SYSTEM_V1 = """
You are an expert layout evaluator.
Your task is to determine whether each given description is true or false based strictly on the provided image.
Evaluation Rules:
1. Base your judgment ONLY on visible spatial evidence in the image.
2. Do NOT rely on assumptions, common sense, or inferred intent.
3. If the layout relationship is clearly satisfied, return True.
4. If the layout relationship is clearly violated, return False.
5. Be strict about:
- Relative position (above, below, left, right)
- Arrangement direction (horizontal, vertical, grid)
- Section hierarchy (header at top, footer at bottom, sidebar on left)
- Alignment (left-aligned, centered, right-aligned)
- Grouping and containment (elements inside a container)
- Discrete structural counts (two columns, three stacked cards)
# ⚠️ Output Format (Strict JSON Only)
Your output must be valid JSON and follow this structure exactly:
{
"1": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise spatial evidence explanation>"
},
"2": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise spatial evidence explanation>"
}
}
Requirements:
- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
- Do not include extra commentary or speculation.
- Output valid JSON only. No extra fields. Do not output anything outside JSON.
"""
LAYOUT_USER_TEMPLATE = """
Evaluate the following layout descriptions based on the image:
{checklist}
"""
TEXT_EVALUATION_PROMPT_SYSTEM_V1 = """
You are an expert character-level text evaluator.
Your task is to determine whether each given description is true or false based strictly on the provided image and its textual content.
Evaluation Rules:
1. Base your judgment ONLY on **visible text in the image**, including all letters, numbers, symbols, punctuation, and whitespace.
2. Do NOT rely on assumptions, common sense, or inferred intent.
3. For each description:
- If the text in the specified block exactly matches the description, return True.
- If there is any mismatch (even a single character, symbol, number, or space), return False.
4. Be strict about:
- Exact character match (case-sensitive, punctuation-sensitive, spacing-sensitive)
- Formulas and scientific symbols (Greek letters, superscripts/subscripts, operators)
- Numbers and table values
- Entire text block content (paragraph, list, table row/column, formula)
- Absolute position and context (as specified in the description)
# ⚠️ Output Format (Strict JSON Only)
Your output must be valid JSON and follow this structure exactly:
{
"1": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise text-based evidence explanation>"
},
"2": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise text-based evidence explanation>"
}
}
Requirements:
- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
- Do not include extra commentary or speculation.
- Output valid JSON only. No extra fields. Do not output anything outside JSON.
"""
TEXT_EVALUATION_PROMPT_SYSTEM_V2 = """
You are an expert character-level text evaluator.
Your task is to determine whether each given description is true or false based strictly on the provided image and its textual content.
Evaluation Rules:
1. Base your judgment ONLY on **visible text in the image**, including all letters, numbers, symbols, punctuation, and whitespace.
2. Do NOT rely on assumptions, common sense, or inferred intent.
3. For each description:
- If the text in the specified block exactly matches the description, return True.
- If there is any mismatch within the core sentence or word content (even a single character, symbol, or number inside a word or sentence), return False.
- Minor formatting differences at the boundaries (e.g., leading bullet points such as "-" or "•", and a trailing period ".", "?", "!") should be ignored and still considered True.
4. Be strict about:
- Exact character match (case-sensitive, punctuation-sensitive, spacing-sensitive)
- Formulas and scientific symbols (Greek letters, superscripts/subscripts, operators)
- Numbers and table values
- Entire text block content (paragraph, list, table row/column, formula)
- Absolute position and context (as specified in the description)
# ⚠️ Output Format (Strict JSON Only)
Your output must be valid JSON and follow this structure exactly:
{
"1": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise text-based evidence explanation>"
},
"2": {
"result": True/False,
"raw_description": "<original question>",
"reason": "<concise text-based evidence explanation>"
}
}
Requirements:
- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
- Do not include extra commentary or speculation.
- Output valid JSON only. No extra fields. Do not output anything outside JSON.
"""
TEXT_EVALUATION_USER_TEMPLATE = """
Evaluate the following text descriptions based on the image content and absolute block positions:
{checklist}
"""
KNOWLEDGE_PROMPT_SYSTEM_V1 = """
You are an expert factual-and-reasoning evaluator for chart/diagram/poster/webpage/slides images.
Your task is to determine whether each given Yes/No checklist question is true or false based on the provided image.
Evaluation Rules:
1. Judge using visible image evidence plus standard domain knowledge (math, physics, chemistry, history, engineering, etc.).
2. For each question:
- Return True only if the statement is clearly correct.
- Return False if it is incorrect, inconsistent, implausible, or not verifiable from the image.
3. Be strict about:
- Numeric correctness (arithmetic, units, ranges, proportions)
- Equation correctness (balance, signs, symbols, consistency with text/chart)
- Cross-panel/internal consistency (chart vs table vs annotation vs diagram)
- Historical/scientific plausibility
4. If content is missing/ambiguous/illegible, return False.
5. Do not give partial credit.
# Output Format (Strict JSON Only)
Your output must be valid JSON and follow this structure exactly:
{
"1": {
"result": True/False,
"raw_question": "<original question>",
"reason": "<concise evidence-based explanation>"
},
"2": {
"result": True/False,
"raw_question": "<original question>",
"reason": "<concise evidence-based explanation>"
}
}
Requirements:
- Keep reasons short and evidence-based, citing exact characters, lines, or cells when possible.
- Do not include extra commentary or speculation.
- Output valid JSON only. No extra fields. Do not output anything outside JSON.
"""
KNOWLEDGE_USER_TEMPLATE_V1 = """
Evaluate the following Yes/No knowledge/reasoning questions based on the image:
{checklist}
"""
CHART_USER_TEMPLATE = """
Evaluate the following chart statements based on the image:
{checklist}
Strict output coverage requirement:
- There are exactly {expected_count} statements above.
- Return a JSON object containing ALL keys from 1 to {expected_count} (no missing indices).
- Required keys: {required_keys}
"""
EVAL_GENERATION_PROMPTS = {
    "attribute": (ATTRIBUTE_PROMPT_SYSTEM_V2, ATTRIBUTE_USER_TEMPLATE_V2),
    "layout": (LAYOUT_PROMPT_SYSTEM_V1, LAYOUT_USER_TEMPLATE),
    "text": (TEXT_EVALUATION_PROMPT_SYSTEM_V2, TEXT_EVALUATION_USER_TEMPLATE),
    "knowledge": (KNOWLEDGE_PROMPT_SYSTEM_V1, KNOWLEDGE_USER_TEMPLATE_V1),
    "text1": (TEXT_EVALUATION_PROMPT_SYSTEM_V1, TEXT_EVALUATION_USER_TEMPLATE),
}
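The system prompts above ask the judge for `"result": True/False`, which is Python-style rather than strict JSON, so a plain `json.loads` may reject a reply that follows the prompt literally. The sketch below shows one tolerant way to read such a reply. `parse_checklist_verdicts` and `_to_bool` are hypothetical helpers written for illustration, not functions from this repository, and the evaluation script's actual parsing is not shown here.

```python
from __future__ import annotations

import ast
import json


def _to_bool(value: object) -> bool:
    """Accept real booleans and the strings "true"/"false" (any case)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.strip().lower() in {"true", "false"}:
        return value.strip().lower() == "true"
    raise ValueError(f"unrecognised verdict: {value!r}")


def parse_checklist_verdicts(raw: str) -> dict[str, bool]:
    """Map each checklist index in a judge reply to its verdict (hypothetical helper)."""
    # Keep only the outermost {...} in case the judge wrapped it in extra text.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    body = raw[start : end + 1]
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        # Python-style True/False is not valid JSON; literal_eval accepts it.
        parsed = ast.literal_eval(body)
    if not isinstance(parsed, dict):
        raise ValueError("judge reply is not an object keyed by checklist index")
    return {str(key): _to_bool(entry["result"]) for key, entry in parsed.items()}


if __name__ == "__main__":
    reply = '{"1": {"result": True, "reason": "matches"}, "2": {"result": False, "reason": "typo"}}'
    print(parse_checklist_verdicts(reply))
```

Trying `json.loads` first keeps strictly valid replies on the fast path, and `ast.literal_eval` only evaluates Python literals, so it does not execute arbitrary code from the judge's output.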
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Any

# When run as a plain script (no package context), put the repo root and its
# src/ directory on sys.path before the project imports below.
if __package__ in {None, ""}:
    repo_root = Path(__file__).resolve().parents[3]
    sys.path.insert(0, str(repo_root))
    sys.path.insert(0, str(repo_root / "src"))

import torch

import sensenova_u1
from examples.t2i.inference import SenseNovaU1T2I

DEFAULT_DATA_PATH = Path(__file__).resolve().parent / "data" / "test.jsonl"
DEFAULT_ASPECT_RATIO = "1:1"
RATIO_LONG_SIDE = 2048
DEFAULT_ATTN_BACKEND = "auto"


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate BizGenEval images with SenseNova-U1.")
    parser.add_argument("--model-path", required=True, help="Local checkpoint path or HF model id.")
    parser.add_argument("--output-dir", required=True, help="Directory to save generated images.")
    parser.add_argument("--data-path", default=str(DEFAULT_DATA_PATH), help="BizGenEval JSONL path.")
    parser.add_argument("--device", default="cuda")
    parser.add_argument("--dtype", default="bfloat16", choices=["bfloat16", "float16", "float32"])
    parser.add_argument("--cfg-scale", type=float, default=4.0)
    parser.add_argument("--cfg-norm", default="none", choices=["none", "global", "channel", "cfg_zero_star"])
    parser.add_argument("--timestep-shift", type=float, default=3.0)
    parser.add_argument("--num-steps", type=int, default=50)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--attn-backend", default=DEFAULT_ATTN_BACKEND, choices=["auto", "flash", "sdpa"])
    return parser.parse_args()


def _load_items(*, data_path: Path) -> list[dict[str, Any]]:
    items = []
    with data_path.open(encoding="utf-8") as f:
        # prompt_id is the zero-based line index in the file; blank lines are
        # skipped but still consume an index.
        for prompt_id, line in enumerate(f):
            if not line.strip():
                continue
            item = json.loads(line)
            item["_prompt_id"] = prompt_id
            items.append(item)
    return items


def _resolve_dtype(name: str) -> torch.dtype:
    return {
        "bfloat16": torch.bfloat16,
        "float16": torch.float16,
        "float32": torch.float32,
    }[name]


def _round_by_factor(number: float, factor: int) -> int:
    # Snap to the nearest multiple of `factor`.
    return round(number / factor) * factor


def _parse_ratio(value: object) -> tuple[int, int] | None:
    # Accepts "WxH" or "W:H" strings, or a (W, H) sequence; returns None otherwise.
    try:
        if isinstance(value, str):
            lower = value.lower()
            if "x" in lower:
                rw, rh = [int(x) for x in lower.split("x", 1)]
            elif ":" in value:
                rw, rh = [int(x) for x in value.split(":", 1)]
            else:
                return None
        elif isinstance(value, (list, tuple)) and len(value) >= 2:
            rw, rh = int(value[0]), int(value[1])
        else:
            return None
    except Exception:
        return None
    if rw <= 0 or rh <= 0:
        return None
    return rw, rh


def _dims_from_ratio_long_side(
    ratio: tuple[int, int],
    long_side: int,
    factor: int = 32,
) -> tuple[int, int]:
    # The longer side is pinned to `long_side`; the shorter side is scaled by
    # the ratio. Both are rounded to a multiple of `factor`.
    rw, rh = ratio
    if rw >= rh:
        width = max(factor, _round_by_factor(long_side, factor))
        height = max(factor, _round_by_factor(long_side * rh / rw, factor))
    else:
        height = max(factor, _round_by_factor(long_side, factor))
        width = max(factor, _round_by_factor(long_side * rw / rh, factor))
    return int(width), int(height)


def _resolve_image_size(item: dict[str, Any], *, default_aspect_ratio: str, ratio_long_side: int) -> tuple[int, int]:
    ratio = _parse_ratio(item.get("aspect_ratio")) or _parse_ratio(default_aspect_ratio)
    if ratio is None:
        raise ValueError(
            f"Failed to resolve aspect ratio for prompt_id={item.get('_prompt_id')}: "
            f"aspect_ratio={item.get('aspect_ratio')!r}, default_aspect_ratio={default_aspect_ratio!r}"
        )
    return _dims_from_ratio_long_side(ratio, ratio_long_side)


def main() -> None:
    args = _parse_args()
    data_path = Path(args.data_path).resolve()
    output_dir = Path(args.output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    items = _load_items(data_path=data_path)
    print(f"[bizgeneval] loaded {len(items)} items from {data_path}")

    sensenova_u1.set_attn_backend(args.attn_backend)
    print(f"[bizgeneval] attn_backend={args.attn_backend!r} effective={sensenova_u1.effective_attn_backend()!r}")

    engine = SenseNovaU1T2I(
        model_path=args.model_path,
        device=args.device,
        dtype=_resolve_dtype(args.dtype),
    )

    generated = 0
    skipped = 0
    for item in items:
        prompt_id = int(item["_prompt_id"])
        width, height = _resolve_image_size(
            item,
            default_aspect_ratio=DEFAULT_ASPECT_RATIO,
            ratio_long_side=RATIO_LONG_SIDE,
        )
        out_path = output_dir / f"{prompt_id:04d}.png"
        if out_path.exists():
            skipped += 1
            continue

        images = engine.generate(
            item["prompt"],
            image_size=(width, height),
            cfg_scale=args.cfg_scale,
            cfg_norm=args.cfg_norm,
            timestep_shift=args.timestep_shift,
            num_steps=args.num_steps,
            batch_size=1,
            seed=args.seed,
        )
        images[0].save(out_path)
        generated += 1
        print(
            f"[saved] prompt_id={prompt_id} "
            f"size={width}x{height} domain={item.get('domain')} "
            f"dimension={item.get('dimension')} -> {out_path}"
        )

    print(f"[bizgeneval] done: items={len(items)} generated={generated} skipped={skipped} output_dir={output_dir}")


if __name__ == "__main__":
    main()
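To make the sizing rule above concrete, the following standalone sketch re-implements the same arithmetic (long side fixed at 2048, both sides snapped to a multiple of 32) and prints the resulting sizes for a few aspect ratios. The function names here are illustrative copies without the leading underscore, and the ratios are example inputs chosen for this sketch, not values taken from the benchmark data.

```python
from __future__ import annotations


def round_by_factor(number: float, factor: int) -> int:
    """Snap `number` to the nearest multiple of `factor`."""
    return round(number / factor) * factor


def dims_from_ratio_long_side(
    ratio: tuple[int, int],
    long_side: int = 2048,
    factor: int = 32,
) -> tuple[int, int]:
    """Return (width, height) with the longer side pinned to `long_side`."""
    rw, rh = ratio
    if rw >= rh:
        width = max(factor, round_by_factor(long_side, factor))
        height = max(factor, round_by_factor(long_side * rh / rw, factor))
    else:
        height = max(factor, round_by_factor(long_side, factor))
        width = max(factor, round_by_factor(long_side * rw / rh, factor))
    return int(width), int(height)


if __name__ == "__main__":
    # Example ratios only; the real values come from each item's "aspect_ratio".
    for ratio in [(1, 1), (16, 9), (3, 4), (3, 2)]:
        width, height = dims_from_ratio_long_side(ratio)
        print(f"{ratio[0]}:{ratio[1]} -> {width}x{height}")
    # 1:1  -> 2048x2048
    # 16:9 -> 2048x1152
    # 3:4  -> 1536x2048
    # 3:2  -> 2048x1376  (2048 * 2 / 3 = 1365.33, snapped to 43 * 32)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a short side that lands exactly halfway between two multiples of 32 may round down rather than up.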