# Visual Understanding Evaluation Reproduction guide for SenseNova-U1 on visual understanding benchmarks. The pipeline is built on top of [EvalScope](https://github.com/OpenSenseNova/evalscope/tree/neo) (Native backend). EvalScope calls the model through an OpenAI-compatible HTTP endpoint and, for open-ended benchmarks, scores the predictions with an LLM judge. Reference config and launcher live under `evaluation/understanding/`: - `evaluation/understanding/config.yaml` — evaluation configuration - `evaluation/understanding/es.py` — single-entry launcher ## 1. Overview ``` ┌──────────────┐ OpenAI-compatible ┌─────────────┐ │ es.py │ ───── HTTP requests ────▶ │ Model API │ │ (EvalScope) │ │ (lightllm) │ └──────┬───────┘ ◀──── generations ─────── └─────────────┘ │ ▼ results/ (predictions, judge scores, aggregated metrics) ``` 1. Deploy SenseNova-U1 behind an OpenAI-compatible endpoint (the reference setup uses lightllm). 2. Fill in `config.yaml` with the endpoint, model name, datasets, and generation parameters. 3. Run `python es.py` — it calls `evalscope.run.run_task(task_cfg="config.yaml")`, which loops over the datasets, issues requests in parallel, and writes predictions plus scores under `results/`. ## 2. Launcher `evaluation/understanding/es.py` is deliberately tiny: ```python from evalscope.run import run_task run_task(task_cfg="config.yaml") ``` Everything else is driven by `config.yaml`. ## 3. Benchmarks The reference run evaluates on: - `mmmu_pro` - `mmlu_pro` - `mm_bench` - `ai2d` - `math_vista` - `ifeval` Add or remove items under `datasets:` to extend the evaluation. ## 4. Main Generation Parameters These are the parameters the model is sampled with. They live under `generation_config:` in `config.yaml` and are forwarded to the OpenAI-compatible API. | Parameter | Value | Meaning | | --- | --- | --- | | `stream` | `false` | Return the full response in one shot; simpler to score and log. | | `temperature` | `0.6` | Sampling temperature — recommended setting for thinking-enabled models. | | `top_p` | `0.95` | Nucleus sampling cutoff; used together with `top_k`. | | `max_tokens` | `32768` | Upper bound on generated tokens per sample. Large because `` traces can be long. | | `timeout` | `300` | Per-request timeout in seconds. | | `extra_body.top_k` | `20` | Restrict sampling to the top-20 tokens at each step. | | `extra_body.repetition_penalty` | `1.05` | Mild penalty to suppress loops in long reasoning traces. | | `extra_body.chat_template_kwargs.enable_thinking` | `true` | Let the chat template emit a `` section before the final answer. | Post-processing on the prediction: - `dataset_args.remove_until: ` — everything up to and including the closing `` tag is stripped before grading, so only the final answer is scored. - `ignore_errors: true` — transient single-sample API failures do not abort the whole run. ## 5. Judge Model Open-ended benchmarks are scored by an LLM judge. | Field | Value | | --- | --- | | `judge_worker_num` | `64` (parallel judge calls) | | `judge_model_args.model_id` | `gpt-4o-mini-2024-07-18` | | `judge_model_args.api_key` | *(fill in)* | | `judge_model_args.api_url` | *(fill in — OpenAI-compatible judge endpoint)* | | `judge_model_args.generation_config.max_tokens` | `4096` | | `judge_model_args.generation_config.timeout` | `300` | The reference `config.yaml` leaves the judge `api_key` / `api_url` blank; fill them before running judge-dependent tasks. ## 6. Runtime Settings | Field | Value | Meaning | | --- | --- | --- | | `eval_backend` | `Native` | EvalScope native backend. | | `eval_type` | `openai_api` | Drive the model through an OpenAI-compatible endpoint. | | `eval_batch_size` | `64` | In-flight concurrent requests sent to the model server. | | `api_url` | `http://:8000/v1/` | OpenAI-compatible serving endpoint (lightllm in the reference setup). | | `model` | `SenseNova-U1` | Model name as exposed by the serving endpoint. | | `use_cache` | `results/` | Reuse previously generated answers — supports resume / retry. | | `work_dir` | `results/` | Output root for predictions, judgments, and scores. | | `no_timestamp` | `true` | Write into a stable directory (plays well with `use_cache`). | ## 7. Reference `config.yaml` ```yaml eval_backend: Native eval_type: openai_api eval_batch_size: 64 api_url: http://:8000/v1/ # lightllm deployment model: SenseNova-U1 datasets: - mmmu_pro - mmlu_pro - mm_bench - ai2d - math_vista - ifeval dataset_args: remove_until: ignore_errors: true generation_config: stream: false temperature: 0.6 timeout: 300 max_tokens: 32768 top_p: 0.95 extra_body: top_k: 20 repetition_penalty: 1.05 chat_template_kwargs: enable_thinking: true judge_worker_num: 64 judge_model_args: api_key: "" api_url: "" model_id: gpt-4o-mini-2024-07-18 generation_config: max_tokens: 4096 timeout: 300 use_cache: results/ work_dir: results/ no_timestamp: true ``` ## 8. Running the Evaluation 1. Deploy SenseNova-U1 on an OpenAI-compatible endpoint and confirm connectivity: ```bash curl -sSf -m 5 "$api_url" ``` 2. Edit `evaluation/understanding/config.yaml`: set `api_url`, `model`, and the judge `api_key` / `api_url` if needed. 3. Launch: ```bash cd evaluation/understanding python es.py ``` Predictions, judge outputs, and final scores are written under `results/`. Because `use_cache: results/` and `no_timestamp: true` are set, rerunning the command skips already-answered samples, so interrupting and resuming is safe.