understanding.md 5.78 KB
Newer Older
raojy's avatar
first  
raojy committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
# Visual Understanding Evaluation

Reproduction guide for SenseNova-U1 on visual understanding benchmarks.

The pipeline is built on top of [EvalScope](https://github.com/OpenSenseNova/evalscope/tree/neo) (Native backend). EvalScope calls the model through an OpenAI-compatible HTTP endpoint and, for open-ended benchmarks, scores the predictions with an LLM judge.

Reference config and launcher live under `evaluation/understanding/`:

- `evaluation/understanding/config.yaml` — evaluation configuration
- `evaluation/understanding/es.py` — single-entry launcher

## 1. Overview

```
┌──────────────┐     OpenAI-compatible     ┌─────────────┐
│  es.py       │ ───── HTTP requests ────▶ │ Model API   │
│  (EvalScope) │                           │ (lightllm)  │
└──────┬───────┘ ◀──── generations ─────── └─────────────┘


   results/                 (predictions, judge scores, aggregated metrics)
```

1. Deploy SenseNova-U1 behind an OpenAI-compatible endpoint (the reference setup uses lightllm).
2. Fill in `config.yaml` with the endpoint, model name, datasets, and generation parameters.
3. Run `python es.py` — it calls `evalscope.run.run_task(task_cfg="config.yaml")`, which loops over the datasets, issues requests in parallel, and writes predictions plus scores under `results/`.

## 2. Launcher

`evaluation/understanding/es.py` is deliberately tiny:

```python
from evalscope.run import run_task

run_task(task_cfg="config.yaml")
```

Everything else is driven by `config.yaml`.

## 3. Benchmarks

The reference run evaluates on:

- `mmmu_pro`
- `mmlu_pro`
- `mm_bench`
- `ai2d`
- `math_vista`
- `ifeval`

Add or remove items under `datasets:` to extend the evaluation.

## 4. Main Generation Parameters

These are the parameters the model is sampled with. They live under `generation_config:` in `config.yaml` and are forwarded to the OpenAI-compatible API.

| Parameter | Value | Meaning |
| --- | --- | --- |
| `stream` | `false` | Return the full response in one shot; simpler to score and log. |
| `temperature` | `0.6` | Sampling temperature — recommended setting for thinking-enabled models. |
| `top_p` | `0.95` | Nucleus sampling cutoff; used together with `top_k`. |
| `max_tokens` | `32768` | Upper bound on generated tokens per sample. Large because `<think>…</think>` traces can be long. |
| `timeout` | `300` | Per-request timeout in seconds. |
| `extra_body.top_k` | `20` | Restrict sampling to the top-20 tokens at each step. |
| `extra_body.repetition_penalty` | `1.05` | Mild penalty to suppress loops in long reasoning traces. |
| `extra_body.chat_template_kwargs.enable_thinking` | `true` | Let the chat template emit a `<think>…</think>` section before the final answer. |

Post-processing on the prediction:

- `dataset_args.remove_until: </think>` — everything up to and including the closing `</think>` tag is stripped before grading, so only the final answer is scored.
- `ignore_errors: true` — transient single-sample API failures do not abort the whole run.

## 5. Judge Model

Open-ended benchmarks are scored by an LLM judge.

| Field | Value |
| --- | --- |
| `judge_worker_num` | `64` (parallel judge calls) |
| `judge_model_args.model_id` | `gpt-4o-mini-2024-07-18` |
| `judge_model_args.api_key` | *(fill in)* |
| `judge_model_args.api_url` | *(fill in — OpenAI-compatible judge endpoint)* |
| `judge_model_args.generation_config.max_tokens` | `4096` |
| `judge_model_args.generation_config.timeout` | `300` |

The reference `config.yaml` leaves the judge `api_key` / `api_url` blank; fill them before running judge-dependent tasks.

## 6. Runtime Settings

| Field | Value | Meaning |
| --- | --- | --- |
| `eval_backend` | `Native` | EvalScope native backend. |
| `eval_type` | `openai_api` | Drive the model through an OpenAI-compatible endpoint. |
| `eval_batch_size` | `64` | In-flight concurrent requests sent to the model server. |
| `api_url` | `http://<host>:8000/v1/` | OpenAI-compatible serving endpoint (lightllm in the reference setup). |
| `model` | `SenseNova-U1` | Model name as exposed by the serving endpoint. |
| `use_cache` | `results/` | Reuse previously generated answers — supports resume / retry. |
| `work_dir` | `results/` | Output root for predictions, judgments, and scores. |
| `no_timestamp` | `true` | Write into a stable directory (plays well with `use_cache`). |

## 7. Reference `config.yaml`

```yaml
eval_backend: Native
eval_type: openai_api
eval_batch_size: 64
api_url: http://<host>:8000/v1/   # lightllm deployment
model: SenseNova-U1
datasets:
  - mmmu_pro
  - mmlu_pro
  - mm_bench
  - ai2d
  - math_vista
  - ifeval
dataset_args:
  remove_until: </think>
ignore_errors: true
generation_config:
  stream: false
  temperature: 0.6
  timeout: 300
  max_tokens: 32768
  top_p: 0.95
  extra_body:
    top_k: 20
    repetition_penalty: 1.05
    chat_template_kwargs:
      enable_thinking: true

judge_worker_num: 64
judge_model_args:
  api_key: ""
  api_url: ""
  model_id: gpt-4o-mini-2024-07-18
  generation_config:
    max_tokens: 4096
    timeout: 300
use_cache: results/
work_dir: results/
no_timestamp: true
```

## 8. Running the Evaluation

1. Deploy SenseNova-U1 on an OpenAI-compatible endpoint and confirm connectivity:

   ```bash
   curl -sSf -m 5 "$api_url"
   ```

2. Edit `evaluation/understanding/config.yaml`: set `api_url`, `model`, and the judge `api_key` / `api_url` if needed.
3. Launch:

   ```bash
   cd evaluation/understanding
   python es.py
   ```

Predictions, judge outputs, and final scores are written under `results/`. Because `use_cache: results/` and `no_timestamp: true` are set, rerunning the command skips already-answered samples, so interrupting and resuming is safe.