# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The key criterion for this evaluation is human preference, but collecting human annotations is costly.

To explore the model's subjective capabilities, we employ JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

Popular evaluation methods include:

- Compare Mode: compare model responses pairwise and calculate each model's win rate (a minimal win-rate sketch is shown below).
- Score Mode: score each model response individually ([Chatbot Arena](https://chat.lmsys.org/)).

We support using GPT-4 (or another JudgeLLM) to perform subjective evaluation of models based on the methods above.
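
As a concrete illustration of Compare Mode, the sketch below computes per-model win rates from a list of pairwise judge verdicts. The data layout and field names are hypothetical and purely illustrative; OpenCompass computes these metrics internally through its summarizers.

```python
# Hypothetical illustration of the Compare Mode metric: given pairwise judge
# verdicts, compute each model's win rate (a tie counts as half a win).
from collections import defaultdict

# Each record names the two compared models and the judge's decision
# ('A', 'B', or 'tie'). The records below are made up for illustration.
verdicts = [
    {'model_a': 'qwen-7b-chat', 'model_b': 'chatglm3-6b', 'winner': 'A'},
    {'model_a': 'qwen-7b-chat', 'model_b': 'chatglm3-6b', 'winner': 'tie'},
    {'model_a': 'chatglm3-6b', 'model_b': 'qwen-7b-chat', 'winner': 'A'},
]

wins = defaultdict(float)
games = defaultdict(int)
for v in verdicts:
    a, b = v['model_a'], v['model_b']
    games[a] += 1
    games[b] += 1
    if v['winner'] == 'A':
        wins[a] += 1
    elif v['winner'] == 'B':
        wins[b] += 1
    else:  # tie
        wins[a] += 0.5
        wins[b] += 0.5

for model, total in games.items():
    print(f'{model}: win rate = {wins[model] / total:.2f}')
```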

## Subjective Evaluation with Custom Dataset

The specific process includes:

1. Data preparation
2. Model response generation
3. JudgeLLM evaluation of the responses
4. Metric calculation from the JudgeLLM's output

### Step-1: Data Preparation

We provide mini test sets for **Compare Mode** and **Score Mode**, as shown below:

```python
###COREV2
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },...]

###CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家,我让你给谁发送什么主题的邮件,你就帮我扩充好邮件正文,并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题,来斟酌用词,并使用合适的敬语。现在请给导师发送邮件,询问他是否可以下周三下午15:00进行科研同步会,大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
```

The JSON must include the following fields:

- 'question': Question description.
- 'capability': The capability dimension of the question.
- 'others': Other information needed for the question.

If you want to customize the prompt for each individual question, you can put the extra information into 'others' and use it when constructing the prompt.
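
For reference, here is a minimal sketch that writes such a dataset file with the required fields; the file name and the single entry are placeholders to adapt to your own data.

```python
# A minimal sketch for writing a custom subjective dataset file with the
# required fields; the file name and entry below are placeholders.
import json

entries = [
    {
        'question': 'If I toss a ball straight up into the air, which direction does it initially travel?',
        'capability': 'knowledge-common sense',
        'others': {
            'question': 'If I toss a ball straight up into the air, which direction does it initially travel?',
            'evaluating_guidance': '',
            'reference_answer': 'up',
        },
    },
]

with open('my_subjective_dataset.json', 'w', encoding='utf-8') as f:
    json.dump(entries, f, ensure_ascii=False, indent=4)
```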

### Step-2: Evaluation Configuration (Compare Mode)

For `config/eval_subjective_compare.py`, we provide some annotations to help users understand the configuration file.

```python
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

from opencompass.partitioners import NaivePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import Corev2Summarizer

with read_base():
    # Pre-defined models
    from .models.qwen.hf_qwen_7b_chat import models as hf_qwen_7b_chat
    from .models.chatglm.hf_chatglm3_6b import models as hf_chatglm3_6b
    from .models.qwen.hf_qwen_14b_chat import models as hf_qwen_14b_chat
    from .models.openai.gpt_4 import models as gpt4_model
    from .datasets.subjective_cmp.subjective_corev2 import subjective_datasets

# Evaluation datasets
datasets = [*subjective_datasets]

# Models to be evaluated
models = [*hf_qwen_7b_chat, *hf_chatglm3_6b]

# Inference configuration
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',  # m models vs. n models
        # Under the m2n setting, base_models and compare_models must be specified;
        # the program generates comparison pairs between base_models and compare_models.
        base_models=[*hf_qwen_14b_chat],  # baseline models
        compare_models=[*hf_qwen_7b_chat, *hf_chatglm3_6b],  # models to be evaluated
    ),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=gpt4_model,  # judge model
        )),
)

work_dir = './outputs/subjective/'

summarizer = dict(
    type=Corev2Summarizer,  # custom summarizer
    match_method='smart',   # answer extraction method
)
```

In addition, you can change the response order of the two models; please refer to `config/eval_subjective_compare.py`. When `infer_order` is set to `random`, the responses are presented to the judge in random order; when `infer_order` is set to `double`, each pair of responses is evaluated twice, once in each order.
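
To see why doubling the order helps, the toy sketch below (plain Python, not the OpenCompass API) judges a pair in both presentation orders and averages the two outcomes, which cancels any position bias the judge may have.

```python
# Toy illustration (not the OpenCompass API): judging a pair in both orders
# and averaging the outcomes cancels the judge's position bias.
def judge(first_answer: str, second_answer: str) -> float:
    """Toy judge: returns 1.0 if the first answer wins, else 0.0.

    A real JudgeLLM may systematically favour one position; evaluating
    both orders and averaging removes that bias from the final score.
    """
    return 1.0 if len(first_answer) >= len(second_answer) else 0.0


def score_a_vs_b(answer_a: str, answer_b: str) -> float:
    """Score of model A against model B, averaged over both presentation orders."""
    a_shown_first = judge(answer_a, answer_b)         # A in the first position
    a_shown_second = 1.0 - judge(answer_b, answer_a)  # A in the second position
    return (a_shown_first + a_shown_second) / 2


print(score_a_vs_b('a long, detailed answer', 'short'))  # 1.0 with this toy judge
```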

### Step-2: Evaluation Configuration (Score Mode)

The configuration in `config/eval_subjective_score.py` is largely the same as `config/eval_subjective_compare.py`; you only need to change the eval mode to `singlescore`, as sketched below.
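
A minimal sketch of the corresponding `eval` block is shown below; everything else (imports, `models`, `datasets`, `infer`) mirrors the compare-mode configuration above, and `config/eval_subjective_score.py` remains the authoritative version.

```python
# Sketch only: the eval block in score mode. The surrounding imports, models,
# datasets and infer settings mirror the compare-mode configuration above.
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',  # score each response individually instead of comparing pairs
    ),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=gpt4_model,  # judge model
        )),
)
```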

### Step-3: Launch the Evaluation

```shell
python run.py config/eval_subjective_score.py -r
```

The `-r` flag reuses existing model inference and GPT-4 evaluation results from a previous run.

The JudgeLLM's responses will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`, and the evaluation report will be output to `output/.../summary/timestamp/report.csv`.
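
If you want to inspect the report programmatically, a minimal sketch is shown below; the path is a placeholder that you should replace with the actual `report.csv` generated under your `work_dir`.

```python
# A minimal sketch for printing the summarizer report; replace the placeholder
# path with the actual report.csv generated under your work_dir and timestamp.
import csv

report_path = 'path/to/summary/report.csv'  # placeholder

with open(report_path, newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)
```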

## Practice: AlignBench Evaluation

### Dataset

```bash
mkdir -p ./data/subjective/

cd ./data/subjective
git clone https://github.com/THUDM/AlignBench.git
cd AlignBench

# data format conversion
python ../../../tools/convert_alignmentbench.py --mode json --jsonl data/data_release.jsonl

```

### Configuration

Please edit the config `configs/eval_subjective_alignbench.py` according to your needs.

### Evaluation

```bash
HF_EVALUATE_OFFLINE=1 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python run.py configs/eval_subjective_alignbench.py
```

### Submit to the Official Leaderboard (Optional)

If you need to submit your predictions to the official leaderboard, you can use `tools/convert_alignmentbench.py` for format conversion.

- Make sure you have the following results

```bash
outputs/
└── 20231214_173632
    ├── configs
    ├── logs
    ├── predictions # model responses
    ├── results
    └── summary
```

- Convert the data

```bash
python tools/convert_alignmentbench.py --mode csv --exp-folder outputs/20231214_173632
```

- Get the `.csv` file in `submission/` for submission

```bash
outputs/
└── 20231214_173632
    ├── configs
    ├── logs
    ├── predictions
    ├── results
    ├── submission # files for submission
    └── summary
```