# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The key criterion for this evaluation is human preference, but human annotation comes at a high cost.

To explore the model's subjective capabilities, we employ JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

A popular evaluation method compares model responses pairwise to calculate a win rate ([Chatbot Arena](https://chat.lmsys.org/)); another method assigns a score to a single model response.

We support using GPT-4 (or another JudgeLLM) for the subjective evaluation of models based on the above methods.
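
To make the pairwise protocol concrete, here is a minimal sketch of turning JudgeLLM verdicts into a win rate; the verdict structure and the tie-handling convention are illustrative assumptions, not OpenCompass APIs.

```python
# Minimal sketch (not part of OpenCompass): computing a pairwise win rate
# from JudgeLLM verdicts. The verdict format below is hypothetical.
from collections import Counter

verdicts = [
    {"model_a": "model_x", "model_b": "model_y", "winner": "model_x"},
    {"model_a": "model_x", "model_b": "model_y", "winner": "model_y"},
    {"model_a": "model_x", "model_b": "model_y", "winner": "tie"},
]

counts = Counter(v["winner"] for v in verdicts)
total = len(verdicts)
# One common convention: count a tie as half a win for each side.
win_rate_x = (counts["model_x"] + 0.5 * counts["tie"]) / total
print(f"model_x win rate: {win_rate_x:.2%}")
```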

## Data Preparation

We provide a demo test set as below:

```python
###COREV2
# (English gloss: "If I toss a ball straight up into the air, in which direction does it initially travel?";
#  capability: "Knowledge - common sense"; reference answer: "up")
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },...]

###CreationV0.1
# (English gloss: "Act as an email assistant: given a recipient and subject, draft a suitably worded and
#  polite email body. Write an email of about 200 words to my advisor asking whether a research sync
#  meeting can be held next Wednesday at 15:00."; capability: "Email notification")
[
    {
        "question": "请你扮演一个邮件管家,我让你给谁发送什么主题的邮件,你就帮我扩充好邮件正文,并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题,来斟酌用词,并使用合适的敬语。现在请给导师发送邮件,询问他是否可以下周三下午15:00进行科研同步会,大约200字。",
        "capability": "邮件通知",
        "others": ""
    },...]
```

The JSON must include the following fields:

- 'question': Question description
- 'capability': The capability dimension of the question.
- 'others': Other needed information.

If you want to modify the prompt for an individual question, you can fill the extra information into 'others' and use it to construct the prompt.
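
For instance, a small test set with these fields could be assembled and saved as follows; this is a sketch mirroring the CoreV2 sample above, and the output file name is a placeholder.

```python
# Minimal sketch: writing a demo test set with the required fields.
# The output file name is a placeholder; adjust it to your dataset layout.
import json

demo_set = [
    {
        "question": "If I toss a ball vertically into the air, which direction does it initially travel?",
        "capability": "knowledge-common sense",
        "others": {
            "question": "If I toss a ball vertically into the air, which direction does it initially travel?",
            "evaluating_guidance": "",
            "reference_answer": "up",
        },
    },
]

with open("demo_subjective_dataset.json", "w", encoding="utf-8") as f:
    json.dump(demo_set, f, ensure_ascii=False, indent=4)
```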

## Evaluation Configuration

The specific process includes:

1. Model response inference
2. JudgeLLM evaluation (pairwise comparison or single-response scoring)
3. Generating the evaluation report

### Two-Model Comparison Configuration

For `config/subjective_compare.py`, we provide annotations below to help users understand the meaning of the configuration file.

```python
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_corev2 import subjective_datasets

# import the partitioner used below (the exact import path may vary with your OpenCompass version)
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.summarizers import Corev2Summarizer

datasets = [*subjective_datasets]  # set the datasets
models = [...]  # set the models to be evaluated
judge_model = [...]  # set the JudgeLLM

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',  # choose the eval mode; in 'm2n' mode you need to set base_models and compare_models, and comparison pairs will be generated between them
        base_models=[...],
        compare_models=[...],
    ))

work_dir = 'Your work dir'  # set your work dir; when you use '--reuse', all existing results in this work dir will be reused automatically

summarizer = dict(
    type=Corev2Summarizer,  # the summarizer for your dataset
    match_method='smart',   # the method used to extract answers from the JudgeLLM output
)
```
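
The `judge_model` defined above is handed to the evaluation stage through the eval task. The sketch below shows one common way to wire it, assuming your OpenCompass version provides `LocalRunner` and `SubjectiveEvalTask`; the runner settings are placeholders and may differ in your environment.

```python
# Sketch only: passing the JudgeLLM config to the evaluation task.
# Import paths and runner settings may vary across OpenCompass versions.
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',
        base_models=[...],
        compare_models=[...],
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # placeholder; tune for your hardware
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=judge_model,  # the JudgeLLM configured earlier
        ),
    ),
)
```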

### Single Model Scoring Configuration

For `config/subjective_score.py`, the configuration is largely the same as `config/subjective_compare.py`; you only need to change the eval mode to `singlescore`, as sketched below.
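
Concretely, only the partitioner's mode changes; a minimal sketch (other fields stay as in the comparison config):

```python
# In config/subjective_score.py, switch the eval mode to 'singlescore'.
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',  # score each model's responses individually instead of comparing pairs
    ))
```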

## Launching the Evaluation

```shell
python run.py config/subjective.py -r
```

The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.

## Evaluation Report

The responses of the JudgeLLM will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`.
The evaluation report will be output to `output/.../summary/timestamp/report.csv`.
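
If you want to inspect the results programmatically, a minimal sketch follows; the directory name and timestamp in the path are placeholders for whatever your run produced.

```python
# Minimal sketch: reading the generated summary report.
# The path below is a placeholder; substitute your actual output directory and timestamp.
import csv

report_path = "output/default/summary/20240101_000000/report.csv"

with open(report_path, newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)
```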