# Configure Datasets

This section of the tutorial focuses on how to prepare the datasets supported by OpenCompass and how to build configuration files that select the datasets to evaluate.

## Directory Structure of Dataset Configuration Files

First, let's introduce the structure under the `configs/datasets` directory in OpenCompass, as shown below:

```
configs/datasets/
├── agieval
├── apps
├── ARC_c
├── ...
├── CLUE_afqmc  # one folder per dataset
│   ├── CLUE_afqmc_gen_901306.py  # different versions of the config for this dataset
│   ├── CLUE_afqmc_gen.py
│   ├── CLUE_afqmc_ppl_378c5b.py
│   ├── CLUE_afqmc_ppl_6507d7.py
│   ├── CLUE_afqmc_ppl_7b0c1e.py
│   └── CLUE_afqmc_ppl.py
├── ...
├── XLSum
├── Xsum
└── z_bench
```

The `configs/datasets` directory uses a flat structure: every dataset has its own folder at the top level, and each folder contains multiple configuration variants for that dataset.

Dataset configuration files are named `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` is a configuration for the `CLUE_afqmc` dataset under the Chinese general-ability category; its evaluation method is `gen`, i.e., generative evaluation, and its prompt version number is `db509b`. Similarly, `CLUE_afqmc_ppl_00b348.py` uses the `ppl` evaluation method, i.e., discriminative evaluation, with prompt version number `00b348`.

In addition, files without a version number, such as `CLUE_afqmc_gen.py`, point to the latest prompt configuration for that evaluation method, which usually contains the most accurate prompt.
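
Such a pointer file typically just re-exports the latest versioned config via `mmengine`'s import mechanism. A minimal sketch of what `CLUE_afqmc_gen.py` might contain, assuming `901306` (from the listing above) is the latest `gen` version:

```python
from mmengine.config import read_base

with read_base():
    # Re-export the dataset list from the latest versioned config.
    from .CLUE_afqmc_gen_901306 import afqmc_datasets  # noqa: F401
```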

## Dataset Preparation

The datasets supported by OpenCompass mainly fall into two categories:

1. Huggingface Dataset

[Huggingface Dataset](https://huggingface.co/datasets) hosts a large number of datasets. OpenCompass supports most of the datasets commonly used for performance comparison; please refer to `configs/datasets` for the full list of supported datasets.

2. Third-party Datasets

In addition to supporting Huggingface's existing datasets, OpenCompass also provides some third-party and self-built datasets. Run the following commands to download these datasets and place them in the `./data` directory; this completes dataset preparation.

```bash
# Run in the OpenCompass directory
wget https://github.com/InternLM/opencompass/releases/download/0.1.0/OpenCompassData.zip
unzip OpenCompassData.zip
```

Note that the archive contains not only self-built datasets but also, for testing convenience, copies of some datasets that Huggingface already hosts.
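
As a quick sanity check (a minimal sketch; the exact contents depend on the release you downloaded), you can verify that the archive unpacked into `./data`:

```python
import os

# The zip should unpack into a `data/` directory at the repository root,
# e.g. ./data/CLUE/AFQMC/dev.json as referenced by the configs below.
assert os.path.isdir('./data'), 'Run the wget/unzip commands above first.'
print(sorted(os.listdir('./data'))[:10])  # peek at the first few entries
```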

## Dataset Selection

In each dataset configuration file, the dataset is defined in a `{}_datasets` variable, such as `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_db509b.py`:

```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",                   # abbreviation used in result tables
        type=AFQMCDataset_V2,               # dataset class that loads the data
        path="./data/CLUE/AFQMC/dev.json",  # local data file
        reader_cfg=afqmc_reader_cfg,        # how to read fields from the data
        infer_cfg=afqmc_infer_cfg,          # how to build prompts for inference
        eval_cfg=afqmc_eval_cfg,            # how to score the predictions
    ),
]
```

Similarly, `cmnli_datasets` is defined in `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`:

```python
cmnli_datasets = [
    dict(
        type=HFDataset,  # generic Huggingface dataset wrapper
        abbr='cmnli',
        path='json',     # Huggingface's generic JSON loader
        split='train',
        data_files='./data/CLUE/cmnli/cmnli_public/dev.json',
        reader_cfg=cmnli_reader_cfg,
        infer_cfg=cmnli_infer_cfg,
        eval_cfg=cmnli_eval_cfg)
]
```
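
Note that this entry reads a local file through Huggingface's generic `json` loader: `HFDataset` forwards arguments such as `path`, `split`, and `data_files` to `datasets.load_dataset`, so no download from the Huggingface Hub is involved.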

Taking these two datasets as an example: to evaluate both of them at the same time, create a new configuration file in the `configs` directory. This file uses the import mechanism of the `mmengine` configuration system to assemble the dataset portion of the evaluation script, as shown below:

```python
from mmengine.config import read_base

with read_base():
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_db509b import afqmc_datasets
    from .datasets.CLUE_cmnli.CLUE_cmnli_ppl_b78ad4 import cmnli_datasets

datasets = []
datasets += afqmc_datasets
datasets += cmnli_datasets
```
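
OpenCompass's evaluation entry point reads the merged `datasets` variable from this file, so both AFQMC and CMNLI are evaluated in a single run.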

Users can combine configuration files for different abilities, datasets, and evaluation methods to build the dataset portion of their evaluation script as needed.

For information on how to start an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.