# Installation

1. Set up the OpenCompass environment:

```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
```

If you want to customize the PyTorch version or related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
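
For example, a minimal sketch of pinning a specific PyTorch build (the version numbers here are illustrative; choose a combination from the official matrix that matches your CUDA driver):

```bash
# Illustrative example: PyTorch 2.0.1 built against CUDA 11.8;
# adjust the versions to your own driver and environment
conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.8 -c pytorch -c nvidia
```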

2. Install OpenCompass:

```bash
git clone https://github.com/InternLM/opencompass.git
cd opencompass
pip install -e .
```

3. Install humaneval (optional):

If you want to **evaluate your model's coding ability on the humaneval dataset**, complete this step; otherwise, skip it.

<details>
<summary><b>Click to show the details</b></summary>

```bash
git clone https://github.com/openai/human-eval.git
cd human-eval
pip install -r requirements.txt
pip install -e .
cd ..
```

Please read the comments in `human_eval/execution.py` **lines 48-57** to understand the potential risks of executing model-generated code. If you accept these risks, uncomment **line 58** to enable code execution evaluation.
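
For reference, the change amounts to removing the leading `#` from the `exec` call at the end of that warning block; in the upstream repository it looks roughly like this (an assumption based on the current upstream file — verify against your checkout, since line numbers may drift between versions):

```python
# human_eval/execution.py — the commented-out line, shown here uncommented.
# Only do this after reading the surrounding warning about untrusted code.
exec(check_program, exec_globals)
```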

</details>

# Dataset Preparation

The datasets supported by OpenCompass fall into two main categories:

1. Huggingface datasets: [Huggingface Datasets](https://huggingface.co/datasets) hosts a large number of datasets, which are **downloaded automatically** when running with this option.
2. Custom datasets: OpenCompass also provides some Chinese custom **self-built** datasets. Please run the following command to **manually download and extract** them.

Run the following commands to download the datasets and place them in the `${OpenCompass}/data` directory; this completes dataset preparation.

```bash
# Run in the OpenCompass directory
wget https://github.com/InternLM/opencompass/releases/download/0.1.0/OpenCompassData.zip
unzip OpenCompassData.zip
```

OpenCompass supports most of the datasets commonly used for performance comparison; please refer to `configs/datasets` for the full list of supported datasets.

# Quick Start

The evaluation of OpenCompass relies on configuration files, which must contain the fields **`datasets`** and **`models`**.
These configurations specify the models and datasets to evaluate using **`run.py`**.

We will demonstrate some basic features of OpenCompass by evaluating the pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winogrande) benchmark tasks, with their config file located at [configs/eval_demo.py](https://github.com/InternLM/opencompass/blob/main/configs/eval_demo.py).

Before running this experiment, please make sure you have installed OpenCompass locally; the demo should run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to the other examples provided in the [configs directory](https://github.com/InternLM/opencompass/tree/main/configs).

To start the evaluation task, use the following command:

```bash
python run.py configs/eval_demo.py --debug
```

While the demo runs, let's go over the details of the configuration and launch options used in this case.

## Step by step

<details>
<summary><b>Learn about `datasets`</b></summary>

```python
from mmengine.config import read_base

with read_base():
    # Read the required dataset configurations directly from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets   # ppl inference
    from .datasets.siqa.siqa_gen import siqa_datasets               # gen inference

datasets = [*siqa_datasets, *winograd_datasets]   # Concatenate the datasets to be evaluated into the datasets field
```

Various dataset configurations are available in [configs/datasets](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets).
Some datasets have two types of configuration files in their folders, named `'ppl'` and `'gen'`, which represent different evaluation methods: `'ppl'` denotes discriminative evaluation, while `'gen'` denotes generative evaluation.

[configs/datasets/collections](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets/collections) contains various collections of datasets for comprehensive evaluation purposes.
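
For instance, a sketch of importing one of these collections instead of individual datasets (the collection name `base_medium` is an assumption; check the folder for the files actually available in your version):

```python
from mmengine.config import read_base

with read_base():
    # Hypothetical example: pull in a whole preset collection of datasets;
    # see configs/datasets/collections for what exists in your checkout
    from .datasets.collections.base_medium import datasets
```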

</details>

<details>
<summary><b>Learn about `models`</b></summary>

The pretrained models `facebook/opt-350m` and `facebook/opt-125m` from HuggingFace support automatic downloading.

```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
    type=HuggingFaceCausalLM,
    # Initialization parameters for `HuggingFaceCausalLM`
    path='facebook/opt-350m',
    tokenizer_path='facebook/opt-350m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    max_seq_len=2048,
    # Parameters common to all models, not initialization parameters of `HuggingFaceCausalLM`
    abbr='opt350m',                    # Model abbreviation for result display
    max_out_len=100,                   # Maximum number of generated tokens
    batch_size=64,                     # Batch size
    run_cfg=dict(num_gpus=1),          # Run configuration specifying resource requirements
)

# OPT-125M
opt125m = dict(
    type=HuggingFaceCausalLM,
    # Initialization parameters for `HuggingFaceCausalLM`
    path='facebook/opt-125m',
    tokenizer_path='facebook/opt-125m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    max_seq_len=2048,
    # Parameters common to all models, not initialization parameters of `HuggingFaceCausalLM`
    abbr='opt125m',                # Model abbreviation for result display
    max_out_len=100,               # Maximum number of generated tokens
    batch_size=128,                # Batch size
    run_cfg=dict(num_gpus=1),      # Run configuration specifying resource requirements
)

models = [opt350m, opt125m]
```

</details>

<details>
<summary><b>Launch Evaluation</b></summary>

First, we can start the task in **debug mode** to check for any exceptions in model loading, dataset reading, or incorrect cache usage.

```shell
python run.py configs/eval_demo.py -w outputs/demo --debug
```

However, in `--debug` mode, tasks are executed sequentially. Once you have confirmed that everything is correct, remove the `--debug` flag to fully utilize multiple GPUs.

```shell
python run.py configs/eval_demo.py -w outputs/demo
```

Here are some evaluation-related parameters that can help you configure more efficient inference tasks for your environment; a combined example follows the list:

- `-w outputs/demo`: Directory to save evaluation logs and results.
- `-r`: Restart the previous (interrupted) evaluation.
- `--mode all`: Specify which stage of the task to run.
  - all: Perform a complete evaluation, including inference and evaluation.
  - infer: Perform inference on each dataset.
  - eval: Perform evaluation based on the inference results.
  - viz: Display evaluation results only.
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large, and using this parameter can split them into multiple sub-tasks to efficiently utilize resources. However, if the partition is too fine, the overall speed may be slower due to longer model loading times.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks. In a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks depends on the available GPU resources and may not be equal to this number.
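
Putting several of these together, a fuller invocation might look like the following sketch (the values are illustrative, not recommendations):

```bash
# Resume an interrupted run and re-run only the inference stage,
# with finer dataset partitions and more parallel workers
python run.py configs/eval_demo.py -w outputs/demo -r --mode infer \
    --max-partition-size 2000 --max-num-workers 32
```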

If you are not performing the evaluation on your local machine but on a Slurm cluster, you can specify the following parameters; an example command follows the list:

- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.
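
For example, a sketch of a Slurm submission combining these options (`my_part` is a placeholder for your cluster's partition name):

```bash
# Submit the demo evaluation to a Slurm partition, retrying failed tasks twice
python run.py configs/eval_demo.py -w outputs/demo --slurm -p my_part --retry 2
```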

```{tip}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task) for details.
```

</details>

## Obtaining Evaluation Results

After the evaluation is complete, the evaluation results table will be printed as follows:

```text
dataset    version    metric    mode      opt350m    opt125m
---------  ---------  --------  ------  ---------  ---------
siqa       e78df3     accuracy  gen         21.55      12.44
winograd   b6c7ed     accuracy  ppl         51.23      49.82
```

All run outputs are saved to the `outputs/default/` directory by default, with the following structure:

```text
outputs/default/
├── 20200220_120000
├── 20230220_183030     # one experiment per folder
│   ├── configs         # replicable config files
│   ├── logs            # log files for both inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions     # inference results for each data point, in JSON format
│   └── results         # numerical results of each evaluation session
├── ...
```

Each timestamp folder represents one experiment with the following contents:

- `configs`: stored configuration files;
- `logs`: log files for both the **inference** and **evaluation** stages;
- `predictions`: inference results for each data point, in JSON format;
- `results`: numerical results of each evaluation session, in JSON format.

## Additional Tutorials

To learn more about using OpenCompass, explore the following tutorials:

- [Preparing Datasets](./user_guides/dataset_prepare.md)
- [Customizing Models](./user_guides/models.md)
- [Exploring Experimentation Workflows](./user_guides/experimentation.md)
- [Understanding Prompts](./prompt/overview.md)