# Installation

1. Set up the OpenCompass environment:

   ```bash
   conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
   conda activate opencompass
   ```

   If you want to customize the PyTorch version or the corresponding CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
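
   For example, to pin a specific CUDA build explicitly (the version number below is illustrative; take the exact command for your setup from the linked page):

   ```bash
   # Install a specific PyTorch build, e.g. for CUDA 11.8
   conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia -y
   ```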

2. Install OpenCompass:

   ```bash
   git clone https://github.com/open-compass/opencompass.git
   cd opencompass
   pip install -e .
   ```

3. Install humaneval (Optional)

   If you want to **evaluate your model's coding ability on the HumanEval dataset**, follow this step.

   <details>
   <summary><b>click to show the details</b></summary>

   ```bash
   git clone https://github.com/openai/human-eval.git
   cd human-eval
   pip install -r requirements.txt
   pip install -e .
   cd ..
   ```

   Please read the comments in `human_eval/execution.py` **lines 48-57** to understand the potential risks of executing model-generated code. If you accept these risks, uncomment **line 58** to enable code execution evaluation.

   </details>

4. Install Llama (Optional)

   If you want to **evaluate Llama / Llama-2 / Llama-2-chat with the official implementation**, follow this step.

   <details>
   <summary><b>click to show the details</b></summary>

   ```bash
   git clone https://github.com/facebookresearch/llama.git
   cd llama
   pip install -r requirements.txt
   pip install -e .
   cd ..
   ```

   You can find example configs in `configs/models`. ([example](https://github.com/open-compass/opencompass/blob/eb4822a94d624a4e16db03adeb7a59bbd10c2012/configs/models/llama2_7b_chat.py))

   </details>

# Dataset Preparation

The datasets supported by OpenCompass mainly fall into two categories:

1. Huggingface datasets: [Huggingface Datasets](https://huggingface.co/datasets) provides a large number of datasets, which will be **automatically downloaded** when used.
2. Custom datasets: OpenCompass also provides some **self-built** Chinese datasets. Please run the following command to **manually download and extract** them.

Run the following commands to download and place the datasets in the `${OpenCompass}/data` directory to complete dataset preparation.

```bash
# Run in the OpenCompass directory
wget https://github.com/open-compass/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```

OpenCompass supports most of the datasets commonly used for performance comparison; please refer to `configs/datasets` for the full list of supported datasets.
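
For example, you can browse the available dataset configurations directly from the repository root:

```bash
# List the dataset configuration folders shipped with OpenCompass
ls configs/datasets
```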

# Quick Start

We will demonstrate some basic features of OpenCompass by evaluating the pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winogrande) benchmark tasks, with the config file located at [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).

Before running this experiment, please make sure you have installed OpenCompass locally. The demo should run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to the other examples provided in the [configs directory](https://github.com/open-compass/opencompass/tree/main/configs).

## Configure an Evaluation Task

In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.

`````{tabs}

````{tab} Command Line

Users can combine the models and datasets they want to test using `--models` and `--datasets`.

```bash
python run.py --models opt_125m opt_350m --datasets siqa_gen winograd_ppl
```

The models and datasets are pre-stored in the form of configuration files in `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations using `tools/list_configs.py`.

```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```

Some sample outputs are:

```text
+-----------------+-----------------------------------+
| Model           | Config Path                       |
|-----------------+-----------------------------------|
| hf_llama2_13b   | configs/models/hf_llama2_13b.py   |
| hf_llama2_70b   | configs/models/hf_llama2_70b.py   |
| ...             | ...                               |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset           | Config Path                                       |
|-------------------+---------------------------------------------------|
| cmmlu_gen         | configs/datasets/cmmlu/cmmlu_gen.py               |
| cmmlu_gen_ffe7c0  | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py        |
| ...               | ...                                               |
+-------------------+---------------------------------------------------+
```

Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that the prompts or evaluation methods differ.
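
For example, using names taken from the sample output above:

```bash
# Evaluate the HuggingFace Llama-2-13B config on CMMLU, using names listed by tools/list_configs.py
python run.py --models hf_llama2_13b --datasets cmmlu_gen
```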

For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, you can evaluate the `facebook/opt-125m` model with the following command:

```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-model facebook/opt-125m \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128  \
--num-gpus 1
```

```{tip}
For all HuggingFace related parameters supported by `run.py`, please read [Initiating Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task).
```


````

````{tab} Configuration File

In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. This method of configuration allows users to easily modify experimental parameters, provides a more flexible configuration, and simplifies the run command. The configuration file is organized in Python format and must include the `datasets` and `models` fields.

The configuration used in this demo is [configs/eval_demo.py](/configs/eval_demo.py). It imports the required dataset and model configurations through the [inheritance mechanism](./user_guides/config.md#inheritance-mechanism) and assembles the `datasets` and `models` fields in the required format.

```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.opt.hf_opt_125m import opt125m
    from .models.opt.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```

When running tasks, we just need to pass the path of the configuration file to `run.py`:

```bash
python run.py configs/eval_demo.py
```

````

`````

The configuration file evaluation method is more concise. The following sections will use this method as an example to explain the other features.

## Run Evaluation

Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in debug mode for the first run to check for any problems. In debug mode, the tasks will be executed sequentially and the status will be printed in real time.

```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```

If everything is fine, you should see "Starting inference process" on screen:

```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```

You can then press `ctrl+c` to interrupt the program and run the following command to start the parallel evaluation:

```bash
python run.py configs/eval_demo.py -w outputs/demo
```

Now let's go over the configuration file and the launch options used in this case.

## Explanations

### Model list - `models`

OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](/configs/models/hf_opt_350m.py) (`configs/models/hf_opt_350m.py`):

```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
       type=HuggingFaceCausalLM,
       # Initialization parameters for `HuggingFaceCausalLM`
       path='facebook/opt-350m',
       tokenizer_path='facebook/opt-350m',
       tokenizer_kwargs=dict(
           padding_side='left',
           truncation_side='left',
           proxies=None,
           trust_remote_code=True),
       model_kwargs=dict(device_map='auto'),
       # Below are common parameters for all models, not specific to HuggingFaceCausalLM
       abbr='opt350m',               # Model abbreviation for result display
       max_seq_len=2048,             # The maximum length of the entire sequence
       max_out_len=100,              # Maximum number of generated tokens
       batch_size=64,                # Batch size
       run_cfg=dict(num_gpus=1),     # The required GPU numbers for this model
    )
```

When using configurations, we can specify the relevant files through the command-line argument `--models`, or import the model configurations into the `models` list in the configuration file using the inheritance mechanism.
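
For reference, a minimal configuration file that pulls the predefined `opt350m` config into `models` via inheritance (mirroring the imports used in `configs/eval_demo.py`) could look like this:

```python
from mmengine.config import read_base

with read_base():
    # Reuse the predefined OPT-350M model config shown above
    from .models.opt.hf_opt_350m import opt350m

models = [opt350m]
```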

If the HuggingFace model you want to test is not among them, you can also directly specify the related parameters in the command line.

```bash
python run.py \
--hf-model facebook/opt-350m \  # HuggingFace model path
--tokenizer-path facebook/opt-350m \  # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \  # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \  # Arguments to construct the model
--max-seq-len 2048 \  # Maximum sequence length the model can accept
--max-out-len 100 \  # Maximum number of tokens to generate
--batch-size 64  \  # Batch size
--num-gpus 1  # Number of GPUs required to run the model
```

The pretrained models `facebook/opt-350m` and `facebook/opt-125m` will be automatically downloaded from HuggingFace during the first run.

```{note}
More information about model configuration can be found in [Prepare Models](./user_guides/models.md).
```

### Dataset list - `datasets`

Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line or import related configurations in the configuration file via inheritance.

Below is a dataset-related configuration snippet from `configs/eval_demo.py`:

```python
from mmengine.config import read_base  # Use mmengine.read_base() to read the base configuration

with read_base():
    # Directly read the required dataset configurations from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Read Winograd configuration, evaluated based on PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # Read SIQA configuration, evaluated based on generation

datasets = [*siqa_datasets, *winograd_datasets]       # The final config needs to contain the required evaluation dataset list 'datasets'
```

Dataset configurations typically come in two types, 'ppl' and 'gen', indicating the evaluation method used: `ppl` means discriminative evaluation and `gen` means generative evaluation. For example, the demo uses `winograd_ppl` (perplexity-based) and `siqa_gen` (generation-based).

Moreover, [configs/datasets/collections](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections) houses various dataset collections, making it convenient for comprehensive evaluations. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To replicate results, simply import that file, for example:

```bash
python run.py --models hf_llama_7b --datasets base_medium
```

```{note}
You can find more information from [Dataset Preparation](./user_guides/dataset_prepare.md).
```

### Launch Evaluation

When the config file is ready, we can start the task in **debug mode** to check for any exceptions in model loading, dataset reading, or incorrect cache usage.

```shell
python run.py configs/eval_demo.py -w outputs/demo --debug
```

However, in `--debug` mode, tasks are executed sequentially. After confirming that everything is correct, you
can disable the `--debug` mode to fully utilize multiple GPUs.

```shell
python run.py configs/eval_demo.py -w outputs/demo
```

Here are some parameters related to evaluation that can help you configure more efficient inference tasks based on your environment; a combined example follows this list:

- `-w outputs/demo`: Directory to save evaluation logs and results.
- `-r`: Resume the previous (interrupted) evaluation.
- `--mode all`: Specify a particular stage of the task to run.
  - all: Perform a complete evaluation, including inference and evaluation.
  - infer: Perform inference on each dataset.
  - eval: Perform evaluation based on the inference results.
  - viz: Display evaluation results only.
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large, and using this parameter can split them into multiple sub-tasks to efficiently utilize resources. However, if the partition is too fine, the overall speed may be slower due to longer model loading times.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks. In a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks depends on the available GPU resources and may not be equal to this number.
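
For instance, these options can be combined into a single local run:

```bash
# Resume an interrupted run with finer dataset partitioning and more parallel workers
python run.py configs/eval_demo.py -w outputs/demo -r --max-partition-size 2000 --max-num-workers 32
```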

If you are not performing the evaluation on your local machine but using a Slurm cluster, you can specify the following parameters (see the example after this list):

- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.
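
Putting these together, a typical Slurm submission could look like:

```bash
# Submit the demo evaluation to the Slurm partition `my_part`, retrying failed tasks up to twice
python run.py configs/eval_demo.py -w outputs/demo --slurm -p my_part --retry 2
```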

```{tip}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task) for details.
```

## Obtaining Evaluation Results

After the evaluation is complete, the evaluation results table will be printed as follows:

```text
dataset    version    metric    mode      opt350m    opt125m
---------  ---------  --------  ------  ---------  ---------
siqa       e78df3     accuracy  gen         21.55      12.44
winograd   b6c7ed     accuracy  ppl         51.23      49.82
```

All run outputs will be directed to the `outputs/demo/` directory with the following structure:

```text
outputs/demo/
├── 20200220_120000
├── 20230220_183030     # one folder per experiment
│   ├── configs         # Dumped config files, kept for the record. Multiple configs may be kept if experiments are re-run in the same folder
│   ├── logs            # log files for both inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions   # Prediction results for each task
│   ├── results       # Evaluation results for each task
│   └── summary       # Summarized evaluation results for a single experiment
├── ...
```

The summarization process can be further customized in the configuration to output the averaged scores of some benchmarks (MMLU, C-Eval, etc.).
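
As a rough sketch, a custom summarizer can be declared directly in the experiment config; the field names below (`dataset_abbrs`, `summary_groups`) follow the convention used by OpenCompass summarizer configs and should be verified against the [Results Summary](./user_guides/summarizer.md) guide linked below:

```python
# Hypothetical summarizer section of an experiment config (verify field names against the docs)
summarizer = dict(
    # Report only these datasets in the final table
    dataset_abbrs=['siqa', 'winograd'],
    # Groups defining how averaged scores (e.g. an MMLU average) are computed; empty here
    summary_groups=[],
)
```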

More information about obtaining evaluation results can be found in [Results Summary](./user_guides/summarizer.md).

## Additional Tutorials

To learn more about using OpenCompass, explore the following tutorials:

- [Prepare Datasets](./user_guides/dataset_prepare.md)
- [Prepare Models](./user_guides/models.md)
- [Task Execution and Monitoring](./user_guides/experimentation.md)
- [Understand Prompts](./prompt/overview.md)
- [Results Summary](./user_guides/summarizer.md)
- [Learn about Config](./user_guides/config.md)