# Quick Start

![image](https://github.com/open-compass/opencompass/assets/22607038/d063cae0-3297-4fd2-921a-366e0a24890b)

## Overview

OpenCompass provides a streamlined workflow for evaluating a model, which consists of the following stages: **Configure** -> **Inference** -> **Evaluation** -> **Visualization**.

**Configure**: This is your starting point. Here, you'll set up the entire evaluation process, choosing the model(s) and dataset(s) to assess. You also have the option to select an evaluation strategy, the computation backend, and define how you'd like the results displayed.

**Inference & Evaluation**: OpenCompass efficiently manages the heavy lifting, conducting parallel inference and evaluation on your chosen model(s) and dataset(s). The **Inference** phase is all about producing outputs from your datasets, whereas the **Evaluation** phase measures how well these outputs align with the gold standard answers. While this procedure is broken down into multiple "tasks" that run concurrently for greater efficiency, be aware that working with limited computational resources might introduce some unexpected overheads, resulting in generally slower evaluation. To understand this issue and learn how to solve it, check out [FAQ: Efficiency](faq.md#efficiency).

**Visualization**: Once the evaluation is done, OpenCompass collates the results into an easy-to-read table and saves them as both CSV and TXT files. If you need real-time updates, you can activate Lark reporting and get immediate status reports in your Lark client.

Coming up, we'll walk you through the basics of OpenCompass, showcasing evaluations of pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winograd_wsc) benchmark tasks. Their configuration files can be found at [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).

Before running this experiment, please make sure you have installed OpenCompass locally. The experiment should run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to the other examples provided in the [configs directory](https://github.com/open-compass/opencompass/tree/main/configs).

## Configuring an Evaluation Task

In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.

`````{tabs}

````{tab} Command Line

Users can combine the models and datasets they want to test using `--models` and `--datasets`.

```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```

The models and datasets are pre-stored in the form of configuration files in `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations using `tools/list_configs.py`.

```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```

:::{dropdown} More about `list_configs`
:animate: fade-in-slide-down

Running `python tools/list_configs.py llama mmlu` gives output like the following:

```text
+-----------------+-----------------------------------+
| Model           | Config Path                       |
|-----------------+-----------------------------------|
| hf_llama2_13b   | configs/models/hf_llama2_13b.py   |
| hf_llama2_70b   | configs/models/hf_llama2_70b.py   |
| ...             | ...                               |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset           | Config Path                                       |
|-------------------+---------------------------------------------------|
| cmmlu_gen         | configs/datasets/cmmlu/cmmlu_gen.py               |
| cmmlu_gen_ffe7c0  | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py        |
| ...               | ...                                               |
+-------------------+---------------------------------------------------+
```

Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that its prompts or evaluation methods are different.
:::

:::{dropdown} Model not on the list?
:animate: fade-in-slide-down

If you want to evaluate other models, please check out the "Command Line (Custom HF Model)" tab for the way to construct a custom HF model without a configuration file, or "Configuration File" tab to learn the general way to prepare your model configurations.

:::

````

````{tab} Command Line (Custom HF Model)

For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, the `facebook/opt-125m` model can be evaluated with the following command:

```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128  \
--num-gpus 1  # Minimum required number of GPUs
```

Note that with this approach, OpenCompass only evaluates one model at a time, while the other approaches can evaluate multiple models at once.

```{caution}
`--num-gpus` does not stand for the actual number of GPUs to use in evaluation, but the minimum required number of GPUs for this model. [More](faq.md#how-does-opencompass-allocate-gpus)
```

:::{dropdown} More detailed example
:animate: fade-in-slide-down
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \  # HuggingFace model path
--tokenizer-path facebook/opt-125m \  # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \  # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \  # Arguments to construct the model
--max-seq-len 2048 \  # Maximum sequence length the model can accept
--max-out-len 100 \  # Maximum number of tokens to generate
--batch-size 64  \  # Batch size
--num-gpus 1  # Number of GPUs required to run the model
```
```{seealso}
For all HuggingFace related parameters supported by `run.py`, please read [Launching Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task).
```
:::


````
````{tab} Configuration File

In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. The configuration file is organized in Python format and must include the `datasets` and `models` fields.

The configuration used for this demo is [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py). This configuration introduces the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#inheritance-mechanism) and combines the `datasets` and `models` fields in the required format.

```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.opt.hf_opt_125m import opt125m
    from .models.opt.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```

When running tasks, we just need to pass the path of the configuration file to `run.py`:

```bash
python run.py configs/eval_demo.py
```

:::{dropdown} More about `models`
:animate: fade-in-slide-down

OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](https://github.com/open-compass/opencompass/blob/main/configs/models/opt/hf_opt_350m.py) (`configs/models/opt/hf_opt_350m.py`):

```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
    type=HuggingFaceCausalLM,
    # Initialization parameters for `HuggingFaceCausalLM`
    path='facebook/opt-350m',
    tokenizer_path='facebook/opt-350m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    # Below are common parameters for all models, not specific to HuggingFaceCausalLM
    abbr='opt350m',               # Model abbreviation for result display
    max_seq_len=2048,             # The maximum length of the entire sequence
    max_out_len=100,              # Maximum number of generated tokens
    batch_size=64,                # Batch size
    run_cfg=dict(num_gpus=1),     # The required GPU numbers for this model
)
```

When using configurations, we can specify the relevant files through the command-line argument `--models`, or import the model configurations into the `models` list in the configuration file using the inheritance mechanism.
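
As a minimal sketch of the second approach (the same pattern used by `configs/eval_demo.py` shown above):

```python
from mmengine.config import read_base

with read_base():
    # Pull in the predefined model configuration shown above
    from .models.opt.hf_opt_350m import opt350m

# The final config must expose a `models` list
models = [opt350m]
```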

```{seealso}
More information about model configuration can be found in [Prepare Models](../user_guides/models.md).
```
:::

:::{dropdown} More about `datasets`
:animate: fade-in-slide-down

Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line, or import related configurations in the configuration file via inheritance.

Below is a dataset-related configuration snippet from `configs/eval_demo.py`:

```python
from mmengine.config import read_base  # Use mmengine.read_base() to read the base configuration

with read_base():
    # Directly read the required dataset configurations from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Read Winograd configuration, evaluated based on PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # Read SIQA configuration, evaluated based on generation

datasets = [*siqa_datasets, *winograd_datasets]       # The final config needs to contain the required evaluation dataset list 'datasets'
```

Dataset configurations typically come in two types, 'ppl' and 'gen', indicating the evaluation method used: `ppl` refers to discriminative evaluation, while `gen` refers to generative evaluation.
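
For example, to evaluate SIQA discriminatively rather than generatively, you would import the corresponding `ppl` config instead. The sketch below assumes a `siqa_ppl` config exists under `configs/datasets/siqa` and also exports `siqa_datasets`; confirm the exact name with `python tools/list_configs.py siqa`:

```python
from mmengine.config import read_base

with read_base():
    # Assumed 'ppl' counterpart of siqa_gen; verify the config name with tools/list_configs.py
    from .datasets.siqa.siqa_ppl import siqa_datasets

datasets = [*siqa_datasets]
```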

Moreover, [configs/datasets/collections](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections) houses various dataset collections, making it convenient for comprehensive evaluations. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To replicate results, simply use that collection, for example:

```bash
python run.py --models hf_llama_7b --datasets base_medium
```
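
Equivalently, you can import the collection in a configuration file. The sketch below assumes `base_medium.py` exposes a `datasets` list, as the demo configs above do; check the file for the exact variable it defines:

```python
from mmengine.config import read_base

with read_base():
    # Assumed import path and variable; verify against configs/datasets/collections/base_medium.py
    from .datasets.collections.base_medium import datasets
```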

```{seealso}
You can find more information from [Dataset Preparation](../user_guides/datasets.md).
```
:::


````

`````

```{warning}
OpenCompass usually assumes that the network is available. If you encounter network issues or wish to run OpenCompass in an offline environment, please refer to [FAQ - Network - Q1](./faq.md#network) for solutions.
```

The following sections will use the configuration-based method as an example to explain the other features.

## Launching Evaluation

Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in `--debug` mode for the first run and check if there is any problem. In `--debug` mode, the tasks will be executed sequentially and output will be printed in real time.

```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```

The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
If everything is fine, you should see "Starting inference process" on screen:

```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```

Then you can press `ctrl+c` to interrupt the program, and run the following command in normal mode:

```bash
python run.py configs/eval_demo.py -w outputs/demo
```

In normal mode, the evaluation tasks will be executed in parallel in the background, and their output will be redirected to the output directory `outputs/demo/{TIMESTAMP}`. The progress bar on the frontend only indicates the number of completed tasks, regardless of their success or failure. **Any backend task failures will only trigger a warning message in the terminal.**

:::{dropdown} More parameters in `run.py`
:animate: fade-in-slide-down
Here are some parameters related to evaluation that can help you configure more efficient inference tasks based on your environment:

- `-w outputs/demo`: Work directory to save evaluation logs and results. In this case, the experiment result will be saved to `outputs/demo/{TIMESTAMP}`.
- `-r`: Reuse existing inference results, and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused.
- `--mode all`: Specify a particular stage of the task.
  - all: (Default) Perform a complete evaluation, including inference and evaluation.
  - infer: Perform inference on each dataset.
  - eval: Perform evaluation based on the inference results.
  - viz: Display evaluation results only.
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large, and using this parameter can split them into multiple sub-tasks to efficiently utilize resources. However, if the partition is too fine, the overall speed may be slower due to longer model loading times.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks. In a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks depends on the available GPU resources and may not be equal to this number.

If you are not performing the evaluation on your local machine but using a Slurm cluster, you can specify the following parameters:

- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.

```{seealso}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task) for details.
```

:::

## Visualizing Evaluation Results

After the evaluation is complete, the evaluation results table will be printed as follows:

```text
dataset    version    metric    mode      opt350m    opt125m
---------  ---------  --------  ------  ---------  ---------
siqa       e78df3     accuracy  gen         21.55      12.44
winograd   b6c7ed     accuracy  ppl         51.23      49.82
```

All run outputs will be directed to the `outputs/demo/` directory with the following structure:

```text
outputs/demo/
├── 20200220_120000
├── 20230220_183030     # one folder per experiment
│   ├── configs         # Dumped config files for record. Multiple configs may be kept if different experiments have been re-run on the same experiment folder
│   ├── logs            # log files for both inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions   # Prediction results for each task
│   ├── results       # Evaluation results for each task
│   └── summary       # Summarized evaluation results for a single experiment
├── ...
```

The summarization process can be further customized in the configuration, for example to output the averaged scores of certain benchmarks (MMLU, C-Eval, etc.).
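
For instance, a minimal sketch of such a customization (treat the field names as assumptions to verify against the summarizer guide linked below) might look like:

```python
# Added to the evaluation config (e.g. configs/eval_demo.py); hypothetical field values
summarizer = dict(
    dataset_abbrs=['siqa', 'winograd'],  # only these entries, in this order, appear in the summary
)
```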

More information about obtaining evaluation results can be found in [Results Summary](../user_guides/summarizer.md).

## Additional Tutorials

To learn more about using OpenCompass, explore the following tutorials:

- [Prepare Datasets](../user_guides/datasets.md)
- [Prepare Models](../user_guides/models.md)
- [Task Execution and Monitoring](../user_guides/experimentation.md)
- [Understand Prompts](../prompt/overview.md)
- [Results Summary](../user_guides/summarizer.md)
- [Learn about Config](../user_guides/config.md)