"vscode:/vscode.git/clone" did not exist on "ae3980603452c7445810271c9411f7ec3a5683a2"
get_started.md 11.8 KB
Newer Older
gaotongxiao's avatar
gaotongxiao committed
1
2
# Installation

1. Set up the OpenCompass environment:

   ```bash
   conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
   conda activate opencompass
   ```

   If you want to customize the PyTorch version or related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
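
   A quick sanity check can save debugging later; here is a minimal sketch (it assumes the `packaging` helper is importable, which is the case in most pip-based environments):

   ```python
   # Optional: confirm the installed PyTorch satisfies OpenCompass's requirement.
   import torch
   from packaging import version  # assumption: available in most pip environments

   assert version.parse(torch.__version__.split("+")[0]) >= version.parse("1.13"), \
       "OpenCompass requires pytorch>=1.13"
   print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
   ```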

2. Install OpenCompass:

   ```bash
   git clone https://github.com/InternLM/opencompass.git
   cd opencompass
   pip install -e .
   ```
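
   To confirm the editable install worked, a minimal import check (the top-level module name `opencompass` matches the imports used in the configs below):

   ```python
   # Optional: verify that `pip install -e .` registered the package.
   import opencompass
   print(opencompass.__file__)  # should point back into the cloned repository
   ```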

3. Install humaneval (Optional)

   If you want to **evaluate your model's coding ability on the humaneval dataset**, follow this step.

   <details>
   <summary><b>click to show the details</b></summary>

   ```bash
   git clone https://github.com/openai/human-eval.git
   cd human-eval
   pip install -r requirements.txt
   pip install -e .
   cd ..
   ```

   Please read the comments in `human_eval/execution.py` **lines 48-57** to understand the potential risks of executing the model generation code. If you accept these risks, uncomment **line 58** to enable code execution evaluation.
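
   For context, enabling that line means model-generated strings get executed; a deliberately harmless sketch of the risk class (illustration only, not code from human-eval):

   ```python
   # Illustration only: running untrusted generated code amounts to exec() on it.
   untrusted = "import os; os.system('echo this could have been a destructive command')"
   exec(untrusted)  # execution.py guards exactly this class of call
   ```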

   </details>

4. Install Llama (Optional)

   If you want to **evaluate Llama / Llama-2 / Llama-2-chat with its official implementation**, follow this step.

   <details>
   <summary><b>click to show the details</b></summary>

   ```bash
   git clone https://github.com/facebookresearch/llama.git
   cd llama
   pip install -r requirements.txt
   pip install -e .
   cd ..
   ```

   You can find example configs in `configs/models`. ([example](https://github.com/InternLM/opencompass/blob/eb4822a94d624a4e16db03adeb7a59bbd10c2012/configs/models/llama2_7b_chat.py))

   </details>

# Dataset Preparation

The datasets supported by OpenCompass fall into two main categories:

1. Huggingface datasets: [Huggingface Datasets](https://huggingface.co/datasets) hosts a large number of datasets, which are **automatically downloaded** when the evaluation runs.
2. Custom datasets: OpenCompass also provides some **self-built** Chinese datasets. Please run the commands below to **manually download and extract** them.

Run the following commands to download the datasets and place them in the `${OpenCompass}/data` directory; this completes the dataset preparation.

```bash
# Run in the OpenCompass directory
wget https://github.com/InternLM/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```
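
A quick check that the archive unpacked where OpenCompass expects; a minimal sketch (run from the repository root; the exact sub-folder names depend on the release archive):

```python
# Optional: the archive should unpack into ./data under the OpenCompass root.
from pathlib import Path

data_dir = Path("data")
assert data_dir.is_dir(), "Run the wget/unzip commands from the OpenCompass root first."
print(sorted(p.name for p in data_dir.iterdir())[:10])  # peek at the first entries
```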

OpenCompass supports most of the datasets commonly used for performance comparison; please refer to `configs/datasets` for the specific list of supported datasets.

# Quick Start

The evaluation of OpenCompass relies on configuration files, which must contain the fields **`datasets`** and **`models`**.
These configurations specify the models and datasets to evaluate, and are passed to **`run.py`**.

We will demonstrate some basic features of OpenCompass by evaluating the pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on both the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winogrande) benchmarks, with the config file located at [configs/eval_demo.py](https://github.com/InternLM/opencompass/blob/main/configs/eval_demo.py).

Before running this experiment, please make sure you have installed OpenCompass locally; the demo should run successfully on a single _GTX-1660-6G_ GPU.
For models with larger parameter counts, such as Llama-7B, refer to the other examples provided in the [configs directory](https://github.com/InternLM/opencompass/tree/main/configs).

Since OpenCompass launches evaluation processes in parallel by default, we can start the first run in debug mode to check for any problems. In debug mode, the tasks are executed sequentially and their status is printed in real time.

```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```

If everything is fine, you should see "Starting inference process" on screen:

```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```

You can then press `ctrl+c` to interrupt the program, and run the following command to start the parallel evaluation:

```bash
python run.py configs/eval_demo.py -w outputs/demo
```

Now let's go over the configuration file and the launch options used in this case.

## Explanations

### Dataset list - `datasets`

Below is the configuration snippet related to datasets in `configs/eval_demo.py`:

```python
from mmengine.config import read_base  # Use mmengine.read_base() to load base configs

with read_base():
    # Read the required dataset configurations directly from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets   # Load Winograd's configuration, which uses perplexity-based inference
    from .datasets.siqa.siqa_gen import siqa_datasets               # Load SIQA's configuration, which uses generation-based inference

datasets = [*siqa_datasets, *winograd_datasets]   # Concatenate the datasets to be evaluated into the datasets field
```

Various dataset configurations are available in [configs/datasets](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets).
Some datasets have two types of configuration files within their folders named `ppl` and `gen`, representing different evaluation methods. Specifically, `ppl` represents discriminative evaluation, while `gen` stands for generative evaluation.

[configs/datasets/collections](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets/collections) contains various collections of datasets for comprehensive evaluation purposes.
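
A config can pull in such a collection instead of individual datasets; here is a hedged sketch (the module name `base_medium` is illustrative; check the folder for the collections actually shipped):

```python
# Sketch: loading a preset dataset collection inside a config file.
from mmengine.config import read_base

with read_base():
    # `base_medium` is an illustrative name; see configs/datasets/collections
    from .datasets.collections.base_medium import datasets
```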

You can find more information in [Dataset Preparation](./user_guides/dataset_prepare.md).

### Model list - `models`

OpenCompass supports directly specifying the list of models to be tested in the configuration. For HuggingFace models, users usually do not need to modify the code. The following is the relevant configuration snippet:

```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
       type=HuggingFaceCausalLM,
       # Initialization parameters for `HuggingFaceCausalLM`
       path='facebook/opt-350m',
       tokenizer_path='facebook/opt-350m',
       tokenizer_kwargs=dict(
           padding_side='left',
           truncation_side='left',
           proxies=None,
           trust_remote_code=True),
       model_kwargs=dict(device_map='auto'),
       # Below are common parameters for all models, not specific to HuggingFaceCausalLM
       abbr='opt350m',               # Model abbreviation for result display
       max_seq_len=2048,             # The maximum length of the entire sequence
       max_out_len=100,              # Maximum number of generated tokens
       batch_size=64,                # Batch size
       run_cfg=dict(num_gpus=1),     # Run configuration for specifying resource requirements
    )

# OPT-125M
opt125m = dict(
       type=HuggingFaceCausalLM,
       # Initialization parameters for `HuggingFaceCausalLM`
       path='facebook/opt-125m',
       tokenizer_path='facebook/opt-125m',
       tokenizer_kwargs=dict(
           padding_side='left',
           truncation_side='left',
           proxies=None,
           trust_remote_code=True),
       model_kwargs=dict(device_map='auto'),
       # Below are common parameters for all models, not specific to HuggingFaceCausalLM
       abbr='opt125m',                # Model abbreviation for result display
       max_seq_len=2048,              # The maximum length of the entire sequence
       max_out_len=100,               # Maximum number of generated tokens
       batch_size=128,                # Batch size
       run_cfg=dict(num_gpus=1),      # Run configuration for specifying resource requirements
    )

models = [opt350m, opt125m]
```

The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
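
If your evaluation machine has limited connectivity, you can pre-fetch the weights into the local cache; a sketch using `huggingface_hub` (installed as a dependency of `transformers`):

```python
# Optional: pre-download the demo models into the local HuggingFace cache.
from huggingface_hub import snapshot_download

for repo in ("facebook/opt-125m", "facebook/opt-350m"):
    snapshot_download(repo_id=repo)
```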

More information about model configuration can be found in [Prepare Models](./user_guides/models.md).

### Launch Evaluation

When the config file is ready, we can start the task in **debug mode** to check for any exceptions in model loading, dataset reading, or incorrect cache usage.

```shell
python run.py configs/eval_demo.py -w outputs/demo --debug
```

However, in `--debug` mode, tasks are executed sequentially. After confirming that everything is correct, you
can disable the `--debug` mode to fully utilize multiple GPUs.

```shell
python run.py configs/eval_demo.py -w outputs/demo
```

Here are some parameters related to evaluation that can help you configure more efficient inference tasks based on your environment:

- `-w outputs/demo`: Directory to save evaluation logs and results.
- `-r`: Restart the previous (interrupted) evaluation.
- `--mode all`: Specify which stage(s) of the task to run.
  - all: Perform a complete evaluation, including inference and evaluation.
  - infer: Perform inference on each dataset.
  - eval: Perform evaluation based on the inference results.
  - viz: Display evaluation results only.
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large, and this parameter splits them into multiple sub-tasks so that resources are used efficiently. However, if the partitions are too fine-grained, the overall run may slow down because of the extra model loading time; see the arithmetic sketch after this list.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks. In a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks depends on the available GPU resources and may not be equal to this number.
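
To build intuition for how these two flags interact, here is the back-of-envelope arithmetic (illustrative only, not OpenCompass internals):

```python
# Sketch: how partition size translates into sub-tasks, capped by worker count.
import math

dataset_size = 5000          # hypothetical number of samples in one dataset
max_partition_size = 2000    # --max-partition-size
max_num_workers = 32         # --max-num-workers

num_subtasks = math.ceil(dataset_size / max_partition_size)  # -> 3 sub-tasks
parallelism = min(num_subtasks, max_num_workers)             # actual parallel tasks
print(num_subtasks, parallelism)
```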

If you are not performing the evaluation on your local machine but using a Slurm cluster, you can specify the following parameters:

- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.

```{tip}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task) for details.
```

## Obtaining Evaluation Results

After the evaluation is complete, the evaluation results table will be printed as follows:

```text
dataset    version    metric    mode      opt350m    opt125m
---------  ---------  --------  ------  ---------  ---------
siqa       e78df3     accuracy  gen         21.55      12.44
winograd   b6c7ed     accuracy  ppl         51.23      49.82
```

All run outputs will be saved to the `outputs/demo/` directory with the following structure:

```text
outputs/demo/
├── 20200220_120000
├── 20230220_183030     # one folder per experiment
│   ├── configs         # Dumped config files for record. Multiple configs may be kept if different experiments have been re-run on the same experiment folder
│   ├── logs            # log files for both inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions   # Prediction results for each task
│   ├── results       # Evaluation results for each task
│   └── summary       # Summarized evaluation results for a single experiment
├── ...
```
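
Besides the printed table, the `summary` folder keeps the results on disk; a hedged sketch for loading the newest one (this assumes a CSV summary is emitted and that the work directory was `outputs/demo`):

```python
# Sketch: locate and print the most recent summary file of a work directory.
from pathlib import Path

summaries = sorted(Path("outputs/demo").glob("*/summary/*.csv"))
if summaries:
    print(summaries[-1].read_text())
else:
    print("No summary found; has the evaluation finished?")
```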

## Additional Tutorials

To learn more about using OpenCompass, explore the following tutorials:

- [Prepare Datasets](./user_guides/dataset_prepare.md)
- [Prepare Models](./user_guides/models.md)
- [Task Execution and Monitoring](./user_guides/experimentation.md)
- [Understand Prompts](./prompt/overview.md)
- [Learn about Config](./user_guides/config.md)