## Dataset Preparation

The datasets supported by OpenCompass mainly include two parts:
1. Huggingface datasets: The [Huggingface Datasets](https://huggingface.co/datasets) provide a large number of datasets, which will **automatically download** when running with this option.
2. Custom dataset: OpenCompass also provides some Chinese custom **self-built** datasets. Please run the following command to **manually download and extract** them.
Run the following commands to download the datasets and place them in the `${OpenCompass}/data` directory to complete dataset preparation.
```bash
# Run in the OpenCompass directory
...
```
...

We will demonstrate some basic features of OpenCompass by evaluating the pretrained models OPT-125M and OPT-350M on the SIQA and Winograd benchmarks.
Before running this experiment, please make sure you have installed OpenCompass locally and it should run successfully under one _GTX-1660-6G_ GPU.
For larger parameterized models like Llama-7B, refer to other examples provided in the [configs directory](https://github.com/InternLM/opencompass/tree/main/configs).
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in **debug mode** for the first run and check whether there is any problem. In debug mode, the tasks will be executed sequentially and the status will be printed in real time. The exact launch command is shown in the Launch Evaluation section below; first, let's go over the configuration used in this demo.

### Dataset list - `datasets`

The demo config selects the datasets to evaluate by importing their preset configurations:
```python
from mmengine.config import read_base

# OpenCompass configs compose other configs through mmengine's `read_base`
with read_base():
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Load Winograd's configuration, which uses perplexity-based inference
    from .datasets.siqa.siqa_gen import siqa_datasets              # Load SIQA's configuration, which uses generation-based inference

datasets = [*siqa_datasets, *winograd_datasets]  # Concatenate the datasets to be evaluated into the datasets field
```
Various dataset configurations are available in [configs/datasets](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets).
Some datasets have two types of configuration files within their folders named `ppl` and `gen`, representing different evaluation methods. Specifically, `ppl` represents discriminative evaluation, while `gen` stands for generative evaluation.
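For example, to evaluate SIQA discriminatively instead, you could swap in its `ppl` config — a minimal sketch, assuming a `siqa_ppl` configuration exists alongside `siqa_gen` in your checkout:

```python
from mmengine.config import read_base

with read_base():
    # Hypothetical swap: perplexity-based SIQA instead of the generation-based variant
    from .datasets.siqa.siqa_ppl import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets

datasets = [*siqa_datasets, *winograd_datasets]
```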
[configs/datasets/collections](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets/collections) contains various collections of datasets for comprehensive evaluation purposes.
You can find more information in [Dataset Preparation](./user_guides/dataset_prepare.md).
### Model list - `models`
OpenCompass supports directly specifying the list of models to be tested in the configuration. For HuggingFace models, users usually do not need to modify the code. The following is the relevant configuration snippet:
```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
...

opt350m = dict(
    ...
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    # Common parameters for all models, not specific to HuggingFaceCausalLM's initialization parameters
    abbr='opt350m',            # Model abbreviation for result display
    max_seq_len=2048,          # The maximum length of the entire sequence
    max_out_len=100,           # Maximum number of generated tokens
    batch_size=64,             # batch size
    run_cfg=dict(num_gpus=1),  # Run configuration for specifying resource requirements
    ...

opt125m = dict(
    ...
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    # Common parameters for all models, not specific to HuggingFaceCausalLM's initialization parameters
    abbr='opt125m',            # Model abbreviation for result display
    max_seq_len=2048,          # The maximum length of the entire sequence
    max_out_len=100,           # Maximum number of generated tokens
    batch_size=128,            # batch size
    run_cfg=dict(num_gpus=1),  # Run configuration for specifying resource requirements
    ...

models = [opt350m, opt125m]
```
The pretrained models `facebook/opt-350m` and `facebook/opt-125m` will be automatically downloaded from HuggingFace during the first run.
More information about model configuration can be found in [Prepare Models](./user_guides/models.md).
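As an illustration of the same pattern, adding a third HuggingFace model could look like the sketch below. The checkpoint name, batch size, and the `HuggingFaceCausalLM` import path are assumptions to adapt to your own setup, not part of the demo config:

```python
from opencompass.models import HuggingFaceCausalLM  # assumed import path

opt1_3b = dict(
    type=HuggingFaceCausalLM,
    # Initialization parameters passed through to HuggingFace
    path='facebook/opt-1.3b',
    tokenizer_path='facebook/opt-1.3b',
    model_kwargs=dict(device_map='auto'),
    # Common parameters for all models
    abbr='opt1.3b',            # Model abbreviation for result display
    max_seq_len=2048,          # The maximum length of the entire sequence
    max_out_len=100,           # Maximum number of generated tokens
    batch_size=16,             # adjust to fit your GPU memory
    run_cfg=dict(num_gpus=1),  # Run configuration for specifying resource requirements
)

models = [opt350m, opt125m, opt1_3b]
```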
### Launch Evaluation
When the config file is ready, we can start the task in **debug mode** to check for any exceptions in model loading, dataset reading, or incorrect cache usage.
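A debug launch looks like the following — a sketch, assuming the configuration above is saved as `configs/eval_demo.py`; `-w` sets the working directory for outputs:

```bash
# Launch the demo sequentially with real-time status output
python run.py configs/eval_demo.py -w outputs/demo --debug
```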
...
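If you are not running on your local machine but on a Slurm cluster, the same entry point can submit the evaluation as Slurm jobs. A sketch only — the partition name is a placeholder and the exact flags may differ across OpenCompass versions:

```bash
# Submit the demo evaluation to Slurm; replace {PARTITION} with your partition name
python run.py configs/eval_demo.py -w outputs/demo --slurm -p {PARTITION} --debug
```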
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task) for details.
## Obtaining Evaluation Results
After the evaluation is complete, the evaluation results table will be printed as follows:
```text
dataset     version    metric      mode     opt350m    opt125m
----------  ---------  ----------  ------  ---------  ---------
siqa        e78df3     accuracy    gen         21.55      12.44
winograd    b6c7ed     accuracy    ppl         51.23      49.82
```
All run outputs will be directed to the `outputs/demo/` directory with the following structure:
```text
outputs/demo/
├── 20200220_120000
├── 20230220_183030     # one experiment per folder
│   ├── configs         # Dumped config files for record. Multiple configs may be kept if different experiments have been re-run on the same experiment folder
│   ├── logs            # Log files for both inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions     # Prediction results for each task
│   ├── results         # Evaluation results for each task
│   └── summary         # Summarized evaluation results for a single experiment
├── ...
```
Each timestamp folder represents one experiment:

- `configs`: dumped configuration files for this run;
- `logs`: log files for both **inference** and **evaluation** stages;
- `predictions`: per-task inference results in JSON format;
- `results`: per-task evaluation results in JSON format;
- `summary`: summarized evaluation results for the whole experiment.
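For a quick look at the final numbers of a finished run, you can inspect that run's `summary` folder; for example (the timestamp folder name here is illustrative):

```bash
# List the summarized result files of a demo run
ls outputs/demo/20230220_183030/summary/
```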
## Additional Tutorials
To learn more about using OpenCompass, explore the following tutorials:
...

This tutorial mainly focuses on selecting the datasets supported by OpenCompass and preparing their config files. Please make sure you have downloaded the datasets following the steps in [Dataset Preparation](../get_started.md#dataset-preparation).
## Directory Structure of Dataset Configuration Files
...

The naming of a dataset configuration file is made up of `{dataset name}_{evaluation method}_{prompt version}.py`, for example `CLUE_afqmc/CLUE_afqmc_gen_db509b.py`.
In addition, files without a version number, such as: `CLUE_afqmc_gen.py`, point to the latest prompt configuration file of that evaluation method, which is usually the most accurate prompt.
## Dataset Selection
In each dataset configuration file, the dataset will be defined in the `{}_datasets` variable, such as `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_db509b.py`.
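As a minimal sketch, selecting this dataset in your own config looks like the following, assuming the config file sits where the relative import into `configs/datasets` resolves (as in the demo config):

```python
from mmengine.config import read_base

with read_base():
    # Pull in AFQMC through its `{}_datasets` variable
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_db509b import afqmc_datasets

datasets = [*afqmc_datasets]
```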