Unverified Commit 767c12a6 authored by Tong Gao, committed by GitHub

[Docs] update get_started (#435)



* [Docs] update get_started

* [Docs] Refactor get_started

* update

* add zh FAQ

* add cn doc

* update

* fix dead links

---------
Co-authored-by: Leymore <zfz-960727@163.com>
parent 67382471
......@@ -115,9 +115,12 @@ python run.py --datasets ceval_ppl mmlu_ppl \
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of required GPUs
--num-gpus 1 # Minimum number of required GPUs
```
> **Note**<br />
> To run the command above, you will need to remove the comments starting from `# ` first.
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task.
<p align="right"><a href="#top">🔝Back to top</a></p>
......
......@@ -117,9 +117,12 @@ python run.py --datasets ceval_ppl mmlu_ppl \
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of GPUs required
--num-gpus 1 # Minimum number of GPUs required to run this model
```
> **Note**<br />
> To run the command above, you will need to remove the comments starting from `# ` first.
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/zh_CN/latest/get_started.html#id3) to learn how to run an evaluation task.
For more tutorials, please see our [documentation](https://opencompass.readthedocs.io/zh_CN/latest/index.html).
......
......@@ -54,6 +54,7 @@ extensions = [
'sphinx_tabs.tabs',
'notfound.extension',
'sphinxcontrib.jquery',
'sphinx_design',
]
# Add any paths that contain templates here, relative to this directory.
......
# FAQ
## Network
**Q1**: My tasks failed with error: `('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))` or `urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443)`
A: Because of HuggingFace's implementation, OpenCompass requires network access (especially a connection to HuggingFace) the first time it loads some datasets and models. Additionally, it connects to HuggingFace each time it is launched. For a successful run, you may:
- Work behind a proxy by specifying the environment variables `http_proxy` and `https_proxy`;
- Use the cache files from other machines. You may first run the experiment on a machine that has access to the Internet, and then copy the cached files to the offline one. The cached files are located at `~/.cache/huggingface/` by default ([doc](https://huggingface.co/docs/datasets/cache#cache-directory)). When the cached files are ready, you can start the evaluation in offline mode:
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```
With these set, no network connection is needed for the evaluation. However, an error will still be raised if the cache is missing the files of any required dataset or model.
**Q2**: My server cannot connect to the Internet, how can I use OpenCompass?
Use the cache files from other machines, as suggested in the answer to **Q1**.
**Q3**: In evaluation phase, I'm running into an error saying that `FileNotFoundError: Couldn't find a module script at opencompass/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.`
A: HuggingFace tries to load the metric (e.g. `accuracy`) as an online module, which can fail if the network is unreachable. Please refer to **Q1** for guidelines to fix your network issue.
# FAQ
## General
### How does OpenCompass allocate GPUs?
OpenCompass processes evaluation requests using a unit termed "task". Each task is an independent combination of model(s) and dataset(s). The GPU resources needed for a task are determined entirely by the model being evaluated, specifically by the `num_gpus` parameter.
During evaluation, OpenCompass deploys multiple workers to execute tasks in parallel. These workers continuously try to secure GPU resources and run tasks until they succeed. As a result, OpenCompass always strives to leverage all available GPU resources to their maximum capacity.
For instance, if you're using OpenCompass on a local machine equipped with 8 GPUs, and each task demands 4 GPUs, then by default, OpenCompass will employ all 8 GPUs to concurrently run 2 tasks. However, if you adjust the `--max-num-workers` setting to 1, then only one task will be processed at a time, utilizing just 4 GPUs.
### How do I control the number of GPUs that OpenCompass occupies?
Currently, there isn't a direct method to specify the number of GPUs OpenCompass can utilize. However, the following are some indirect strategies:
**If evaluating locally:**
You can limit OpenCompass's GPU access by setting the `CUDA_VISIBLE_DEVICES` environment variable. For instance, using `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` will only expose the first four GPUs to OpenCompass, ensuring it uses no more than these four GPUs simultaneously.
**If using Slurm or DLC:**
Although OpenCompass doesn't have direct access to the resource pool, you can adjust the `--max-num-workers` parameter to restrict the number of evaluation tasks being submitted simultaneously. This will indirectly manage the number of GPUs that OpenCompass employs. For instance, if each task requires 4 GPUs, and you wish to allocate a total of 8 GPUs, then you should set `--max-num-workers` to 2.
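As an illustration, here is a hedged sketch of such a Slurm launch (`my_part` is a placeholder partition name, and `configs/eval_demo.py` stands in for your own config):
```bash
# Each task needs 4 GPUs (the evaluated model's `num_gpus`); capping the workers
# at 2 keeps at most 2 tasks, i.e. at most 8 GPUs, in flight at any time.
python run.py configs/eval_demo.py --slurm -p my_part --max-num-workers 2
```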
## Network
### My tasks failed with error: `('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))` or `urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443)`
Because of HuggingFace's implementation, OpenCompass requires network access (especially a connection to HuggingFace) the first time it loads some datasets and models. Additionally, it connects to HuggingFace each time it is launched. For a successful run, you may:
- Work behind a proxy by specifying the environment variables `http_proxy` and `https_proxy`;
- Use the cache files from other machines. You may first run the experiment on a machine that has access to the Internet, and then copy the cached files to the offline one. The cached files are located at `~/.cache/huggingface/` by default ([doc](https://huggingface.co/docs/datasets/cache#cache-directory)). When the cached files are ready, you can start the evaluation in offline mode:
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```
With these set, no network connection is needed for the evaluation. However, an error will still be raised if the cache is missing the files of any required dataset or model.
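Putting the two steps together, a minimal sketch of the cache hand-off might look as follows (`user@offline-host` is a placeholder, and `rsync` is just one way to copy the directory):
```bash
# On a machine with Internet access: run once so HuggingFace populates the cache
python run.py configs/eval_demo.py -w outputs/demo

# Copy the cache to the offline machine
rsync -a ~/.cache/huggingface/ user@offline-host:~/.cache/huggingface/

# On the offline machine: evaluate without any further network access
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 \
python run.py configs/eval_demo.py -w outputs/demo
```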
### My server cannot connect to the Internet, how can I use OpenCompass?
Use the cache files from other machines, as suggested in the answer to [Network-Q1](#my-tasks-failed-with-error-connection-aborted-connectionreseterror104-connection-reset-by-peer-or-urllib3exceptionsmaxretryerror-httpsconnectionpoolhostcdn-lfshuggingfaceco-port443).
### In evaluation phase, I'm running into an error saying that `FileNotFoundError: Couldn't find a module script at opencompass/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.`
HuggingFace tries to load the metric (e.g. `accuracy`) as an online module, which can fail if the network is unreachable. Please refer to [Network-Q1](#my-tasks-failed-with-error-connection-aborted-connectionreseterror104-connection-reset-by-peer-or-urllib3exceptionsmaxretryerror-httpsconnectionpoolhostcdn-lfshuggingfaceco-port443) for guidelines to fix your network issue.
The issue has been fixed in the latest version of OpenCompass, so you might also consider pulling the latest version.
## Efficiency
### Why does OpenCompass partition each evaluation request into tasks?
Given the extensive evaluation time and the vast quantity of datasets, conducting a comprehensive linear evaluation on LLM models can be immensely time-consuming. To address this, OpenCompass divides the evaluation request into multiple independent "tasks". These tasks are then dispatched to various GPU groups or nodes, achieving full parallelism and maximizing the efficiency of computational resources.
### How does task partitioning work?
Each task in OpenCompass represents a combination of specific model(s) and portions of the dataset awaiting evaluation. OpenCompass offers a variety of task partitioning strategies, each tailored for different scenarios. During the inference stage, the prevalent partitioning method seeks to balance task size, or computational cost. This cost is heuristically derived from the dataset size and the type of inference.
### Why does it take more time to evaluate LLM models on OpenCompass?
There is a tradeoff between the number of tasks and the time spent loading the model. For example, if we partition a request that evaluates a model against a dataset into 100 tasks, the model will be loaded 100 times in total. When resources are abundant, these 100 tasks can be executed in parallel, so the additional time spent on model loading can be ignored. However, if resources are limited, these 100 tasks will run more sequentially, and the repeated loading can become a bottleneck in execution time.
Hence, if users find that the number of tasks greatly exceeds the available GPUs, we advise setting the `--max-partition-size` to a larger value.
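For example, a coarser partitioning could be requested as follows (40000 is only an illustrative value; tune it to your dataset sizes):
```bash
# Fewer, larger partitions mean fewer tasks, and hence fewer repeated model
# loads when tasks end up running mostly sequentially.
python run.py configs/eval_demo.py --max-partition-size 40000
```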
# Installation
1. Set up the OpenCompass environment:
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
```
If you want to customize the PyTorch version or related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
2. Install OpenCompass:
```bash
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```
3. Install humaneval (Optional)
If you want to **evaluate your model's coding ability on the humaneval dataset**, follow this step.
<details>
<summary><b>click to show the details</b></summary>
```bash
git clone https://github.com/openai/human-eval.git
cd human-eval
pip install -r requirements.txt
pip install -e .
cd ..
```
Please read the comments in `human_eval/execution.py` **lines 48-57** to understand the potential risks of executing model-generated code. If you accept these risks, uncomment **line 58** to enable code execution evaluation.
</details>
4. Install Llama (Optional)
If you want to **evaluate Llama / Llama-2 / Llama-2-chat with its official implementation**, follow this step.
<details>
<summary><b>click to show the details</b></summary>
```bash
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -r requirements.txt
pip install -e .
cd ..
```
You can find example configs in `configs/models`. ([example](https://github.com/open-compass/opencompass/blob/eb4822a94d624a4e16db03adeb7a59bbd10c2012/configs/models/llama2_7b_chat.py))
</details>
# Dataset Preparation
The datasets supported by OpenCompass mainly include two parts:
1. Huggingface datasets: The [Huggingface Datasets](https://huggingface.co/datasets) provide a large number of datasets, which will be **automatically downloaded** at runtime.
2. Custom datasets: OpenCompass also provides some Chinese custom **self-built** datasets. Please run the following command to **manually download and extract** them.
Run the following commands to download and place the datasets in the `${OpenCompass}/data` directory to complete dataset preparation.
```bash
# Run in the OpenCompass directory
wget https://github.com/open-compass/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```
OpenCompass supports most of the datasets commonly used for performance comparison; please refer to `configs/datasets` for the specific list of supported datasets.
For next step, please read [Quick Start](./quick_start.md).
# Installation
1. Set up the OpenCompass environment:
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
```
If you want to customize the PyTorch version or related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
2. Install OpenCompass:
```bash
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```
3. Install humaneval (Optional)
If you want to **evaluate your model's coding ability on the humaneval dataset**, follow this step.
<details>
<summary><b>click to show the details</b></summary>
```bash
git clone https://github.com/openai/human-eval.git
cd human-eval
pip install -r requirements.txt
pip install -e .
cd ..
```
Please read the comments in `human_eval/execution.py` **lines 48-57** to understand the potential risks of executing model-generated code. If you accept these risks, uncomment **line 58** to enable code execution evaluation.
</details>
4. Install Llama (Optional)
If you want to **evaluate Llama / Llama-2 / Llama-2-chat with its official implementation**, follow this step.
<details>
<summary><b>click to show the details</b></summary>
# Quick Start
```bash
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -r requirements.txt
pip install -e .
cd ..
```
![image](https://github.com/open-compass/opencompass/assets/22607038/d063cae0-3297-4fd2-921a-366e0a24890b)
You can find example configs in `configs/models`. ([example](https://github.com/open-compass/opencompass/blob/eb4822a94d624a4e16db03adeb7a59bbd10c2012/configs/models/llama2_7b_chat.py))
## Overview
</details>
OpenCompass provides a streamlined workflow for evaluating a model, which consists of the following stages: **Configure** -> **Inference** -> **Evaluation** -> **Visualization**.
# Dataset Preparation
**Configure**: This is your starting point. Here, you'll set up the entire evaluation process, choosing the model(s) and dataset(s) to assess. You also have the option to select an evaluation strategy, the computation backend, and define how you'd like the results displayed.
The datasets supported by OpenCompass mainly include two parts:
**Inference & Evaluation**: OpenCompass efficiently manages the heavy lifting, conducting parallel inference and evaluation on your chosen model(s) and dataset(s). The **Inference** phase is all about producing outputs from your datasets, whereas the **Evaluation** phase measures how well these outputs align with the gold standard answers. While this procedure is broken down into multiple "tasks" that run concurrently for greater efficiency, be aware that working with limited computational resources might introduce some unexpected overheads, resulting in generally slower evaluation. To understand this issue and learn how to solve it, check out [FAQ: Efficiency](faq.md#efficiency).
1. Huggingface datasets: The [Huggingface Datasets](https://huggingface.co/datasets) provide a large number of datasets, which will be **automatically downloaded** at runtime.
2. Custom datasets: OpenCompass also provides some Chinese custom **self-built** datasets. Please run the following command to **manually download and extract** them.
**Visualization**: Once the evaluation is done, OpenCompass collates the results into an easy-to-read table and saves them as both CSV and TXT files. If you need real-time updates, you can activate lark reporting and get immediate status reports in your Lark clients.
Run the following commands to download and place the datasets in the `${OpenCompass}/data` directory to complete dataset preparation.
```bash
# Run in the OpenCompass directory
wget https://github.com/open-compass/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```
OpenCompass supports most of the datasets commonly used for performance comparison; please refer to `configs/datasets` for the specific list of supported datasets.
# Quick Start
We will demonstrate some basic features of OpenCompass through evaluating pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on both [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winogrande) benchmark tasks with their config file located at [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).
Coming up, we'll walk you through the basics of OpenCompass, showcasing evaluations of pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winograd_wsc) benchmark tasks. Their configuration files can be found at [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).
Before running this experiment, please make sure you have installed OpenCompass locally; it should run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to other examples provided in the [configs directory](https://github.com/open-compass/opencompass/tree/main/configs).
## Configure an Evaluation Task
## Configuring an Evaluation Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.
......@@ -102,7 +40,10 @@ python tools/list_configs.py
python tools/list_configs.py llama mmlu
```
Some sample outputs are:
:::{dropdown} More about `list_configs`
:animate: fade-in-slide-down
Running `python tools/list_configs.py llama mmlu` gives the output like:
```text
+-----------------+-----------------------------------+
......@@ -122,6 +63,18 @@ Some sample outputs are:
```
Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that its prompts or evaluation methods are different.
:::
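For instance, a sketch that picks names straight from the listing above (any model/dataset pair printed by `list_configs` works the same way):
```bash
# Evaluate a model from the `Model` column on a dataset from the `Dataset` column
python run.py --models hf_llama2_13b --datasets cmmlu_gen
```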
:::{dropdown} Model not on the list?
:animate: fade-in-slide-down
If you want to evaluate other models, please check out the "Command Line (Custom HF Model)" tab for the way to construct a custom HF model without a configuration file, or "Configuration File" tab to learn the general way to prepare your model configurations.
:::
````
````{tab} Command Line (Custom HF Model)
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, for the `facebook/opt-125m` model, you can evaluate it with the following command:
......@@ -133,21 +86,40 @@ python run.py --datasets siqa_gen winograd_ppl \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128 \
--num-gpus 1
--num-gpus 1 # Minimum number of required GPUs
```
```{tip}
For all HuggingFace related parameters supported by `run.py`, please read [Initiating Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task).
Note that in this way, OpenCompass only evaluates one model at a time, while other ways can evaluate multiple models at once.
```{caution}
`--num-gpus` does not stand for the actual number of GPUs to use in evaluation, but the minimum required number of GPUs for this model. [More](faq.md#how-does-opencompass-allocate-gpus)
```
:::{dropdown} More detailed example
:animate: fade-in-slide-down
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \ # HuggingFace model path
--tokenizer-path facebook/opt-125m \ # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \ # Arguments to construct the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
```
```{seealso}
For all HuggingFace related parameters supported by `run.py`, please read [Launching Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task).
```
:::
````
````{tab} Configuration File
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. This method of configuration allows users to easily modify experimental parameters, provides a more flexible configuration, and simplifies the run command. The configuration file is organized in Python format and must include the `datasets` and `models` fields.
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. The configuration file is organized in Python format and must include the `datasets` and `models` fields.
The test configuration for this time is [configs/eval_demo.py](/configs/eval_demo.py). This configuration introduces the required dataset and model configurations through the [inheritance mechanism](./user_guides/config.md#inheritance-mechanism) and combines the `datasets` and `models` fields in the required format.
The test configuration for this time is [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py). This configuration introduces the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#inheritance-mechanism) and combines the `datasets` and `models` fields in the required format.
```python
from mmengine.config import read_base
......@@ -168,43 +140,10 @@ When running tasks, we just need to pass the path of the configuration file to `
python run.py configs/eval_demo.py
```
````
`````
The configuration file evaluation method is more concise. The following sections will use this method as an example to explain the other features.
## Run Evaluation
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation for the first run in debug mode and check if there is any problem. In debug mode, the tasks will be executed sequentially and the status will be printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
If everything is fine, you should see "Starting inference process" on screen:
```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
Then you can press `ctrl+c` to interrupt the program, and then run the following command to start the parallel evaluation:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
Now let's go over the configuration file and the launch options used in this case.
```{warning}
OpenCompass usually assumes network is available. If you encounter network issues or wish to run OpenCompass in an offline environment, please refer to [FAQ - Network - Q1](./faq.md#network) for solutions.
```
## Explanations
:::{dropdown} More about `models`
:animate: fade-in-slide-down
### Model list - `models`
OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](/configs/models/hf_opt_350m.py) (`configs/models/hf_opt_350m.py`):
OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](https://github.com/open-compass/opencompass/blob/main/configs/models/opt/hf_opt_350m.py) (`configs/models/opt/hf_opt_350m.py`):
```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
......@@ -231,33 +170,17 @@ opt350m = dict(
)
```
When using configurations, we can specify the relevant files through the command-line argument `--models` or import the model configurations into the `models` list in the configuration file using the inheritance mechanism.
If the HuggingFace model you want to test is not among them, you can also directly specify the related parameters in the command line.
```bash
python run.py \
--hf-path facebook/opt-350m \ # HuggingFace model path
--tokenizer-path facebook/opt-350m \ # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \ # Arguments to construct the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
```
The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
When using configurations, we can specify the relevant files through the command-line argument ` --models` or import the model configurations into the `models` list in the configuration file using the inheritance mechanism.
```{note}
More information about model configuration can be found in [Prepare Models](./user_guides/models.md).
```{seealso}
More information about model configuration can be found in [Prepare Models](../user_guides/models.md).
```
:::
### Dataset list - `datasets`
:::{dropdown} More about `datasets`
:animate: fade-in-slide-down
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line or import related configurations in the configuration file via inheritance.
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line or import related configurations in the configuration file via inheritance.
Below is a dataset-related configuration snippet from `configs/eval_demo.py`:
......@@ -280,31 +203,53 @@ Moreover, [configs/datasets/collections](https://github.com/open-compass/opencom
python run.py --models hf_llama_7b --datasets base_medium
```
```{note}
You can find more information from [Dataset Preparation](./user_guides/dataset_prepare.md).
```{seealso}
You can find more information from [Dataset Preparation](../user_guides/datasets.md).
```
:::
````
`````
```{warning}
OpenCompass usually assumes network is available. If you encounter network issues or wish to run OpenCompass in an offline environment, please refer to [FAQ - Network - Q1](./faq.md#network) for solutions.
```
### Launch Evaluation
The following sections will use configuration-based method as an example to explain the other features.
## Launching Evaluation
When the config file is ready, we can start the task in **debug mode** to check for any exceptions in model loading, dataset reading, or incorrect cache usage.
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in `--debug` mode for the first run and check if there is any problem. In `--debug` mode, the tasks will be executed sequentially and output will be printed in real time.
```shell
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
However, in `--debug` mode, tasks are executed sequentially. After confirming that everything is correct, you can disable the `--debug` mode to fully utilize multiple GPUs.
The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
If everything is fine, you should see "Starting inference process" on screen:
```shell
```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
Then you can press `ctrl+c` to interrupt the program, and run the following command in normal mode:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
In normal mode, the evaluation tasks will be executed in parallel in the background, and their output will be redirected to the output directory `outputs/demo/{TIMESTAMP}`. The progress bar on the frontend only indicates the number of completed tasks, regardless of their success or failure. **Any backend task failure will only trigger a warning message in the terminal.**
:::{dropdown} More parameters in `run.py`
:animate: fade-in-slide-down
Here are some parameters related to evaluation that can help you configure more efficient inference tasks based on your environment:
- `-w outputs/demo`: Directory to save evaluation logs and results.
- `-r`: Restart the previous (interrupted) evaluation.
- `-w outputs/demo`: Work directory to save evaluation logs and results. In this case, the experiment result will be saved to `outputs/demo/{TIMESTAMP}`.
- `-r`: Reuse existing inference results, and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused (see the example after this dropdown).
- `--mode all`: Specify a specific stage of the task.
- all: Perform a complete evaluation, including inference and evaluation.
- all: (Default) Perform a complete evaluation, including inference and evaluation.
- infer: Perform inference on each dataset.
- eval: Perform evaluation based on the inference results.
- viz: Display evaluation results only.
......@@ -317,11 +262,13 @@ If you are not performing the evaluation on your local machine but using a Slurm
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.
```{tip}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task) for details.
```{seealso}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task) for details.
```
## Obtaining Evaluation Results
:::
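To make these options concrete, here is a hedged example of resuming a previous run (the timestamp is a placeholder for an earlier run's folder under `outputs/demo/`):
```bash
# Reuse the inference results stored under outputs/demo/20230220_183030 and
# run only the evaluation stage; finished tasks are skipped.
python run.py configs/eval_demo.py -w outputs/demo -r 20230220_183030 --mode eval
```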
## Visualizing Evaluation Results
After the evaluation is complete, the evaluation results table will be printed as follows:
......@@ -350,15 +297,15 @@ outputs/default/
The summarization process can be further customized in configuration and output the averaged score of some benchmarks (MMLU, C-Eval, etc.).
More information about obtaining evaluation results can be found in [Results Summary](./user_guides/summarizer.md).
More information about obtaining evaluation results can be found in [Results Summary](../user_guides/summarizer.md).
## Additional Tutorials
To learn more about using OpenCompass, explore the following tutorials:
- [Prepare Datasets](./user_guides/dataset_prepare.md)
- [Prepare Models](./user_guides/models.md)
- [Task Execution and Monitoring](./user_guides/experimentation.md)
- [Understand Prompts](./prompt/overview.md)
- [Results Summary](./user_guides/summarizer.md)
- [Learn about Config](./user_guides/config.md)
- [Prepare Datasets](../user_guides/datasets.md)
- [Prepare Models](../user_guides/models.md)
- [Task Execution and Monitoring](../user_guides/experimentation.md)
- [Understand Prompts](../prompt/overview.md)
- [Results Summary](../user_guides/summarizer.md)
- [Learn about Config](../user_guides/config.md)
......@@ -23,8 +23,9 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
:maxdepth: 1
:caption: Get Started
get_started.md
faq.md
get_started/installation.md
get_started/quick_start.md
get_started/faq.md
.. _UserGuides:
.. toctree::
......
......@@ -33,7 +33,7 @@ Task Configuration (`$EXP`):
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of required GPUs
--num-gpus 1 # Minimum number of GPUs required for this model
```
Complete HuggingFace parameter descriptions:
......@@ -47,7 +47,7 @@ Task Configuration (`$EXP`):
- `--max-seq-len`: Maximum sequence length the model can accept
- `--no-batch-padding`: Disable batch padding and infer through a for loop to avoid accuracy loss
- `--batch-size`: Batch size
- `--num-gpus`: Number of GPUs required to run the model
- `--num-gpus`: Number of GPUs required to run the model. Please note that this parameter only specifies the number of GPUs the model needs to run, and does not affect the actual number of GPUs used for the task. Refer to [Efficient Evaluation](./evaluation.md) for more details.
Starting Methods:
......
......@@ -54,6 +54,7 @@ extensions = [
'sphinx_tabs.tabs',
'notfound.extension',
'sphinxcontrib.jquery',
'sphinx_design',
]
# Add any paths that contain templates here, relative to this directory.
......
# Installation
1. Set up the OpenCompass environment:
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
```
If you want to customize the PyTorch version or the related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
2. Install OpenCompass:
```bash
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```
3. Install humaneval (optional):
If you want to **evaluate your model's coding ability on the humaneval dataset**, follow this step; otherwise, skip it.
<details>
<summary><b>click to show the details</b></summary>
```bash
git clone https://github.com/openai/human-eval.git
cd human-eval
pip install -r requirements.txt
pip install -e .
cd ..
```
Please read the comments in `human_eval/execution.py` **lines 48-57** carefully to understand the potential risks of executing model-generated code. If you accept these risks, uncomment **line 58** to enable code execution evaluation.
</details>
4. Install Llama (optional):
If you want to **evaluate Llama / Llama-2 / Llama-2-chat with the official implementation**, follow this step; otherwise, skip it.
<details>
<summary><b>click to show the details</b></summary>
```bash
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -r requirements.txt
pip install -e .
cd ..
```
You can find example configuration files for all Llama / Llama-2 / Llama-2-chat models under `configs/models`. ([example](https://github.com/open-compass/opencompass/blob/eb4822a94d624a4e16db03adeb7a59bbd10c2012/configs/models/llama2_7b_chat.py))
</details>
# Dataset Preparation
The datasets supported by OpenCompass mainly include two parts:
1. Huggingface datasets: [Huggingface Datasets](https://huggingface.co/datasets) provide a large number of datasets, which will be **automatically downloaded** at runtime.
2. Self-built and third-party datasets: OpenCompass also provides some third-party datasets and self-built **Chinese** datasets. Run the following commands to **manually download and extract** them.
Run the following commands in the OpenCompass project root directory to prepare the datasets under `${OpenCompass}/data`:
```bash
wget https://github.com/open-compass/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```
OpenCompass already supports most of the datasets commonly used for performance comparison; please look under `configs/datasets` for the full list of supported datasets.
# Quick Start
We will use the evaluation of the pretrained base models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winogrande) as an example to walk you through some basic features of OpenCompass.
Make sure OpenCompass is installed before running; this experiment can run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to the other examples in [configs](https://github.com/open-compass/opencompass/tree/main/configs).
## Configuring the Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset, and the entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.
`````{tabs}
````{tab} Command Line
Users can combine the models and datasets to be tested via `--models` and `--datasets`.
```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```
The models and datasets are pre-stored as configuration files under `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations with `tools/list_configs.py`.
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
Some sample outputs are:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| ... | ... |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| ... | ... |
+-------------------+---------------------------------------------------+
```
Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally means that its prompts or evaluation methods are different.
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, for the `facebook/opt-125m` model, you can evaluate it with the following command:
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128 \
--num-gpus 1
```
```{tip}
For all HuggingFace related parameters supported by `run.py`, please read [Launching an Evaluation Task](./user_guides/experimentation.md#评测任务发起).
```
````
````{tab} Configuration File
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. This way of configuration allows users to easily modify experimental parameters, provides more flexible configuration, and makes the run command simpler. The configuration file is organized in Python format and must include the `datasets` and `models` fields.
The test configuration for this time is [configs/eval_demo.py](/configs/eval_demo.py). This configuration introduces the required dataset and model configurations through the [inheritance mechanism](./user_guides/config.md#继承机制) and combines the `datasets` and `models` fields in the required format.
```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.opt.hf_opt_125m import opt125m
    from .models.opt.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```
When running tasks, we just need to pass the path of the configuration file to `run.py`:
```bash
python run.py configs/eval_demo.py
```
````
`````
The configuration-file approach is more concise; the following sections will use it as an example to explain the remaining features.
## Running the Evaluation
Since OpenCompass evaluates in parallel by default, we can run in debug mode the first time to spot problems early. In this mode, tasks are executed sequentially and their progress is printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
If everything is fine, you should see "Starting inference process" on screen:
```bash
Loading cached processed dataset at .cache/huggingface/datasets/social_i_qa/default/0.1.0/674d85e42ac7430d3dcd4de7007feaffcb1527c535121e09bab2803fbcc925f8/cache-742512eab30e8c9c.arrow
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
You can then press `ctrl+c` to interrupt debug mode, and run the following command for the parallel evaluation:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
While the demo is running, let's go over the configuration and launch options used in this case.
## Configuration Explained
### Model list - `models`
OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](/configs/models/hf_opt_350m.py) (`configs/models/hf_opt_350m.py`):
```python
# Provides a direct interface to models via HuggingFaceCausalLM
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
    type=HuggingFaceCausalLM,
    # The following are initialization parameters specific to HuggingFaceCausalLM
    path='facebook/opt-350m',  # HuggingFace model path
    tokenizer_path='facebook/opt-350m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),  # Arguments to construct the model
    # The following initialization parameters are required by all models, not specific to HuggingFaceCausalLM
    abbr='opt350m',  # Model abbreviation for result display
    max_seq_len=2048,  # Maximum sequence length the model can accept
    max_out_len=100,  # Maximum number of generated tokens
    batch_size=64,  # Batch size
    run_cfg=dict(num_gpus=1),  # Number of GPUs required to run the model
)
```
When using configurations, we can specify the relevant files through the command-line argument `--models`, or import the model configurations into the `models` list in the experiment configuration file via the inheritance mechanism.
If the HuggingFace model you want to test is not among them, you can also specify the related parameters directly on the command line.
```bash
python run.py \
--hf-path facebook/opt-350m \ # HuggingFace model path
--tokenizer-path facebook/opt-350m \ # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \ # Arguments to construct the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
```
The 'facebook/opt-350m' and 'facebook/opt-125m' weights will be automatically downloaded from HuggingFace at runtime.
```{note}
For more details on the parameters, or on testing API-based and custom models, please read [Prepare Models](./user_guides/models.md).
```
### Dataset list - `datasets`
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can specify them with `--datasets` on the command line, or import related configurations in the configuration file via inheritance.
Below is the dataset-related configuration snippet from `configs/eval_demo.py`:
```python
from mmengine.config import read_base  # Use mmengine.read_base() to read the base configurations

with read_base():
    # Read the required dataset configurations directly from the preset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Winograd's configuration, evaluated by PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # SIQA's configuration, evaluated by generation

datasets = [*siqa_datasets, *winograd_datasets]  # The final config needs to contain the required list of evaluation datasets
```
Dataset configurations usually come in two kinds, 'ppl' and 'gen', indicating the evaluation method used: `ppl` stands for discriminative evaluation, and `gen` for generative evaluation.
Moreover, [configs/datasets/collections](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections) houses various dataset collections for comprehensive evaluation. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To reproduce the results, simply import this file, for example:
```bash
python run.py --models hf_llama_7b --datasets base_medium
```
```{note}
More about dataset configuration can be found in [Dataset Configuration](./user_guides/dataset_prepare.md).
```
```{warning}
OpenCompass usually needs a network connection to the HuggingFace server (https://huggingface.co/) at runtime to download models or datasets. If the connection fails, or you need to run the evaluation offline, refer to [FAQ - Network - Q1](./faq.md#network).
```
### Launching the Evaluation
Once the configuration file is ready, we can start the task in debug mode to check for any exceptions in model loading or dataset reading, such as caches not being used correctly.
```shell
python run.py configs/eval_demo.py -w outputs/demo --debug
```
In `--debug` mode, tasks can only be executed one by one, so after verifying that everything is correct, you can disable `--debug` mode to let the program make full use of multiple GPUs:
```shell
python run.py configs/eval_demo.py -w outputs/demo
```
Here are some evaluation-related parameters that can help you configure more efficient inference tasks based on your environment:
- `-w outputs/demo`: Directory to save evaluation logs and results. If not specified, it defaults to `outputs/default`
- `-r`: Restart the previous (interrupted) evaluation
- `--mode all`: Run only a specific stage of the task
  - all: Perform a complete evaluation, including inference and evaluation
  - infer: Only perform inference on each dataset
  - eval: Only perform evaluation based on the inference results
  - viz: Only display the evaluation results
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large; this parameter splits them into multiple sub-tasks to use resources effectively. If the partitioning is too fine-grained, however, the overall speed may become slower because of the model's long loading time
- `--max-num-workers 32`: Maximum number of tasks launched in parallel. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks; in a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks is limited by available resources such as GPUs, and may not equal this number.
If you are not evaluating on your local machine but using a Slurm cluster, you can specify the following parameters:
- `--slurm`: Submit tasks to the cluster using Slurm
- `--partition(-p) my_part`: Slurm cluster partition
- `--retry 2`: Number of retries for failed tasks
```{tip}
This script also supports submitting tasks to the Alibaba Deep Learning Center (DLC), as well as more customized evaluation strategies. Please refer to [Launching an Evaluation Task](./user_guides/experimentation.md#评测任务发起) for more details.
```
## Evaluation Results
After the evaluation is complete, the results table will be printed as follows:
```text
dataset    version    metric    mode    opt350m    opt125m
---------  ---------  --------  ------  ---------  ---------
siqa       e78df3     accuracy  gen     21.55      12.44
winograd   b6c7ed     accuracy  ppl     51.23      49.82
```
Logs, predictions, and final results of the whole process are placed in the `outputs/demo/` directory, structured as follows:
```text
outputs/default/
├── 20200220_120000
├── 20230220_183030  # one experiment
│   ├── configs      # configs dumped here for each experiment, for traceability
│   ├── logs         # run logs
│   │   ├── eval
│   │   └── infer
│   ├── predictions  # inference results of each task
│   ├── results      # evaluation results of each task
│   └── summary      # aggregated results of each experiment
├── ...
```
The process of printing evaluation results can be further customized to output the average scores of some benchmarks (e.g. MMLU, C-Eval).
More about the result output can be found in [Results Summary](./user_guides/summarizer.md).
## More Tutorials
To learn more about OpenCompass, check out the following links:
- [Dataset Configuration](./user_guides/dataset_prepare.md)
- [Prepare Models](./user_guides/models.md)
- [Task Execution and Monitoring](./user_guides/experimentation.md)
- [Prompt Tuning](./prompt/overview.md)
- [Results Summary](./user_guides/summarizer.md)
- [Learn about Config](./user_guides/config.md)
# FAQ
## General
### How does OpenCompass allocate GPUs?
OpenCompass processes evaluation requests using a unit called "task". Each task is an independent combination of model(s) and dataset(s). The GPU resources required by a task are determined entirely by the model being evaluated, specifically by the `num_gpus` parameter.
During evaluation, OpenCompass deploys multiple workers to execute tasks in parallel. These workers keep trying to acquire GPU resources until they succeed in running a task. As a result, OpenCompass always strives to make full use of all available GPU resources.
For example, if you are using OpenCompass on a local machine equipped with 8 GPUs and each task requires 4 GPUs, then by default OpenCompass will use all 8 GPUs to run 2 tasks concurrently. However, if you set `--max-num-workers` to 1, only one task will be processed at a time, using just 4 GPUs.
### How do I control the number of GPUs that OpenCompass occupies?
Currently, there is no direct way to specify the number of GPUs OpenCompass may use, but here are some indirect strategies:
**If evaluating locally:**
You can limit OpenCompass's GPU access by setting the `CUDA_VISIBLE_DEVICES` environment variable. For example, `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` exposes only the first four GPUs to OpenCompass, ensuring that it never uses more than these four GPUs simultaneously.
**If using Slurm or DLC:**
Although OpenCompass does not have direct access to the resource pool, you can adjust the `--max-num-workers` parameter to limit the number of evaluation tasks submitted at the same time. This indirectly manages the number of GPUs OpenCompass uses. For example, if each task requires 4 GPUs and you want to allocate a total of 8 GPUs, set `--max-num-workers` to 2, as sketched below.
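Here is a minimal sketch of both strategies (`my_part` is a placeholder partition name, and `configs/eval_demo.py` stands in for your own config):
```bash
# Locally: expose only 4 GPUs and run one 4-GPU task at a time
CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py configs/eval_demo.py --max-num-workers 1

# On Slurm: cap the number of concurrently submitted tasks instead
python run.py configs/eval_demo.py --slurm -p my_part --max-num-workers 2
```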
## Network
### Run error: `('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))` or `urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443)`
Because of HuggingFace's implementation, OpenCompass needs network access (especially a connection to HuggingFace) the first time it loads certain datasets and models. In addition, it connects to HuggingFace every time it is launched. For a successful run, you may:
- Work behind a proxy by specifying the environment variables `http_proxy` and `https_proxy`;
- Use the cache files from another machine. First run the experiment on a machine with access to HuggingFace, then copy the cache files to the offline machine. The cache files are located at `~/.cache/huggingface/` by default ([doc](https://huggingface.co/docs/datasets/cache#cache-directory)). When the cache files are ready, you can start the evaluation in offline mode:
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```
This way, the evaluation no longer needs a network connection. However, an error will still be raised if the cache is missing the files of any dataset or model.
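A minimal sketch of this hand-off (`user@offline-host` is a placeholder; `scp -r` is just one way to copy the directory):
```bash
# Copy the populated cache from an online machine to the offline one
scp -r ~/.cache/huggingface user@offline-host:~/.cache/

# Then evaluate offline on that machine
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```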
### My server cannot connect to the Internet. How can I use OpenCompass?
Use the cache files from another machine, as described in [Network-Q1](#运行报错Connection-aborted-ConnectionResetError104-Connection-reset-by-peer-或-urllib3exceptionsMaxRetryError-HTTPSConnectionPoolhostcdn-lfshuggingfaceco-port443).
### In the evaluation phase, I get the error `FileNotFoundError: Couldn't find a module script at opencompass/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.`
HuggingFace tries to load the metric (e.g. `accuracy`) as an online module, which can fail if the network is unreachable. Please refer to [Network-Q1](#运行报错Connection-aborted-ConnectionResetError104-Connection-reset-by-peer-或-urllib3exceptionsMaxRetryError-HTTPSConnectionPoolhostcdn-lfshuggingfaceco-port443) to fix your network issue.
This issue has been fixed in the latest version of OpenCompass, so you may also consider upgrading to the latest version.
## Efficiency
### Why does OpenCompass split each evaluation request into tasks?
Given the long evaluation time and the large number of datasets, a full linear evaluation of LLM models can be extremely time-consuming. To address this, OpenCompass splits an evaluation request into multiple independent "tasks". These tasks are then dispatched to different GPU groups or nodes, achieving full parallelism and maximizing the efficiency of computational resources.
### How does task partitioning work?
Each task in OpenCompass is a combination of a specific model and a portion of the dataset awaiting evaluation. OpenCompass provides a variety of task partitioning strategies, each tailored to different scenarios. During the inference stage, the main partitioning method aims to balance the task size, i.e. the computational cost, which is heuristically derived from the dataset size and the inference type.
### Why does evaluating LLM models on OpenCompass take more time?
There is a trade-off between the number of tasks and the time spent loading models. For example, if a request that evaluates a model against a dataset is split into 100 tasks, the model will be loaded 100 times in total. When resources are abundant, these 100 tasks can run in parallel, so the extra time spent on model loading is negligible. However, if resources are limited, the 100 tasks run more sequentially, and the repeated loading can become the bottleneck of the execution time.
Therefore, if users find that the number of tasks greatly exceeds the available GPUs, we recommend setting `--max-partition-size` to a larger value.
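For example (the value 40000 is only illustrative):
```bash
# Larger partitions mean fewer tasks, and hence fewer repeated model loads
python run.py configs/eval_demo.py --max-partition-size 40000
```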
# Installation
1. Set up the OpenCompass environment:
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
```
If you want to customize the PyTorch version or the related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
2. Install OpenCompass:
```bash
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```
3. Install humaneval (optional):
If you want to **evaluate your model's coding ability on the humaneval dataset**, follow this step; otherwise, skip it.
<details>
<summary><b>click to show the details</b></summary>
```bash
git clone https://github.com/openai/human-eval.git
cd human-eval
pip install -r requirements.txt
pip install -e .
cd ..
```
Please read the comments in `human_eval/execution.py` **lines 48-57** carefully to understand the potential risks of executing model-generated code. If you accept these risks, uncomment **line 58** to enable code execution evaluation.
</details>
4. Install Llama (optional):
If you want to **evaluate Llama / Llama-2 / Llama-2-chat with the official implementation**, follow this step; otherwise, skip it.
<details>
<summary><b>click to show the details</b></summary>
```bash
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -r requirements.txt
pip install -e .
cd ..
```
You can find example configuration files for all Llama / Llama-2 / Llama-2-chat models under `configs/models`. ([example](https://github.com/open-compass/opencompass/blob/eb4822a94d624a4e16db03adeb7a59bbd10c2012/configs/models/llama2_7b_chat.py))
</details>
# Dataset Preparation
The datasets supported by OpenCompass mainly include two parts:
1. Huggingface datasets: [Huggingface Datasets](https://huggingface.co/datasets) provide a large number of datasets, which will be **automatically downloaded** at runtime.
2. Self-built and third-party datasets: OpenCompass also provides some third-party datasets and self-built **Chinese** datasets. Run the following commands to **manually download and extract** them.
Run the following commands in the OpenCompass project root directory to prepare the datasets under `${OpenCompass}/data`:
```bash
wget https://github.com/open-compass/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```
OpenCompass already supports most of the datasets commonly used for performance comparison; please look under `configs/datasets` for the full list of supported datasets.
Next, you can read [Quick Start](./quick_start.md) to learn the basic usage of OpenCompass.
# Quick Start
![image](https://github.com/open-compass/opencompass/assets/22607038/d063cae0-3297-4fd2-921a-366e0a24890b)
## Overview
Evaluating a model in OpenCompass usually involves the following stages: **Configure** -> **Inference** -> **Evaluation** -> **Visualization**.
**Configure**: This is the starting point of the whole workflow. You set up the entire evaluation process, choosing the model(s) and dataset(s) to evaluate. You can also choose the evaluation strategy, the computation backend, and define how the results are displayed.
**Inference & Evaluation**: In this stage, OpenCompass performs parallel inference and evaluation on the models and datasets. The **Inference** stage produces outputs from the datasets, while the **Evaluation** stage measures how well these outputs match the gold standard answers. The two processes are split into multiple concurrently running "tasks" for efficiency; note, however, that with limited computational resources this strategy may make the evaluation slower. See [FAQ: Efficiency](faq.md#效率) for this issue and its solutions.
**Visualization**: Once the evaluation is done, OpenCompass collates the results into an easy-to-read table and saves them as both CSV and TXT files. You can also enable lark reporting to receive timely evaluation status reports in your Lark (Feishu) clients.
Next, we will demonstrate the basic usage of OpenCompass, showcasing the evaluation of the pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winograd_wsc) benchmark tasks. Their configuration files can be found in [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).
Before running this experiment, please make sure you have installed OpenCompass locally. This example can run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to the other examples provided in the [configs directory](https://github.com/open-compass/opencompass/tree/main/configs).
## Configuring an Evaluation Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.
`````{tabs}
````{tab} Command Line
Users can combine the models and datasets they want to test using `--models` and `--datasets`.
```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```
The configuration files of models and datasets are pre-stored in `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations with `tools/list_configs.py`.
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
:::{dropdown} More about `list_configs`
:animate: fade-in-slide-down
Running `python tools/list_configs.py llama mmlu` gives output like:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| ... | ... |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| ... | ... |
+-------------------+---------------------------------------------------+
```
Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that its prompts or evaluation methods are different.
:::
:::{dropdown} Model not on the list?
:animate: fade-in-slide-down
If you want to evaluate other models, please check out the "Command Line (Custom HF Model)" tab for the way to construct a custom HF model without a configuration file, or the "Configuration File" tab to learn the general way to prepare your model configurations.
:::
````
````{tab} Command Line (Custom HF Model)
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, for the `facebook/opt-125m` model, you can evaluate it with the following command:
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128 \
--num-gpus 1 # Minimum number of GPUs required
```
Note that in this way, OpenCompass only evaluates one model at a time, while the other ways can evaluate multiple models at once.
```{caution}
`--num-gpus` does not stand for the actual number of GPUs used in evaluation, but the minimum number of GPUs required by this model. [More](faq.md#opencompass-如何分配-gpu)
```
:::{dropdown} A more detailed example
:animate: fade-in-slide-down
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \ # HuggingFace model path
--tokenizer-path facebook/opt-125m \ # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \ # Arguments to construct the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
```
```{seealso}
For all HuggingFace related parameters supported by `run.py`, please read [Launching an Evaluation Task](../user_guides/experimentation.md#评测任务发起).
```
:::
````
````{tab} Configuration File
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. The configuration file is organized in Python format and must include the `datasets` and `models` fields.
The test configuration for this time is [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py). This configuration introduces the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#继承机制) and combines the `datasets` and `models` fields in the required format.
```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.opt.hf_opt_125m import opt125m
    from .models.opt.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```
When running tasks, we just need to pass the path of the configuration file to `run.py`:
```bash
python run.py configs/eval_demo.py
```
:::{dropdown} More about `models`
:animate: fade-in-slide-down
OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](https://github.com/open-compass/opencompass/blob/main/configs/models/opt/hf_opt_350m.py) (`configs/models/opt/hf_opt_350m.py`):
```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
    type=HuggingFaceCausalLM,
    # Initialization parameters for `HuggingFaceCausalLM`
    path='facebook/opt-350m',
    tokenizer_path='facebook/opt-350m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    # Below are common parameters for all models, not specific to HuggingFaceCausalLM
    abbr='opt350m',  # Model abbreviation for result display
    max_seq_len=2048,  # The maximum length of the entire sequence
    max_out_len=100,  # Maximum number of generated tokens
    batch_size=64,  # Batch size
    run_cfg=dict(num_gpus=1),  # Number of GPUs required by this model
)
```
When using configurations, we can specify the relevant files through the command-line argument `--models`, or import the model configurations into the `models` list in the configuration file using the inheritance mechanism.
```{seealso}
More information about model configuration can be found in [Prepare Models](../user_guides/models.md).
```
:::
:::{dropdown} More about `datasets`
:animate: fade-in-slide-down
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line, or import related configurations in the configuration file via inheritance.
Below is the dataset-related configuration snippet from `configs/eval_demo.py`:
```python
from mmengine.config import read_base  # Use mmengine.read_base() to read the base configurations

with read_base():
    # Read the required dataset configurations directly from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Winograd's configuration, evaluated by PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # SIQA's configuration, evaluated by generation

datasets = [*siqa_datasets, *winograd_datasets]  # The final config needs to contain the required list 'datasets' of evaluation datasets
```
Dataset configurations usually come in two types, 'ppl' and 'gen', indicating the evaluation method used: `ppl` means discriminative evaluation, and `gen` means generative evaluation.
Moreover, [configs/datasets/collections](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections) houses various dataset collections for comprehensive evaluation. OpenCompass usually uses [`base_medium.py`](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections/base_medium.py) for full-scale model testing. To reproduce the results, simply import that file, for example:
```bash
python run.py --models hf_llama_7b --datasets base_medium
```
```{seealso}
You can find more information in [Dataset Configuration](../user_guides/datasets.md).
```
:::
````
`````
```{warning}
OpenCompass usually assumes the network is available at runtime. If you encounter network problems or wish to run OpenCompass in an offline environment, please refer to [FAQ - Network - Q1](./faq.md#网络) for solutions.
```
The following sections will use the configuration-based method as an example to explain the other features.
## Launching the Evaluation
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in `--debug` mode for the first run and check if there is any problem. In `--debug` mode, the tasks will be executed sequentially, and output will be printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
If everything is fine, you should see "Starting inference process" on screen:
```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
Then you can press `ctrl+c` to interrupt the program, and run the following command in normal mode:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
In normal mode, the evaluation tasks will be executed in parallel in the background, and their output will be redirected to the output directory `outputs/demo/{TIMESTAMP}`. The progress bar on the frontend only indicates the number of completed tasks, regardless of their success or failure. **Any backend task failure will only trigger a warning message in the terminal.**
:::{dropdown} More parameters in `run.py`
:animate: fade-in-slide-down
Here are some evaluation-related parameters that can help you configure more efficient inference tasks based on your environment:
- `-w outputs/demo`: Work directory to save evaluation logs and results. In this case, the experiment result will be saved to `outputs/demo/{TIMESTAMP}`.
- `-r`: Reuse existing inference results, and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused (see the example after this dropdown).
- `--mode all`: Specify a specific stage of the task.
  - all: (Default) Perform a complete evaluation, including inference and evaluation.
  - infer: Perform inference on each dataset.
  - eval: Perform evaluation based on the inference results.
  - viz: Display evaluation results only.
- `--max-partition-size 40000`: Dataset partition size. Some datasets may be large; using this parameter splits them into multiple sub-tasks to use resources efficiently. However, if the partitioning is too fine-grained, the overall speed may become slower because of the long model loading time.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks; in a local environment, it specifies the maximum number of tasks executed concurrently. Note that the actual number of parallel tasks depends on the available GPU resources and may not equal this number.
If you are not performing the evaluation on your local machine but using a Slurm cluster, you can specify the following parameters:
- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.
```{seealso}
The entry also supports submitting tasks to the Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](../user_guides/experimentation.md#评测任务发起) for details.
```
:::
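As a concrete illustration, a resume-style invocation might look like this (a sketch; a bare `-r` reuses the latest results in the work directory):
```bash
# Skip finished inference tasks, reuse the newest results under outputs/demo/,
# and run only the evaluation stage
python run.py configs/eval_demo.py -w outputs/demo -r --mode eval
```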
## Visualizing Evaluation Results
After the evaluation is complete, the evaluation results table will be printed as follows:
```text
dataset    version    metric    mode    opt350m    opt125m
---------  ---------  --------  ------  ---------  ---------
siqa       e78df3     accuracy  gen     21.55      12.44
winograd   b6c7ed     accuracy  ppl     51.23      49.82
```
All run outputs are directed to the `outputs/demo/` directory with the following structure:
```text
outputs/default/
├── 20200220_120000
├── 20230220_183030  # one folder per experiment
│   ├── configs      # dumped configs for record keeping; multiple configs may be kept if different experiments were re-run in the same folder
│   ├── logs         # log files of the inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions  # inference results of each task
│   ├── results      # evaluation results of each task
│   └── summary      # aggregated evaluation results of a single experiment
├── ...
```
The process of printing evaluation results can be further customized to output the average scores of some benchmarks (e.g. MMLU, C-Eval).
More about the result output can be found in [Results Summary](../user_guides/summarizer.md).
## More Tutorials
To learn more about OpenCompass, check out the following links:
- [Dataset Configuration](../user_guides/datasets.md)
- [Prepare Models](../user_guides/models.md)
- [Task Execution and Monitoring](../user_guides/experimentation.md)
- [Prompt Tuning](../prompt/overview.md)
- [Results Summary](../user_guides/summarizer.md)
- [Learn about Config](../user_guides/config.md)
......@@ -24,8 +24,9 @@ OpenCompass Getting Started Roadmap
:maxdepth: 1
:caption: Get Started
get_started.md
faq.md
get_started/installation.md
get_started/quick_start.md
get_started/faq.md
.. _教程:
.. toctree::
......
......@@ -4,6 +4,7 @@ myst-parser
-e git+https://github.com/open-compass/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
sphinx==6.1.3
sphinx-copybutton
sphinx-design
sphinx-notfound-page
sphinx-tabs
sphinxcontrib-jquery
......