Unverified Commit 166022f5 authored by Tong Gao, committed by GitHub

[Docs] Update docs for new entry script (#246)



* update docs

* update docs

* update

* update en docs

* update

* update

---------
Co-authored-by: Leymore <zfz-960727@163.com>
parent a4d68407
...@@ -325,9 +325,36 @@ Some third-party features, like Humaneval and Llama, may require additional step
## 🏗️ ️Evaluation

After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared, you can evaluate the performance of the LLaMA-7b model on the MMLU and C-Eval datasets using the following command:

```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
You can also evaluate other HuggingFace models via command line. Taking LLaMA-7b as an example:
```bash
python run.py --datasets ceval_ppl mmlu_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace model path
--model-kwargs device_map='auto' \ # Arguments for model construction
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # Arguments for tokenizer construction
--max-out-len 100 \ # Maximum number of tokens generated
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of required GPUs
```
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task.
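If you want to go the configuration-file route for an API model, the entry looks much like the HuggingFace ones above. The snippet below is only a minimal sketch, assuming the `OpenAI` wrapper from `opencompass.models`; the exact field names and key handling may differ between versions, so treat it as a starting point rather than a reference:

```python
from opencompass.models import OpenAI

# Hypothetical config entry for an API-backed model; field names mirror the HF examples above.
models = [
    dict(
        abbr='gpt-3.5-turbo',       # label used in result tables
        type=OpenAI,                # API-based model wrapper
        path='gpt-3.5-turbo',       # model name on the provider side
        key='YOUR_OPENAI_API_KEY',  # or configure the key through an environment variable
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
    ),
]
```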
## 🔜 Roadmap
......
...@@ -326,7 +326,36 @@ unzip OpenCompassData.zip
## 🏗️ ️Evaluation

After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared, you can evaluate the performance of the LLaMA-7b model on the MMLU and C-Eval datasets with the following command:
```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations with the [tools](./docs/zh_cn/tools.md#ListConfigs).

```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```

You can also evaluate other HuggingFace models via the command line. Taking LLaMA-7b as an example:
```bash
python run.py --datasets ceval_ppl mmlu_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace model path
--model-kwargs device_map='auto' \ # Arguments for model construction
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # Arguments for tokenizer construction
--max-out-len 100 \ # Maximum number of generated tokens
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of required GPUs
```
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/zh_CN/latest/get_started.html#id3) to learn how to run an evaluation task.

For more tutorials, please check our [Documentation](https://opencompass.readthedocs.io/zh_CN/latest/index.html).
......
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.hf_opt_125m import opt125m
    from .models.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
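# New file: configs/models/hf_opt_125m.py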
from opencompass.models import HuggingFaceCausalLM

# OPT-125M
opt125m = dict(
    type=HuggingFaceCausalLM,
    # the following are HuggingFaceCausalLM init parameters
    path='facebook/opt-125m',
    tokenizer_path='facebook/opt-125m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    max_seq_len=2048,
    # the following are not HuggingFaceCausalLM init parameters
    abbr='opt125m',            # Model abbreviation
    max_out_len=100,           # Maximum number of generated tokens
    batch_size=128,
    run_cfg=dict(num_gpus=1),  # Run configuration for specifying resource requirements
)
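# New file: configs/models/hf_opt_350m.py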
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
    type=HuggingFaceCausalLM,
    # the following are HuggingFaceCausalLM init parameters
    path='facebook/opt-350m',
    tokenizer_path='facebook/opt-350m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    max_seq_len=2048,
    # the following are not HuggingFaceCausalLM init parameters
    abbr='opt350m',            # Model abbreviation
    max_out_len=100,           # Maximum number of generated tokens
    batch_size=64,
    run_cfg=dict(num_gpus=1),  # Run configuration for specifying resource requirements
)
...@@ -74,61 +74,133 @@ OpenCompass has supported most of the datasets commonly used for performance com
# Quick Start

We will demonstrate some basic features of OpenCompass through evaluating pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on both [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winogrande) benchmark tasks with their config file located at [configs/eval_demo.py](https://github.com/InternLM/opencompass/blob/main/configs/eval_demo.py).

Before running this experiment, please make sure you have installed OpenCompass locally; the experiment should run successfully on a single _GTX-1660-6G_ GPU.

For larger parameterized models like Llama-7B, refer to other examples provided in the [configs directory](https://github.com/InternLM/opencompass/tree/main/configs).

## Configure an Evaluation Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.
`````{tabs}
````{tab} Command Line
Users can combine the models and datasets they want to test using `--models` and `--datasets`.
```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```

The models and datasets are pre-stored in the form of configuration files in `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations using `tools/list_configs.py`.

```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```

Some sample outputs are:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| ... | ... |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| ... | ... |
+-------------------+---------------------------------------------------+
```
Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that their prompts or evaluation methods differ.
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, for the `facebook/opt-125m` model, you can evaluate it with the following command:
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128 \
--num-gpus 1
```

```{tip}
For all HuggingFace-related parameters supported by `run.py`, please read [Launching an Evaluation Task](./user_guides/experimentation.md#launching-an-evaluation-task).
```
````

````{tab} Configuration File

In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of an experiment in a configuration file and run it directly through `run.py`. This way of configuring makes it easy to modify experimental parameters, allows more flexible configuration, and keeps the run command concise. The configuration file is organized in Python format and must include the `datasets` and `models` fields.

The configuration used in this test is [configs/eval_demo.py](/configs/eval_demo.py). It imports the required dataset and model configurations through the [inheritance mechanism](./user_guides/config.md#inheritance-mechanism) and assembles the `datasets` and `models` fields in the required format.
```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.hf_opt_125m import opt125m
    from .models.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```

When running tasks, we just need to pass the path of the configuration file to `run.py`:
```bash
python run.py configs/eval_demo.py
```
````

`````
The configuration file evaluation method is more concise. The following sections will use this method as an example to explain the other features.
## Run Evaluation
Since OpenCompass launches evaluation processes in parallel by default, we can run the first evaluation in debug mode to check for any problems. In debug mode, the tasks are executed sequentially and the status is printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
If everything is fine, you should see "Starting inference process" on screen:
```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
Then you can press `ctrl+c` to interrupt the program, and then run the following command to start the parallel evaluation:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
Now let's go over the configuration file and the launch options used in this case.
## Explanations
### Model list - `models`

OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](/configs/models/hf_opt_350m.py) (`configs/models/hf_opt_350m.py`):

```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
...@@ -151,35 +223,62 @@ opt350m = dict(
    max_seq_len=2048,          # The maximum length of the entire sequence
    max_out_len=100,           # Maximum number of generated tokens
    batch_size=64,             # Batch size
    run_cfg=dict(num_gpus=1),  # Number of GPUs required to run the model
)
```

When using configurations, we can specify the relevant files through the command-line argument `--models`, or import the model configurations into the `models` list of the configuration file using the inheritance mechanism.

If the HuggingFace model you want to test is not among them, you can also specify the related parameters directly on the command line:

```bash
python run.py \
--hf-path facebook/opt-350m \ # HuggingFace model path
--tokenizer-path facebook/opt-350m \ # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \ # Arguments to construct the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
```

The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
```{note}
More information about model configuration can be found in [Prepare Models](./user_guides/models.md).
```
### Dataset list - `datasets`
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line or import related configurations in the configuration file via inheritance.
Below is a dataset-related configuration snippet from `configs/eval_demo.py`:
```python
from mmengine.config import read_base # Use mmengine.read_base() to read the base configuration
with read_base():
    # Directly read the required dataset configurations from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Read Winograd configuration, evaluated based on PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # Read SIQA configuration, evaluated based on generation
datasets = [*siqa_datasets, *winograd_datasets] # The final config needs to contain the required evaluation dataset list 'datasets'
```
Dataset configurations typically come in two types, 'ppl' and 'gen', indicating the evaluation method used: `ppl` refers to discriminative evaluation, while `gen` refers to generative evaluation.
Moreover, [configs/datasets/collections](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets/collections) houses various dataset collections, making it convenient for comprehensive evaluations. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To replicate results, simply import that file, for example:
```bash
python run.py --models hf_llama_7b --datasets base_medium
```
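Equivalently, when working from a configuration file, the same collection can be pulled in through the inheritance mechanism. The following is only a minimal sketch, assuming that `base_medium.py` exposes a `datasets` list; the model half simply reuses the `opt350m` config shown earlier:

```python
from mmengine.config import read_base

with read_base():
    # assumption: the collection file defines a `datasets` list covering the whole suite
    from .datasets.collections.base_medium import datasets
    from .models.hf_opt_350m import opt350m

models = [opt350m]
```

Running `python run.py` on such a file plays the same role as the `--datasets base_medium` command above.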
```{note}
You can find more information from [Dataset Preparation](./user_guides/dataset_prepare.md).
```
### Launch Evaluation
......
...@@ -70,6 +70,58 @@ python tools/prediction_merger.py CONFIG_PATH [-w WORK_DIR]

- `-w`: Work path, default is `'./outputs/default'`.
## List Configs
This tool can list or search all available model and dataset configurations. It supports fuzzy search, making it convenient for use in conjunction with `run.py`.
Usage:
```bash
python tools/list_configs.py [PATTERN1] [PATTERN2] [...]
```
If executed without any parameters, it will list all model and dataset configurations in the `configs/models` and `configs/datasets` directories by default.
Users can also pass any number of parameters. The script will list all configurations related to the provided strings, supporting fuzzy search and the use of the * wildcard. For example, the following command will list all configurations related to `mmlu` and `llama`:
```bash
python tools/list_configs.py mmlu llama
```
Its output could be:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| hf_llama2_7b | configs/models/hf_llama2_7b.py |
| hf_llama_13b | configs/models/hf_llama_13b.py |
| hf_llama_30b | configs/models/hf_llama_30b.py |
| hf_llama_65b | configs/models/hf_llama_65b.py |
| hf_llama_7b | configs/models/hf_llama_7b.py |
| llama2_13b_chat | configs/models/llama2_13b_chat.py |
| llama2_70b_chat | configs/models/llama2_70b_chat.py |
| llama2_7b_chat | configs/models/llama2_7b_chat.py |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| cmmlu_ppl | configs/datasets/cmmlu/cmmlu_ppl.py |
| cmmlu_ppl_fd1f2f | configs/datasets/cmmlu/cmmlu_ppl_fd1f2f.py |
| mmlu_gen | configs/datasets/mmlu/mmlu_gen.py |
| mmlu_gen_23a9a9 | configs/datasets/mmlu/mmlu_gen_23a9a9.py |
| mmlu_gen_5d1409 | configs/datasets/mmlu/mmlu_gen_5d1409.py |
| mmlu_gen_79e572 | configs/datasets/mmlu/mmlu_gen_79e572.py |
| mmlu_gen_a484b3 | configs/datasets/mmlu/mmlu_gen_a484b3.py |
| mmlu_ppl | configs/datasets/mmlu/mmlu_ppl.py |
| mmlu_ppl_ac766d | configs/datasets/mmlu/mmlu_ppl_ac766d.py |
+-------------------+---------------------------------------------------+
```
## Dataset Suffix Updater

This tool can quickly modify the suffixes of configuration files located under the `configs/datasets` directory, aligning them with the naming conventions based on prompt hash.
......
...@@ -2,18 +2,59 @@

## Launching an Evaluation Task

The program entry for the evaluation task is `run.py`. The usage is as follows:

```shell
python run.py $EXP {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--debug] [-m MODE] [-r [REUSE]] [-w WORKDIR] [-l] [--dry-run]
```

Task Configuration (`$EXP`):

- `run.py` accepts a .py configuration file as the task configuration, which must include the `datasets` and `models` fields.

```bash
python run.py configs/eval_demo.py
```
- If no configuration file is provided, users can also specify models and datasets using `--models MODEL1 MODEL2 ...` and `--datasets DATASET1 DATASET2 ...`:
```bash
python run.py --models hf_opt_350m hf_opt_125m --datasets siqa_gen winograd_ppl
```
- For HuggingFace related models, users can also define a model quickly in the command line through HuggingFace parameters and then specify datasets using `--datasets DATASET1 DATASET2 ...`.
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace model path
--model-kwargs device_map='auto' \ # Parameters for constructing the model
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # Parameters for constructing the tokenizer
--max-out-len 100 \ # Maximum number of generated tokens
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of required GPUs
```
Complete HuggingFace parameter descriptions:
- `--hf-path`: HuggingFace model path
- `--peft-path`: PEFT model path
- `--tokenizer-path`: HuggingFace tokenizer path (if it's the same as the model path, it can be omitted)
- `--model-kwargs`: Parameters for constructing the model
- `--tokenizer-kwargs`: Parameters for constructing the tokenizer
- `--max-out-len`: Maximum generated token count
- `--max-seq-len`: Maximum sequence length the model can accept
- `--no-batch-padding`: Disable batch padding and infer through a for loop to avoid accuracy loss
- `--batch-size`: Batch size
- `--num-gpus`: Number of GPUs required to run the model
Starting Methods:
- Running on local machine: `run.py $EXP`.
- Running with slurm: `run.py $EXP --slurm -p $PARTITION_name`.
- Running with dlc: `run.py $EXP --dlc --aliyun-cfg $AliYun_Cfg`
- Customized starting: `run.py $EXP`. Here, `$EXP` is the configuration file which includes the `eval` and `infer` fields (a rough sketch of these fields follows below). For detailed configurations, please refer to [Efficient Evaluation](./evaluation.md).
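For orientation, the `eval` and `infer` fields typically describe how tasks are partitioned and which runner executes them. The following is only a rough sketch assuming the `SizePartitioner`/`NaivePartitioner`, `LocalRunner`, and OpenICL task classes; see [Efficient Evaluation](./evaluation.md) for the authoritative options:

```python
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

# Sketch only: split inference tasks by size and run them locally in parallel.
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=2000),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        task=dict(type=OpenICLInferTask)),
)

# Sketch only: evaluation tasks are lighter, so a naive partitioner is usually enough.
eval = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        task=dict(type=OpenICLEvalTask)),
)
```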
The parameter explanation is as follows:
......
...@@ -74,62 +74,134 @@ OpenCompass has supported most of the datasets commonly used for performance comparison
# Quick Start

We will use the evaluation of the pretrained base models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winogrande) as an example to walk you through some basic features of OpenCompass.

Make sure OpenCompass is installed before running; this experiment can run successfully on a single _GTX-1660-6G_ GPU.

For models with more parameters, such as Llama-7B, refer to the other examples in [configs](https://github.com/InternLM/opencompass/tree/main/configs).

## Configure an Evaluation Task

In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset, and the entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via the command line or via configuration files.

`````{tabs}
````{tab} Command Line

Users can combine the models and datasets they want to test using `--models` and `--datasets`.

```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```

The models and datasets are pre-stored as configuration files under `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations with `tools/list_configs.py`.

```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```

Some sample outputs are:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| ... | ... |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| ... | ... |
+-------------------+---------------------------------------------------+
```
Users can use the names in the first column as the input arguments for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally means that the prompts or evaluation methods differ.

For HuggingFace models, users can set the model parameters directly from the command line without additional configuration files. For example, the `facebook/opt-125m` model can be evaluated with the following command:

```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128 \
--num-gpus 1
```

```{tip}
For all HuggingFace-related parameters supported by `run.py`, please read [Launching an Evaluation Task](./user_guides/experimentation.md#评测任务发起).
```

````

````{tab} Configuration File

In addition to configuring an experiment through the command line, OpenCompass also allows users to write the full configuration of an experiment into a configuration file and run it directly with `run.py`. This way of configuring makes it easy to modify experimental parameters, allows more flexible configuration, and keeps the run command concise. The configuration file is organized in Python format and must include the `datasets` and `models` fields.

The configuration used in this test is [configs/eval_demo.py](/configs/eval_demo.py). It imports the required dataset and model configurations through the [inheritance mechanism](./user_guides/config.md#继承机制) and assembles the `datasets` and `models` fields in the required format.

```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.hf_opt_125m import opt125m
    from .models.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```

When running a task, we just need to pass the path of the configuration file to `run.py`:
```bash
python run.py configs/eval_demo.py
``` ```
````

`````

The configuration-file approach is more concise; the following sections will use it as the example to explain the remaining features.

## Run Evaluation

Since OpenCompass runs evaluations in parallel by default, we can use debug mode for the first run so that problems surface quickly. In this mode, the tasks are executed sequentially and their progress is printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
If everything is fine, you should see "Starting inference process" on screen:
```bash
Loading cached processed dataset at .cache/huggingface/datasets/social_i_qa/default/0.1.0/674d85e42ac7430d3dcd4de7007feaffcb1527c535121e09bab2803fbcc925f8/cache-742512eab30e8c9c.arrow
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
You can then press `ctrl+c` to interrupt debug mode and run the following command to start the parallel evaluation:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
While the demo is running, let us go over the configuration and the launch options used in this case.

## Configuration Details

### Model list - `models`

OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](/configs/models/hf_opt_350m.py) (`configs/models/hf_opt_350m.py`):
```python
# Interface for directly using HuggingFaceCausalLM models
...@@ -139,48 +211,72 @@ from opencompass.models import HuggingFaceCausalLM
opt350m = dict(
    type=HuggingFaceCausalLM,
    # The following are HuggingFaceCausalLM initialization parameters
    path='facebook/opt-350m',          # HuggingFace model path
    tokenizer_path='facebook/opt-350m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),  # Arguments for model construction
    # The following initialization parameters must be set for all models and are not specific to HuggingFaceCausalLM
    abbr='opt350m',            # Model abbreviation, used for result display
    max_seq_len=2048,          # Maximum sequence length the model can accept
    max_out_len=100,           # Maximum number of generated tokens
    batch_size=64,             # Batch size
    run_cfg=dict(num_gpus=1),  # Number of GPUs required to run the model
)
```

When using configurations, we can specify the relevant files via the command-line argument `--models`, or import the model configurations into the `models` list of the experiment configuration file through the inheritance mechanism.

If the HuggingFace model you want to test is not among them, you can also specify the related parameters directly on the command line:

```bash
python run.py \
--hf-path facebook/opt-350m \ # HuggingFace model path
--tokenizer-path facebook/opt-350m \ # HuggingFace tokenizer path (can be omitted if identical to the model path)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments for tokenizer construction
--model-kwargs device_map='auto' \ # Arguments for model construction
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of generated tokens
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
```
The pretrained weights of 'facebook/opt-350m' and 'facebook/opt-125m' will be downloaded automatically from HuggingFace at runtime.

```{note}
For explanations of more parameters, or for testing API models and custom models, please read [Prepare Models](./user_guides/models.md).
```
### Dataset list - `datasets`

Similar to models, dataset configuration files are provided under `configs/datasets`. Users can specify them with `--datasets` on the command line, or import the related configurations in a configuration file via inheritance.

Below is the dataset-related configuration snippet from `configs/eval_demo.py`:
```python
from mmengine.config import read_base  # Use mmengine.read_base() to read the base configuration
with read_base():
    # Read the required dataset configurations directly from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Read the Winograd configuration, evaluated with PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # Read the SIQA configuration, evaluated with generation
datasets = [*siqa_datasets, *winograd_datasets]  # The final config must contain the list of datasets to evaluate, named 'datasets'
```
Dataset configurations usually come in two types, 'ppl' and 'gen', indicating the evaluation method used: `ppl` refers to discriminative evaluation, while `gen` refers to generative evaluation.

In addition, [configs/datasets/collections](https://github.com/InternLM/OpenCompass/blob/main/configs/datasets/collections) holds various dataset collections, which is convenient for comprehensive evaluation. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To reproduce those results, simply import that file, for example:
```bash
python run.py --models hf_llama_7b --datasets base_medium
```
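In a configuration file, the same collection can be imported through the inheritance mechanism. This is only a minimal sketch, assuming `base_medium.py` exposes a `datasets` list; the model half reuses the `opt350m` config shown earlier:

```python
from mmengine.config import read_base

with read_base():
    # assumption: the collection file defines a `datasets` list covering the whole suite
    from .datasets.collections.base_medium import datasets
    from .models.hf_opt_350m import opt350m

models = [opt350m]
```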
```{note}
More information can be found in [Dataset Preparation](./user_guides/dataset_prepare.md).
```
### Launch Evaluation
......
...@@ -79,6 +79,58 @@ python tools/prediction_merger.py CONFIG_PATH [-w WORK_DIR]

- `-w`: Work path, default is `'./outputs/default'`.
## List Configs
This tool can list or search all available model and dataset configurations. It supports fuzzy search, which makes it convenient to use together with `run.py`.

Usage:
```bash
python tools/list_configs.py [PATTERN1] [PATTERN2] [...]
```
If run without any arguments, it lists all available model and dataset configurations under `configs/models` and `configs/datasets` by default.

Users can also pass in any number of arguments; the script will list all configurations related to the given strings, supporting fuzzy search and the `*` wildcard. For example, the following command lists all configurations related to `mmlu` and `llama`:
```bash
python tools/list_configs.py mmlu llama
```
Its output could be:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| hf_llama2_7b | configs/models/hf_llama2_7b.py |
| hf_llama_13b | configs/models/hf_llama_13b.py |
| hf_llama_30b | configs/models/hf_llama_30b.py |
| hf_llama_65b | configs/models/hf_llama_65b.py |
| hf_llama_7b | configs/models/hf_llama_7b.py |
| llama2_13b_chat | configs/models/llama2_13b_chat.py |
| llama2_70b_chat | configs/models/llama2_70b_chat.py |
| llama2_7b_chat | configs/models/llama2_7b_chat.py |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| cmmlu_ppl | configs/datasets/cmmlu/cmmlu_ppl.py |
| cmmlu_ppl_fd1f2f | configs/datasets/cmmlu/cmmlu_ppl_fd1f2f.py |
| mmlu_gen | configs/datasets/mmlu/mmlu_gen.py |
| mmlu_gen_23a9a9 | configs/datasets/mmlu/mmlu_gen_23a9a9.py |
| mmlu_gen_5d1409 | configs/datasets/mmlu/mmlu_gen_5d1409.py |
| mmlu_gen_79e572 | configs/datasets/mmlu/mmlu_gen_79e572.py |
| mmlu_gen_a484b3 | configs/datasets/mmlu/mmlu_gen_a484b3.py |
| mmlu_ppl | configs/datasets/mmlu/mmlu_ppl.py |
| mmlu_ppl_ac766d | configs/datasets/mmlu/mmlu_ppl_ac766d.py |
+-------------------+---------------------------------------------------+
```
## Dataset Suffix Updater

This tool can quickly modify the suffixes of the configuration files under the `configs/datasets` directory so that they conform to the prompt-hash naming convention.
......
...@@ -5,15 +5,56 @@

The program entry for the evaluation task is `run.py`. The usage is as follows:

```shell
python run.py $EXP {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--debug] [-m MODE] [-r [REUSE]] [-w WORKDIR] [-l] [--dry-run]
```

Task configuration (`$EXP`):

- `run.py` accepts a .py configuration file as the task configuration, which must include the `datasets` and `models` fields.
```bash
python run.py configs/eval_demo.py
```
- If no configuration file is passed in, users can also specify the models and datasets via `--models MODEL1 MODEL2 ...` and `--datasets DATASET1 DATASET2 ...`:
```bash
python run.py --models hf_opt_350m hf_opt_125m --datasets siqa_gen winograd_ppl
```
- For HuggingFace models, users can also quickly define a model on the command line through the HuggingFace parameters, and then specify the datasets via `--datasets DATASET1 DATASET2 ...`.
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace model path
--model-kwargs device_map='auto' \ # Arguments for model construction
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # Arguments for tokenizer construction
--max-out-len 100 \ # Maximum number of generated tokens
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of required GPUs
```
The full set of HuggingFace-related parameters is described below:

- `--hf-path`: HuggingFace model path
- `--peft-path`: PEFT model path
- `--tokenizer-path`: HuggingFace tokenizer path (can be omitted if identical to the model path)
- `--model-kwargs`: Arguments for model construction
- `--tokenizer-kwargs`: Arguments for tokenizer construction
- `--max-out-len`: Maximum number of generated tokens
- `--max-seq-len`: Maximum sequence length the model can accept
- `--no-batch-padding`: Disable batch padding and infer through a for loop to avoid accuracy loss
- `--batch-size`: Batch size
- `--num-gpus`: Number of GPUs required to run the model

Launch methods:

- Running on the local machine: `run.py $EXP`.
- Running with Slurm: `run.py $EXP --slurm -p $PARTITION_name`.
- Running with DLC: `run.py $EXP --dlc --aliyun-cfg $AliYun_Cfg`.
- Customized launch: `run.py $EXP`, where `$EXP` is a configuration file that includes the `eval` and `infer` fields; for detailed configuration, please refer to [Efficient Evaluation](./evaluation.md).

The parameters are explained as follows:
...@@ -26,7 +67,7 @@ python run.py $Config {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--

- `-l`: Enable status reporting to the Lark (Feishu) bot.
- `--dry-run`: When enabled, inference and evaluation tasks are only dispatched but not actually executed, which makes debugging easier.

Taking the run mode `-m all` as an example, the overall workflow is as follows:

1. Read the configuration file and parse out the model, dataset, evaluator, and other configuration information.
2. The evaluation task consists mainly of three stages: inference `infer`, evaluation `eval`, and visualization `viz`. After the inference and evaluation stages are split into sub-tasks by the Partitioner, they are handed to the Runner for parallel execution. A single inference or evaluation task is abstracted as an `OpenICLInferTask` or `OpenICLEvalTask`.
......