# Prompt Template
## Background
In language model evaluation, we often construct prompts from the original dataset according to certain rules to enable the model to answer questions as required.
Typically, we place instructions at the beginning of the prompt, followed by several in-context examples, and finally, we include the question. For example:
```text
Solve the following questions.
1+1=?
2
3+9=?
12
5+6=?
```
Extensive experiments have shown that even with the same original test questions, different ways of constructing the prompt can affect the model's performance. Factors that may influence this include:
- The composition of the prompt itself, including instructions, in-context examples, and the format of the question.
- The selection of in-context examples, including the number and method of selection.
- The manner in which the prompt is used. Should the model complete the prompt based on the given context, or should it choose the best prompt among the candidate prompts?
OpenCompass defines the prompt construction strategy in the `infer_cfg` section of the dataset configuration. A typical `infer_cfg` is shown below:
```python
infer_cfg = dict(
    ice_template=dict(  # Template used to construct In Context Examples (ice).
        type=PromptTemplate,
        template='{question}\n{answer}'
    ),
    prompt_template=dict(  # Template used to construct the main prompt.
        type=PromptTemplate,
        template='Solve the following questions.\n</E>{question}\n{answer}',
        ice_token="</E>"
    ),
    retriever=dict(type=FixKRetriever, fix_id_list=[0, 1]),  # Definition of how to retrieve in-context examples.
    inferencer=dict(type=GenInferencer),  # Method used to generate predictions.
)
```
In this document, we will mainly introduce the definitions of `ice_template`, `prompt_template`, and `inferencer`. For information on the `retriever`, please refer to other documents.
Let's start by introducing the basic syntax of the prompt.
## String-Based Prompt
String-based prompt is a classic form of template. Consider the following template:
```python
prompt_template=dict(
    type=PromptTemplate,
    template="{anything}\nQuestion: {question}\nAnswer: {answer}"
)
```
At runtime, the fields within the `{}` will be replaced with corresponding fields from the data sample. If a field does not exist in the data sample, it will be kept as is in the output.
For example, let's consider a data example as follows:
```python
example = {
    'question': '1+1=?',
    'answer': '2', # Assume the answer is in the reader_cfg.output_column
    'irrelevant_infos': 'blabla',
}
```
After filling in the template, the result will be:
```text
{anything}
Question: 1+1=?
Answer:
```
As you can see, the actual answer for the question, represented by the field `answer`, does not appear in the generated result. This is because OpenCompass will mask fields that are written in `reader_cfg.output_column` to prevent answer leakage. For detailed explanations on `reader_cfg`, please refer to the relevant documentation on dataset configuration.
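To make the two behaviors above concrete (unknown fields kept as-is, the `reader_cfg.output_column` masked), here is a minimal, self-contained sketch. It is not OpenCompass's internal implementation; `fill_string_template` is a hypothetical helper for illustration only:

```python
import re

def fill_string_template(template: str, example: dict, output_column: str) -> str:
    """Fill a string-based template as described above (illustrative only)."""
    def substitute(match):
        field = match.group(1)
        if field == output_column:
            return ''                   # mask the answer to avoid leakage
        if field in example:
            return str(example[field])  # normal substitution
        return match.group(0)           # unknown field: '{anything}' stays as-is
    return re.sub(r'\{(\w+)\}', substitute, template)

example = {'question': '1+1=?', 'answer': '2', 'irrelevant_infos': 'blabla'}
print(fill_string_template(
    '{anything}\nQuestion: {question}\nAnswer: {answer}', example, 'answer'))
# {anything}
# Question: 1+1=?
# Answer:
```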
## Dialogue-Based Prompt
In practical testing, making models perform simple completions may not effectively test the performance of chat-based models. Therefore, we prefer prompts that take the form of dialogues. Additionally, different models have varying definitions of dialogue formats. Hence, we need prompts generated from the dataset to be more versatile, and the specific prompts required by each model can be generated during testing.
To achieve this, OpenCompass extends string-based prompts to dialogue-based prompts. Dialogue-based prompts are more flexible, as they can be combined with different [meta_templates](./meta_template.md) on the model side to generate prompts in various dialogue formats. They are applicable to both base and chat models, though their definition is somewhat more complex.
Now, let's assume we have a data sample as follows:
```python
example = {
    'question': '1+1=?',
    'answer': '2', # Assume the answer is in the reader_cfg.output_column
    'irrelevant_infos': 'blabla',
}
```
Next, let's showcase a few examples:
`````{tabs}
````{tab} Single-round Dialogue
```python
prompt_template=dict(
    type=PromptTemplate,
    template=dict(
        round=[
            dict(role="HUMAN", prompt="Question: {question}"),
            dict(role="BOT", prompt="Answer: {answer}"),
        ]
    )
)
```
The intermediate result obtained by OpenCompass after filling the data into the template is:
```python
PromptList([
    dict(role='HUMAN', prompt='Question: 1+1=?'),
    dict(role='BOT', prompt='Answer: '),
])
```
````
````{tab} Multi-round Dialogue
```python
prompt_template=dict(
    type=PromptTemplate,
    template=dict(
        round=[
            dict(role="HUMAN", prompt="Question: 2+2=?"),
            dict(role="BOT", prompt="Answer: 4"),
            dict(role="HUMAN", prompt="Question: 3+3=?"),
            dict(role="BOT", prompt="Answer: 6"),
            dict(role="HUMAN", prompt="Question: {question}"),
            dict(role="BOT", prompt="Answer: {answer}"),
        ]
    )
)
```
The intermediate result obtained by OpenCompass after filling the data into the template is:
```python
PromptList([
    dict(role='HUMAN', prompt='Question: 2+2=?'),
    dict(role='BOT', prompt='Answer: 4'),
    dict(role='HUMAN', prompt='Question: 3+3=?'),
    dict(role='BOT', prompt='Answer: 6'),
    dict(role='HUMAN', prompt='Question: 1+1=?'),
    dict(role='BOT', prompt='Answer: '),
])
```
````
````{tab} Dialogue with sys instruction
```python
prompt_template=dict(
    type=PromptTemplate,
    template=dict(
        begin=[
            dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
        ],
        round=[
            dict(role="HUMAN", prompt="Question: {question}"),
            dict(role="BOT", prompt="Answer: {answer}"),
        ]
    )
)
```
The intermediate result obtained by OpenCompass after filling the data into the template is:
```python
PromptList([
    dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
    dict(role='HUMAN', prompt='Question: 1+1=?'),
    dict(role='BOT', prompt='Answer: '),
])
```
When a specific meta template is processed, if it defines the SYSTEM role, the template designated for the SYSTEM role will be used. Otherwise, the template of the role specified by `fallback_role` will be used instead, which in this example is the HUMAN role.
````
`````
In dialogue-based templates, prompts are organized in the form of conversations between different roles (`role`). In the current predefined dataset configuration of OpenCompass, some commonly used roles in a prompt include:
- `HUMAN`: Represents a human, usually the one asking questions.
- `BOT`: Represents the language model, usually the one providing answers.
- `SYSTEM`: Represents the system, typically used at the beginning of prompts to give instructions.
Furthermore, unlike string-based templates, the prompts generated by dialogue-based templates are transformed into an intermediate structure called PromptList. This structure will be further combined with the model-side [meta_templates](./meta_template.md) to assemble the final prompt. If no meta template is specified, the prompts in the PromptList will be directly concatenated into a single string.
```{note}
The content within the PromptList in the example above is not the final input to the model and depends on the processing of the meta template. One potential source of misunderstanding is that in generative evaluations, the prompt of the last `BOT` role, `Answer: `, **will not** be fed to the model. This is because API models generally cannot customize the initial part of model-generated responses. Therefore, this setting ensures consistency in the evaluation behavior between language models and API models. For more information, please refer to the documentation on [meta template](./meta_template.md).
```
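To make the note concrete, the snippet below is a minimal, hypothetical sketch (not OpenCompass internals) of what could happen when no meta template is configured: the PromptList entries are simply joined into one string, and for generative evaluation the prompt of the final `BOT` turn is dropped.

```python
prompt_list = [
    dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
    dict(role='HUMAN', prompt='Question: 1+1=?'),
    dict(role='BOT', prompt='Answer: '),
]

def naive_concat(items, generative=True):
    """Illustrative only: the real separator and handling depend on the meta template."""
    if generative and items and items[-1]['role'] == 'BOT':
        items = items[:-1]  # the last BOT prompt is not fed to the model
    return '\n'.join(item['prompt'] for item in items)

print(naive_concat(prompt_list))
# Solve the following questions.
# Question: 1+1=?
```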
<details>
<summary>Expand the complete parameter descriptions</summary>
- `begin`, `end`: (list, optional) The beginning and end of the prompt, typically containing system-level instructions. Each item inside can be **a dictionary or a string**.
- `round`: (list) The format of the dialogue in the template. Each item in the list must be a dictionary.
Each dictionary has the following parameters:
- `role` (str): The role name participating in the dialogue. It is used to associate with the names in meta_template but does not affect the actual generated prompt.
- `fallback_role` (str): The default role name to use in case the associated role is not found in the meta_template. Defaults to None.
- `prompt` (str): The dialogue content for the role.
</details>
## Prompt Templates and `inferencer`
Once we understand the basic definition of prompt templates, we also need to organize them according to the type of `inferencer`.
OpenCompass mainly supports two types of inferencers: `GenInferencer` and `PPLInferencer`, corresponding to two different inference methods.
`GenInferencer` corresponds to generative inference. During inference, the model is asked to continue generating text based on the input prompt. In this case, `template` is a single template applied to every sample, for example:
`````{tabs}
````{group-tab} String-based Prompt
```python
prompt_template=dict(
    type=PromptTemplate,
    template='Solve the following questions.\n{question}\n{answer}'
)
```
````
````{group-tab} Dialogue-Based Prompt
```python
prompt_template=dict(
    type=PromptTemplate,
    template=dict(
        begin=[
            dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
        ],
        round=[
            dict(role="HUMAN", prompt="{question}"),
            dict(role="BOT", prompt="{answer}"),
        ]
    )
)
```
````
`````
Then, the model's inference result will be a continuation of the concatenated string.
`PPLInferencer` corresponds to discriminative inference. During inference, the model is asked to compute the perplexity (PPL) of each candidate input string, and the candidate with the lowest perplexity is selected as the model's prediction. In this case, `template` is a `dict` that maps each candidate answer to its own template, for example:
`````{tabs}
````{group-tab} String-based Prompt
```python
prompt_template=dict(
    type=PromptTemplate,
    template={
        "A": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: A",
        "B": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: B",
        "C": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: C",
        "UNK": "Question: Which is true?\nA. {A}\nB. {B}\nC. {C}\nAnswer: None of them is true.",
    }
)
```
````
````{group-tab} Dialogue-Based Prompt
```python
prompt_template=dict(
    type=PromptTemplate,
    template={
        "A": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: A"),
            ]
        ),
        "B": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: B"),
            ]
        ),
        "C": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: C"),
            ]
        ),
        "UNK": dict(
            round=[
                dict(role="HUMAN", prompt="Question: Which is true?\nA. {A}\nB. {B}\nC. {C}"),
                dict(role="BOT", prompt="Answer: None of them is true."),
            ]
        ),
    }
)
```
````
`````
In this case, the model's inference result will be one of the four keys in the `template` ("A" / "B" / "C" / "UNK").
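Conceptually, the selection step works as sketched below; `compute_ppl` is a hypothetical stand-in for the model's perplexity computation, not an OpenCompass API:

```python
import math

def select_by_ppl(filled_templates: dict, compute_ppl) -> str:
    """filled_templates maps each label ('A'/'B'/'C'/'UNK') to its fully formatted prompt."""
    ppls = {label: compute_ppl(text) for label, text in filled_templates.items()}
    return min(ppls, key=ppls.get)  # the label whose prompt has the lowest perplexity wins

# Toy stand-in for a language model: here, shorter prompts simply get lower "perplexity".
fake_ppl = lambda text: math.log(len(text) + 1)
print(select_by_ppl({'A': 'Answer: A', 'B': 'Answer: B-long'}, fake_ppl))  # -> 'A'
```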
## `ice_template` and `prompt_template`
In OpenCompass, for 0-shot evaluation, we usually only need to define the `prompt_template` field to complete prompt construction. For few-shot evaluation, however, we also need to define the `ice_template` field, which manages the prompt template for the in-context examples.
Both `ice_template` and `prompt_template` follow the same syntax and rules. The complete prompt construction process can be represented using the following pseudo-code:
```python
def build_prompt():
    ice = ice_template.format(*ice_example)
    prompt = prompt_template.replace(prompt_template.ice_token, ice).format(*prompt_example)
return prompt
```
Now, let's assume there are two training samples (ex1, ex2) and one test sample (ex3):
```python
ex1 = {
    'question': '2+2=?',
    'answer': '4',
    'irrelevant_infos': 'blabla',
}
ex2 = {
    'question': '3+3=?',
    'answer': '6',
    'irrelevant_infos': 'blabla',
}
ex3 = {
    'question': '1+1=?',
    'answer': '2', # Assume the answer is in the reader_cfg.output_column
    'irrelevant_infos': 'blabla',
}
```
Next, let's take a look at the actual effects of different prompt construction methods:
`````{tabs}
````{group-tab} String-based Prompt
Template configurations are as follows:
```python
infer_cfg=dict(
    ice_template=dict(
        type=PromptTemplate,
        template='{question}\n{answer}'
    ),
    prompt_template=dict(
        type=PromptTemplate,
        template='Solve the following questions.\n</E>{question}\n{answer}',
        ice_token='</E>',
    )
)
```
The resulting strings are as follows:
```text
Solve the following questions.
2+2=?
4
3+3=?
6
1+1=?
```
````
````{group-tab} Dialogue-Based Prompt
Template configurations are as follows:
```python
infer_cfg=dict(
    ice_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role="HUMAN", prompt="{question}"),
                dict(role="BOT", prompt="{answer}"),
            ]
        )
    ),
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            begin=[
                dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
                '</E>',
            ],
            round=[
                dict(role="HUMAN", prompt="{question}"),
                dict(role="BOT", prompt="{answer}"),
            ],
        ),
        ice_token='</E>',
    )
)
```
The intermediate results obtained by OpenCompass after filling the data into the templates are as follows:
```python
PromptList([
    dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following questions.'),
    dict(role='HUMAN', prompt='2+2=?'),
    dict(role='BOT', prompt='4'),
    dict(role='HUMAN', prompt='3+3=?'),
    dict(role='BOT', prompt='6'),
    dict(role='HUMAN', prompt='1+1=?'),
    dict(role='BOT', prompt=''),
])
```
````
`````
### Abbreviated Usage
It is worth noting that, for the sake of simplicity in the configuration file, the `prompt_template` field can be omitted. When the `prompt_template` field is omitted, the `ice_template` will be used as the `prompt_template` as well, to assemble the complete prompt. The following two `infer_cfg` configurations are equivalent:
<table class="docutils">
<thead>
<tr>
<th>Complete Form</th>
<th>Abbreviated Form</th>
</tr>
</thead>
<tbody>
<tr>
<td>
```python
infer_cfg=dict(
    ice_template=dict(
        type=PromptTemplate,
        template="Q: {question}\nA: {answer}",
    ),
    prompt_template=dict(
        type=PromptTemplate,
        template="</E>Q: {question}\nA: {answer}",
        ice_token="</E>",
    ),
    # ...
)
```
</td>
<td>
```python
infer_cfg=dict(
    ice_template=dict(
        type=PromptTemplate,
        template="</E>Q: {question}\nA: {answer}",
        ice_token="</E>",
    ),
    # ...
)
```
</td>
</tr>
</tbody>
</table>
More generally, even in the case of 0-shot learning (i.e., when the `retriever` is `ZeroRetriever`), this mechanism still applies. Therefore, the following configuration is also valid:
```python
datasets = [
    dict(
        infer_cfg=dict(
            ice_template=dict(
                type=PromptTemplate,
                template="Q: {question}\nA: {answer}",
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer),
        )
    ),
]
```
## Usage Suggestion
It is suggested to use the [Prompt Viewer](../tools.md) tool to visualize the completed prompts, confirm the correctness of the templates, and ensure that the results meet expectations.
````python
#! /usr/bin/env python
from pathlib import Path

import yaml
from tabulate import tabulate

OC_ROOT = Path(__file__).absolute().parents[2]
GITHUB_PREFIX = 'https://github.com/open-compass/opencompass/tree/main/'

DATASETZOO_TEMPLATE = """\
# Dataset Statistics
On this page, we have listed all the datasets supported by OpenCompass.
You can use sorting and search functions to find the dataset you need.
"""

with open('dataset_statistics.md', 'w') as f:
    f.write(DATASETZOO_TEMPLATE)

load_path = str(OC_ROOT / 'dataset-index.yml')
with open(load_path, 'r') as f2:
    data_list = yaml.load(f2, Loader=yaml.FullLoader)

HEADER = ['name', 'category', 'paper', 'configpath']


def table_format(data_list):
    table_format_list = []
    for i in data_list:
        table_format_list_sub = []
        for j in i:
            for index in HEADER:
                if index == 'paper':
                    table_format_list_sub.append('[link](' + i[j][index] + ')')
                elif index == 'configpath':
                    if isinstance(i[j][index], list):
                        sub_list_text = ''
                        for k in i[j][index]:
                            sub_list_text += ('[link](' + GITHUB_PREFIX + k + ') / ')
                        table_format_list_sub.append(sub_list_text[:-2])
                    else:
                        table_format_list_sub.append('[link](' + GITHUB_PREFIX + i[j][index] + ')')
                else:
                    table_format_list_sub.append(i[j][index])
        table_format_list.append(table_format_list_sub)
    return table_format_list


data_format_list = table_format(data_list)


def generate_table(data_list, title=None):
    with open('dataset_statistics.md', 'a') as f:
        if title is not None:
            f.write(f'\n{title}')
        f.write("""\n```{table}\n:class: dataset\n""")
        header = ['Name', 'Category', 'Paper or Repository', 'Config File']
        table_cfg = dict(tablefmt='pipe',
                         floatfmt='.2f',
                         numalign='right',
                         stralign='center')
        f.write(tabulate(data_list, header, **table_cfg))
        f.write('\n```\n')


generate_table(
    data_list=data_format_list,
    title='## Supported Dataset List',
)
````
# Useful Tools
## Prompt Viewer
This tool allows you to directly view the generated prompt without starting the full evaluation process. If the passed configuration is only the dataset configuration (such as `configs/datasets/nq/nq_gen.py`), it will display the original prompt defined in the dataset configuration. If it is a complete evaluation configuration (including the model and the dataset), it will display the prompt received by the selected model during operation.
Running method:
```bash
python tools/prompt_viewer.py CONFIG_PATH [-n] [-a] [-p PATTERN]
```
- `-n`: Do not enter interactive mode, select the first model (if any) and dataset by default.
- `-a`: View the prompts received by all models and all dataset combinations in the configuration.
- `-p PATTERN`: Do not enter interactive mode, select all datasets that match the input regular expression.
## Case Analyzer (To be updated)
Based on existing evaluation results, this tool produces inference error samples and full samples with annotation information.
Running method:
```bash
python tools/case_analyzer.py CONFIG_PATH [-w WORK_DIR]
```
- `-w`: Work path, default is `'./outputs/default'`.
## Lark Bot
Users can configure the Lark bot to implement real-time monitoring of task status. Please refer to [this document](https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN?lang=zh-CN#7a28964d) for setting up the Lark bot.
Configuration method:
- Open the `configs/secrets.py` file, and add the following line to the file:
```python
lark_bot_url = 'YOUR_WEBHOOK_URL'
```
- Normally, the Webhook URL format is like https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx .
- Inherit this file in the complete evaluation configuration.
- To avoid the bot sending messages frequently and causing disturbance, the running status will not be reported automatically by default. If necessary, you can start status reporting through `-l` or `--lark`:
```bash
python run.py configs/eval_demo.py -l
```
## API Model Tester
This tool can quickly test whether the functionality of the API model is normal.
Running method:
```bash
python tools/test_api_model.py [CONFIG_PATH] -n
```
## Prediction Merger
This tool can merge partitioned predictions.
Running method:
```bash
python tools/prediction_merger.py CONFIG_PATH [-w WORK_DIR]
```
- `-w`: Work path, default is `'./outputs/default'`.
## List Configs
This tool can list or search all available model and dataset configurations. It supports fuzzy search, making it convenient for use in conjunction with `run.py`.
Usage:
```bash
python tools/list_configs.py [PATTERN1] [PATTERN2] [...]
```
If executed without any parameters, it will list all model configurations in the `configs/models` and `configs/datasets` directories by default.
Users can also pass any number of parameters. The script will list all configurations related to the provided strings, supporting fuzzy search and the use of the * wildcard. For example, the following command will list all configurations related to `mmlu` and `llama`:
```bash
python tools/list_configs.py mmlu llama
```
Its output could be:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| hf_llama2_7b | configs/models/hf_llama2_7b.py |
| hf_llama_13b | configs/models/hf_llama_13b.py |
| hf_llama_30b | configs/models/hf_llama_30b.py |
| hf_llama_65b | configs/models/hf_llama_65b.py |
| hf_llama_7b | configs/models/hf_llama_7b.py |
| llama2_13b_chat | configs/models/llama2_13b_chat.py |
| llama2_70b_chat | configs/models/llama2_70b_chat.py |
| llama2_7b_chat | configs/models/llama2_7b_chat.py |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| cmmlu_ppl | configs/datasets/cmmlu/cmmlu_ppl.py |
| cmmlu_ppl_fd1f2f | configs/datasets/cmmlu/cmmlu_ppl_fd1f2f.py |
| mmlu_gen | configs/datasets/mmlu/mmlu_gen.py |
| mmlu_gen_23a9a9 | configs/datasets/mmlu/mmlu_gen_23a9a9.py |
| mmlu_gen_5d1409 | configs/datasets/mmlu/mmlu_gen_5d1409.py |
| mmlu_gen_79e572 | configs/datasets/mmlu/mmlu_gen_79e572.py |
| mmlu_gen_a484b3 | configs/datasets/mmlu/mmlu_gen_a484b3.py |
| mmlu_ppl | configs/datasets/mmlu/mmlu_ppl.py |
| mmlu_ppl_ac766d | configs/datasets/mmlu/mmlu_ppl_ac766d.py |
+-------------------+---------------------------------------------------+
```
## Dataset Suffix Updater
This tool can quickly modify the suffixes of configuration files located under the `configs/datasets` directory, aligning them with the naming convention based on the prompt hash.
How to run:
```bash
python tools/update_dataset_suffix.py
```
# Learn About Config
OpenCompass uses the OpenMMLab modern style configuration files. If you are familiar with the OpenMMLab style
configuration files, you can directly refer to
[A Pure Python style Configuration File (Beta)](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta)
to understand the differences between the new-style and original configuration files. If you have not
encountered OpenMMLab style configuration files before, we will explain their usage with a simple example
below. Make sure you have installed the latest version of MMEngine to support the new-style configuration
files.
## Basic Format
OpenCompass configuration files are in Python format, following basic Python syntax. Each configuration item
is specified by defining variables. For example, when defining a model, we use the following configuration:
```python
# model_cfg.py
from opencompass.models import HuggingFaceCausalLM
models = [
    dict(
        type=HuggingFaceCausalLM,
        path='huggyllama/llama-7b',
        model_kwargs=dict(device_map='auto'),
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        max_out_len=50,
        run_cfg=dict(num_gpus=8, num_procs=1),
    )
]
```
When reading the configuration file, use `Config.fromfile` from MMEngine for parsing:
```python
>>> from mmengine.config import Config
>>> cfg = Config.fromfile('./model_cfg.py')
>>> print(cfg.models[0])
{'type': HuggingFaceCausalLM, 'path': 'huggyllama/llama-7b', 'model_kwargs': {'device_map': 'auto'}, ...}
```
## Inheritance Mechanism
OpenCompass configuration files use Python's import mechanism for file inheritance. Note that when inheriting
configuration files, we need to use the `read_base` context manager.
```python
# inherit.py
from mmengine.config import read_base
with read_base():
    from .model_cfg import models  # Inherits the 'models' from model_cfg.py
```
Parse the configuration file using `Config.fromfile`:
```python
>>> from mmengine.config import Config
>>> cfg = Config.fromfile('./inherit.py')
>>> print(cfg.models[0])
{'type': HuggingFaceCausalLM, 'path': 'huggyllama/llama-7b', 'model_kwargs': {'device_map': 'auto'}, ...}
```
## Evaluation Configuration Example
```python
# configs/llama7b.py
from mmengine.config import read_base

with read_base():
    # Read the required dataset configurations directly from the preset dataset configurations
    from .datasets.piqa.piqa_ppl import piqa_datasets
    from .datasets.siqa.siqa_gen import siqa_datasets

# Concatenate the datasets to be evaluated into the datasets field
datasets = [*piqa_datasets, *siqa_datasets]

# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        # Initialization parameters for `HuggingFaceCausalLM`
        path='huggyllama/llama-7b',
        tokenizer_path='huggyllama/llama-7b',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        max_seq_len=2048,
        # Common parameters for all models, not specific to HuggingFaceCausalLM's initialization parameters
        abbr='llama-7b',            # Model abbreviation for result display
        max_out_len=100,            # Maximum number of generated tokens
        batch_size=16,
        run_cfg=dict(num_gpus=1),   # Run configuration for specifying resource requirements
    )
]
```
## Dataset Configuration File Example
In the above example configuration file, we directly inherit the dataset-related configurations. Next, we will
use the PIQA dataset configuration file as an example to demonstrate the meanings of each field in the dataset
configuration file. If you do not intend to modify the prompt for model testing or add new datasets, you can
skip this section.
The PIQA dataset [configuration file](https://github.com/open-compass/opencompass/blob/main/configs/datasets/piqa/piqa_ppl_1cf9f0.py) is as follows.
It is a configuration for evaluating based on perplexity (PPL) and does not use In-Context Learning.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import HFDataset
# Reading configurations
# The loaded dataset is usually organized as dictionaries, specifying the input fields used to form the prompt
# and the output field used as the answer in each sample
piqa_reader_cfg = dict(
    input_columns=['goal', 'sol1', 'sol2'],
    output_column='label',
    test_split='validation',
)

# Inference configurations
piqa_infer_cfg = dict(
    # Prompt generation configuration
    prompt_template=dict(
        type=PromptTemplate,
        # Prompt template, the template format matches the inferencer type specified later
        # Here, to calculate PPL, we need to specify the prompt template for each answer
        template={
            0: 'The following makes sense: \nQ: {goal}\nA: {sol1}\n',
            1: 'The following makes sense: \nQ: {goal}\nA: {sol2}\n'
        }),
    # In-context example configuration, specifying `ZeroRetriever` here, which means no in-context examples are used.
    retriever=dict(type=ZeroRetriever),
    # Inference method configuration
    #  - PPLInferencer uses perplexity (PPL) to obtain answers
    #  - GenInferencer uses the model's generated results to obtain answers
    inferencer=dict(type=PPLInferencer))

# Metric configuration, using Accuracy as the evaluation metric
piqa_eval_cfg = dict(evaluator=dict(type=AccEvaluator))

# Dataset configuration, where all the above variables are parameters for this configuration
# It is a list used to specify the configurations of different evaluation subsets of a dataset.
piqa_datasets = [
    dict(
        type=HFDataset,
        path='piqa',
        reader_cfg=piqa_reader_cfg,
        infer_cfg=piqa_infer_cfg,
        eval_cfg=piqa_eval_cfg)
]
```
For detailed configuration of the **Prompt generation configuration**, you can refer to the [Prompt Template](../prompt/prompt_template.md).
## Advanced Evaluation Configuration
In OpenCompass, we support configuration options such as task partitioner and runner for more flexible and
efficient utilization of computational resources.
By default, we use size-based partitioning for inference tasks. You can specify the sample number threshold
for task partitioning using `--max-partition-size` when starting the task. Additionally, we use local
resources for inference and evaluation tasks by default. If you want to use Slurm cluster resources, you can
use the `--slurm` parameter and the `--partition` parameter to specify the Slurm runner backend when starting
the task.
Furthermore, if the above functionalities do not meet your requirements for task partitioning and runner
backend configuration, you can provide more detailed configurations in the configuration file. Please refer to
[Efficient Evaluation](./evaluation.md) for more information.
# Performance of Common Benchmarks
We have identified several well-known benchmarks for evaluating large language models (LLMs), and provide detailed performance results of famous LLMs on these datasets.
| Model | Version | Metric | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct(lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------ | --------------------------- |
| MMLU | - | naive_average | gen | 83.6 | 84.2 | 84.6 | 80.5 | 77.2 |
| CMMLU | - | naive_average | gen | 71.9 | 72.4 | 74.2 | 70.1 | 59.7 |
| CEval-Test | - | naive_average | gen | 69.7 | 70.5 | 71.7 | 66.9 | 58.7 |
| GaokaoBench | - | weighted_average | gen | 74.8 | 76.0 | 74.2 | 67.8 | 60.0 |
| Triviaqa_wiki(1shot) | 01cf41 | score | gen | 73.1 | 82.9 | 82.4 | 89.8 | 89.7 |
| NQ_open(1shot) | eaf81e | score | gen | 27.9 | 30.4 | 39.4 | 40.1 | 46.8 |
| Race-High | 9a54b6 | accuracy | gen | 89.3 | 89.6 | 90.8 | 89.4 | 84.8 |
| WinoGrande | 6447e6 | accuracy | gen | 80.7 | 83.3 | 84.1 | 69.7 | 76.6 |
| HellaSwag | e42710 | accuracy | gen | 92.7 | 93.5 | 94.6 | 87.7 | 86.1 |
| BBH | - | naive_average | gen | 82.7 | 78.5 | 78.5 | 80.5 | 79.1 |
| GSM-8K | 1d7fe4 | accuracy | gen | 80.5 | 79.7 | 87.7 | 90.2 | 88.3 |
| Math | 393424 | accuracy | gen | 61.9 | 71.2 | 60.2 | 47.1 | 50 |
| TheoremQA | ef26ca | accuracy | gen | 28.4 | 23.3 | 29.6 | 25.4 | 13 |
| HumanEval | 8e312c | humaneval_pass@1 | gen | 74.4 | 82.3 | 76.2 | 72.6 | 72.0 |
| MBPP(sanitized) | 1e1056 | score | gen | 78.6 | 77.0 | 76.7 | 71.6 | 68.9 |
| GPQA_diamond | 4baadb | accuracy | gen | 40.4 | 48.5 | 46.5 | 38.9 | 36.4 |
| IFEval | 3321a3 | Prompt-level-strict-accuracy | gen | 71.9 | 79.9 | 80.0 | 77.1 | 65.8 |
# Configure Datasets
This tutorial mainly focuses on selecting datasets supported by OpenCompass and preparing their config files. Please make sure you have downloaded the datasets following the steps in [Dataset Preparation](../get_started/installation.md#dataset-preparation).
## Directory Structure of Dataset Configuration Files
First, let's introduce the structure under the `configs/datasets` directory in OpenCompass, as shown below:
```
configs/datasets/
├── agieval
├── apps
├── ARC_c
├── ...
├── CLUE_afqmc # dataset
│   ├── CLUE_afqmc_gen_901306.py # different version of config
│   ├── CLUE_afqmc_gen.py
│   ├── CLUE_afqmc_ppl_378c5b.py
│   ├── CLUE_afqmc_ppl_6507d7.py
│   ├── CLUE_afqmc_ppl_7b0c1e.py
│   └── CLUE_afqmc_ppl.py
├── ...
├── XLSum
├── Xsum
└── z_bench
```
In the `configs/datasets` directory, all datasets are placed directly at the top level, and each dataset folder contains multiple configuration variants for that dataset.
A dataset configuration file is named `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` is a configuration for the `CLUE_afqmc` dataset (under the Chinese general-ability category); its evaluation method is `gen`, i.e., generative evaluation, and its prompt version number is `db509b`. Similarly, `CLUE_afqmc_ppl_00b348.py` uses the `ppl` evaluation method, i.e., discriminative evaluation, with prompt version number `00b348`.
In addition, files without a version number, such as `CLUE_afqmc_gen.py`, point to the latest prompt configuration for that evaluation method, which is usually the most accurate prompt.
## Dataset Selection
In each dataset configuration file, the dataset will be defined in the `{}_datasets` variable, such as `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_db509b.py`.
```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```
And `cmnli_datasets` in `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`.
```python
cmnli_datasets = [
    dict(
        type=HFDataset,
        abbr='cmnli',
        path='json',
        split='train',
        data_files='./data/CLUE/cmnli/cmnli_public/dev.json',
        reader_cfg=cmnli_reader_cfg,
        infer_cfg=cmnli_infer_cfg,
        eval_cfg=cmnli_eval_cfg)
]
```
Take these two datasets as examples. If users want to evaluate both of them at the same time, they can create a new configuration file in the `configs` directory and use the import mechanism of the `mmengine` configuration system to assemble the dataset part of the evaluation config, as shown below:
```python
from mmengine.config import read_base

with read_base():
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_db509b import afqmc_datasets
    from .datasets.CLUE_cmnli.CLUE_cmnli_ppl_b78ad4 import cmnli_datasets

datasets = []
datasets += afqmc_datasets
datasets += cmnli_datasets
```
Users can select configuration files for different abilities, datasets, and evaluation methods according to their needs to assemble the dataset part of the evaluation config.
For information on how to start an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.
### Multiple Evaluations on the Dataset
In the dataset configuration, you can set the parameter `n` to perform multiple evaluations on the same dataset and return the average metrics, for example:
```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        n=10,  # Perform 10 evaluations
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```
Additionally, for binary evaluation metrics (such as accuracy, pass-rate, etc.), you can also set the parameter `k` in conjunction with `n` for [G-Pass@k](http://arxiv.org/abs/2412.13147) evaluation. The formula for G-Pass@k is:
```{math}
\text{G-Pass@}k_\tau=E_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],
```
where $n$ is the number of evaluations, and $c$ is the number of times that passed or were correct out of $n$ runs. An example configuration is as follows:
```python
aime2024_datasets = [
    dict(
        abbr='aime2024',
        type=Aime2024Dataset,
        path='opencompass/aime2024',
        k=[2, 4],  # Return results for G-Pass@2 and G-Pass@4
        n=12,      # 12 evaluations
        ...
    )
]
```
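As a sanity check of the formula above, the following short sketch computes G-Pass@k for a single sample from `n` runs with `c` passes. It is an illustrative implementation of the formula only, not OpenCompass's evaluator code:

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k runs drawn from n (c of which passed) pass."""
    upper = min(c, k)  # terms with j > k contribute nothing
    total = sum(comb(c, j) * comb(n - c, k - j) for j in range(ceil(tau * k), upper + 1))
    return total / comb(n, k)

# e.g. 12 runs, 9 of them correct, draw k=4, require at least half of them to be correct
print(round(g_pass_at_k(n=12, c=9, k=4, tau=0.5), 4))  # 0.9818
```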
# Tutorial for Evaluating Reasoning Models
OpenCompass provides an evaluation tutorial for DeepSeek R1 series reasoning models (mathematical datasets).
- At the model level, we recommend using the sampling approach to reduce repetitions caused by greedy decoding
- For datasets with limited samples, we employ multiple evaluation runs and take the average
- For answer validation, we utilize LLM-based verification to reduce misjudgments from rule-based evaluation
## Installation and Preparation
Please follow OpenCompass's installation guide.
## Evaluation Configuration Setup
We provide example configurations in `example/eval_deepseek_r1.py`. Below is the configuration explanation:
### Configuration Interpretation
#### 1. Dataset and Validator Configuration
```python
# Configuration supporting multiple runs (example)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets
datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

# LLM validator configuration. Users need to deploy API services via LMDeploy/vLLM/SGLang or use OpenAI-compatible endpoints
verifier_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',  # Replace with actual path
    key='YOUR_API_KEY',  # Use real API key
    openai_api_base=['http://your-api-endpoint'],  # Replace with API endpoint
    query_per_second=16,
    batch_size=1024,
    temperature=0.001,
    max_out_len=16384,
)

# Apply validator to all datasets
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
#### 2. Model Configuration
We provide an example configuration that uses LMDeploy as the inference backend for the reasoning model; users can modify `path` (i.e., the HuggingFace model path) as needed:
```python
# LMDeploy model configuration example
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    # Extendable 14B/32B configurations...
]
```
#### 3. Evaluation Process Configuration
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
```
#### 4. Summary Configuration
```python
# Multiple runs results average configuration
summary_groups = [
    {
        'name': 'AIME2024-Average8',
        'subsets': [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)],
    },
    # Other dataset average configurations...
]

summarizer = dict(
    dataset_abbrs=[
        ['AIME2024-Average8', 'naive_average'],
        # Other dataset metrics...
    ],
    summary_groups=summary_groups,
)

# Work directory configuration
work_dir = "outputs/deepseek_r1_reasoning"
```
## Evaluation Execution
### Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using a total of 1 GPU
```bash
opencompass example/eval_deepseek_r1.py --debug --dump-eval-details
```
Evaluation logs will be output in the command line.
### Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using a total of 8 GPUs
You need to modify the `infer` configuration in the configuration file and set `num_worker` to 8
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
At the same time, remove the `--debug` parameter from the evaluation command
```bash
opencompass example/eval_deepseek_r1.py --dump-eval-details
```
In this mode, OpenCompass will use multithreading to start `$num_worker` tasks. Specific logs will not be displayed in the command line, instead, detailed evaluation logs will be shown under `$work_dir`.
### Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using a total of 8 GPUs
Note that in the model configuration, `num_gpus` in `run_cfg` needs to be set to 2 (if an inference backend is used, the corresponding parameter, such as `tp` in LMDeploy, also needs to be changed to 2). At the same time, set `num_worker` in the `infer` configuration to 4:
```python
models += [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-14b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=128,
        run_cfg=dict(num_gpus=2),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]
```
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
### Evaluation Results
The evaluation results are displayed as follows:
```bash
dataset            version    metric         mode    deepseek-r1-distill-qwen-7b-turbomind
-----------------  ---------  -------------  ------  ---------------------------------------
MATH               -          -              -
AIME2024-Average8  -          naive_average  gen     56.25
```
## Performance Baseline
Since the model uses Sampling for decoding, and the AIME dataset size is small, there may still be a performance fluctuation of 1-3 points even when averaging over 8 evaluations.
| Model | Dataset | Metric | Value |
| ---------------------------- | -------- | -------- | ----- |
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |
# Efficient Evaluation
OpenCompass supports custom task partitioners (`Partitioner`), which enable flexible division of evaluation tasks. In conjunction with `Runner`, which controls the platform for task execution, such as a local machine or a cluster, OpenCompass can distribute large evaluation tasks to a vast number of computing nodes. This helps utilize computational resources efficiently and significantly accelerates the evaluation process.
By default, OpenCompass hides these details from users and automatically selects the recommended execution strategies. But users can still customize these strategies of the workflows according to their needs, just by adding the `infer` and/or `eval` fields to the configuration file:
```python
from opencompass.partitioners import SizePartitioner, NaivePartitioner
from opencompass.runners import SlurmRunner, LocalRunner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=5000),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=64,
        task=dict(type=OpenICLInferTask),
        retry=5),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=32,
        task=dict(type=OpenICLEvalTask)),
)
```
The example above demonstrates the way to configure the execution strategies for the inference and evaluation stages. At the inference stage, the task will be divided into several sub-tasks, each of 5000 samples, and then submitted to the Slurm cluster for execution, where there are at most 64 tasks running in parallel. At the evaluation stage, each single model-dataset pair forms a task, and 32 processes are launched locally to compute the metrics.
The following sections will introduce the involved modules in detail.
## Task Partition (Partitioner)
Due to the long inference time of large language models and the vast amount of evaluation datasets, serial execution of a single evaluation task can be quite time-consuming. OpenCompass allows custom task partitioners (`Partitioner`) to divide large evaluation tasks into numerous independent smaller tasks, thus fully utilizing computational resources via parallel execution. Users can configure the task partitioning strategies for the inference and evaluation stages via `infer.partitioner` and `eval.partitioner`. Below, we will introduce all the partitioning strategies supported by OpenCompass.
### `NaivePartitioner`
This partitioner dispatches each combination of a model and dataset as an independent task. It is the most basic partitioning strategy and does not have any additional parameters.
```python
from opencompass.partitioners import NaivePartitioner
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    # ...
)
```
### `SizePartitioner`
```{warning}
For now, this partitioner is not suitable for evaluation tasks (`OpenICLEvalTask`).
```
This partitioner estimates the inference cost (time) of a dataset according to its size, multiplied by an empirical expansion coefficient. It then creates tasks by splitting larger datasets and merging smaller ones to ensure the inference costs of each sub-task are as equal as possible.
The commonly used parameters for this partitioner are as follows:
```python
from opencompass.partitioners import SizePartitioner
infer = dict(
    partitioner=dict(
        type=SizePartitioner,
        max_task_size=2000,  # (int) Maximum size of each task
        gen_task_coef=20,    # (int) Expansion coefficient for generative tasks
    ),
    # ...
)
```
`SizePartitioner` estimates the inference cost of a dataset based on the type of the inference task and selects different expansion coefficients accordingly. For generative tasks, such as those using `GenInferencer`, a larger `gen_task_coef` is set; for discriminative tasks, like those using `PPLInferencer`, the number of labels in the prompt is used.
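The estimate can be pictured with the rough sketch below; it only illustrates the idea described above and is not the exact `SizePartitioner` logic:

```python
def estimated_cost(num_samples: int, is_gen_task: bool,
                   gen_task_coef: int = 20, num_labels: int = 1) -> int:
    """Dataset size scaled by gen_task_coef for generative tasks, or by the label count for PPL tasks."""
    factor = gen_task_coef if is_gen_task else num_labels
    return num_samples * factor

# A 1000-sample generative dataset counts as 20000 "units" and would be split under
# max_task_size=2000, while a 4-label PPL dataset of the same size counts as 4000 units.
print(estimated_cost(1000, is_gen_task=True))                  # 20000
print(estimated_cost(1000, is_gen_task=False, num_labels=4))   # 4000
```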
```{note}
Currently, this partitioning strategy is still rather crude and does not accurately reflect the computational difference between generative and discriminative tasks. We look forward to the community proposing better partitioning strategies :)
```
## Execution Backend (Runner)
In a multi-card, multi-machine cluster environment, if we want to implement parallel execution of multiple tasks, we usually need to rely on a cluster management system (like Slurm) for task allocation and scheduling. In OpenCompass, task allocation and execution are uniformly handled by the Runner. Currently, it supports both Slurm and PAI-DLC scheduling backends, and also provides a `LocalRunner` to directly launch tasks on the local machine.
### `LocalRunner`
`LocalRunner` is the most basic runner; it runs tasks in parallel on the local machine.
```python
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
infer = dict(
    # ...
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,  # Maximum number of processes to run in parallel
        task=dict(type=OpenICLInferTask),  # Task to be run
    )
)
```
```{note}
The actual number of running tasks is limited by both the available GPU resources and the number of workers.
```
### `SlurmRunner`
`SlurmRunner` submits tasks to run on the Slurm cluster. The commonly used configuration fields are as follows:
```python
from opencompass.runners import SlurmRunner
from opencompass.tasks import OpenICLInferTask
infer = dict(
    # ...
    runner=dict(
        type=SlurmRunner,
        task=dict(type=OpenICLInferTask),  # Task to be run
        max_num_workers=16,  # Maximum concurrent task count
        retry=2,  # Retry count for failed tasks, can prevent accidental errors
    ),
)
```
### `DLCRunner`
`DLCRunner` submits tasks to run on Alibaba's Deep Learning Center (DLC). This Runner depends on `dlc`. Firstly, you need to prepare `dlc` in the environment:
```bash
cd ~
wget https://dlc-cli.oss-cn-zhangjiakou.aliyuncs.com/light/binary/linux/amd64/dlc
chmod +x ./dlc
sudo ln -rs dlc /usr/local/bin
./dlc config
```
Fill in the necessary information according to the prompts and get the `dlc` configuration file (like `/user/.dlc/config`) to complete the preparation. Then, specify the `DLCRunner` configuration in the configuration file as per the format:
```python
from opencompass.runners import DLCRunner
from opencompass.tasks import OpenICLInferTask
infer = dict(
    # ...
    runner=dict(
        type=DLCRunner,
        task=dict(type=OpenICLInferTask),  # Task to be run
        max_num_workers=16,  # Maximum concurrent task count
        aliyun_cfg=dict(
            bashrc_path="/user/.bashrc",  # Path to the bashrc for initializing the running environment
            conda_env_name='opencompass',  # Conda environment for OpenCompass
            dlc_config_path="/user/.dlc/config",  # Configuration file for dlc
            workspace_id='ws-xxx',  # DLC workspace ID
            worker_image='xxx',  # Image url for running tasks
        ),
        retry=2,  # Retry count for failed tasks, can prevent accidental errors
    ),
)
```
## Task
A Task is a fundamental module in OpenCompass, a standalone script that executes the computationally intensive operations. Each task is designed to load a configuration file to determine parameter settings, and it can be executed in two distinct ways:
1. Instantiate a Task object and then call `task.run()`, as sketched below.
2. Call the `get_command` method, passing in the config path and a command template string that contains `{task_cmd}` as a placeholder (e.g. `srun {task_cmd}`). The returned command string is the full command and can be executed directly.
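A rough sketch of the two launch styles is given below. The constructor and `get_command` signatures shown here are assumptions for illustration and may differ from the installed version:

```python
from mmengine.config import Config
from opencompass.tasks import OpenICLInferTask

cfg = Config.fromfile('configs/eval_demo.py')  # any config with `models` and `datasets`

# Way 1: instantiate the task and run it in the current process (assumed constructor).
task = OpenICLInferTask(cfg)
task.run()

# Way 2: build the full shell command from a command template containing {task_cmd},
# e.g. to submit it through a scheduler such as Slurm (assumed signature).
cmd = task.get_command(cfg_path='configs/eval_demo.py', template='srun {task_cmd}')
print(cmd)
```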
As of now, OpenCompass supports the following task types:
- `OpenICLInferTask`: Perform LM Inference task based on OpenICL framework.
- `OpenICLEvalTask`: Perform LM Evaluation task based on OpenEval framework.
In the future, more task types will be supported.
# Task Execution and Monitoring
## Launching an Evaluation Task
The program entry for the evaluation task is `run.py`. The usage is as follows:
```shell
python run.py $EXP {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--debug] [-m MODE] [-r [REUSE]] [-w WORKDIR] [-l] [--dry-run] [--dump-eval-details]
```
Task Configuration (`$EXP`):
- `run.py` accepts a .py configuration file as task-related parameters, which must include the `datasets` and `models` fields.
```bash
python run.py configs/eval_demo.py
```
- If no configuration file is provided, users can also specify models and datasets using `--models MODEL1 MODEL2 ...` and `--datasets DATASET1 DATASET2 ...`:
```bash
python run.py --models hf_opt_350m hf_opt_125m --datasets siqa_gen winograd_ppl
```
- For HuggingFace related models, users can also define a model quickly in the command line through HuggingFace parameters and then specify datasets using `--datasets DATASET1 DATASET2 ...`.
```bash
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path huggyllama/llama-7b
```
Complete HuggingFace parameter descriptions:
- `--hf-path`: HuggingFace model path
- `--peft-path`: PEFT model path
- `--tokenizer-path`: HuggingFace tokenizer path (if it's the same as the model path, it can be omitted)
- `--model-kwargs`: Parameters for constructing the model
- `--tokenizer-kwargs`: Parameters for constructing the tokenizer
- `--max-out-len`: Maximum generated token count
- `--max-seq-len`: Maximum sequence length the model can accept
- `--batch-size`: Batch size
- `--hf-num-gpus`: Number of GPUs required to run the model. Please note that this parameter is only used to determine the number of GPUs required to run the model, and does not affect the actual number of GPUs used for the task. Refer to [Efficient Evaluation](./evaluation.md) for more details.
Starting Methods:
- Running on local machine: `run.py $EXP`.
- Running with slurm: `run.py $EXP --slurm -p $PARTITION_name`.
- Running with dlc: `run.py $EXP --dlc --aliyun-cfg $AliYun_Cfg`
- Customized starting: `run.py $EXP`. Here, $EXP is the configuration file which includes the `eval` and `infer` fields. For detailed configurations, please refer to [Efficient Evaluation](./evaluation.md).
The parameter explanation is as follows:
- `-p`: Specify the slurm partition;
- `-q`: Specify the slurm quotatype (default is None), with optional values being reserved, auto, spot. This parameter may only be used in some slurm variants;
- `--debug`: When enabled, inference and evaluation tasks will run in single-process mode, and output will be echoed in real-time for debugging;
- `-m`: Running mode, default is `all`. It can be specified as `infer` to only run inference and obtain output results; if there are already model outputs in `{WORKDIR}`, it can be specified as `eval` to only run evaluation and obtain evaluation results; if the evaluation results are ready, it can be specified as `viz` to only run visualization, which summarizes the results in tables; if specified as `all`, a full run will be performed, which includes inference, evaluation, and visualization.
- `-r`: Reuse existing inference results, and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused.
- `-w`: Specify the working path, default is `./outputs/default`.
- `-l`: Enable status reporting via Lark bot.
- `--dry-run`: When enabled, inference and evaluation tasks will be dispatched but won't actually run for debugging.
- `--dump-eval-details`: When enabled, the evaluation results under the `results` folder will include more details, such as the correctness of each sample.
Using run mode `-m all` as an example, the overall execution flow is as follows:
1. Read the configuration file, parse out the model, dataset, evaluator, and other configuration information
2. The evaluation task mainly includes three stages: inference `infer`, evaluation `eval`, and visualization `viz`. After task division by Partitioner, they are handed over to Runner for parallel execution. Individual inference and evaluation tasks are abstracted into `OpenICLInferTask` and `OpenICLEvalTask` respectively.
3. After each stage ends, the visualization stage will read the evaluation results in `results/` to generate a table.
## Task Monitoring: Lark Bot
Users can enable real-time monitoring of task status by setting up a Lark bot. Please refer to [this document](https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN?lang=zh-CN#7a28964d) for setting up the Lark bot.
Configuration method:
1. Open the `configs/lark.py` file, and add the following line:
```python
lark_bot_url = 'YOUR_WEBHOOK_URL'
```
Typically, the Webhook URL is formatted like this: https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxxxxxxxxxx .
2. Inherit this file in the complete evaluation configuration:
```python
from mmengine.config import read_base
with read_base():
    from .lark import lark_bot_url
```
3. To avoid frequent messages from the bot becoming a nuisance, status updates are not automatically reported by default. You can start status reporting using `-l` or `--lark` when needed:
```bash
python run.py configs/eval_demo.py -p {PARTITION} -l
```
## Run Results
All run results will be placed in the `outputs/default/` directory by default; the directory structure is shown below:
```
outputs/default/
├── 20200220_120000
├── ...
├── 20230220_183030
│ ├── configs
│ ├── logs
│ │ ├── eval
│ │ └── infer
│ ├── predictions
│ │ └── MODEL1
│ └── results
│ └── MODEL1
```
Each timestamp contains the following content:
- configs folder, which stores the configuration files corresponding to each run with this timestamp as the output directory;
- logs folder, which stores the output log files of the inference and evaluation phases, each folder will store logs in subfolders by model;
- predictions folder, which stores the inferred json results, with a model subfolder;
- results folder, which stores the evaluated json results, with a model subfolder.
Also, if `-r` is used without specifying a timestamp, the most recent folder (sorted by name) in the working path will be selected as the output directory.
## Introduction of the Summarizer (to be updated)
# Overview
## Evaluation Targets
The primary evaluation targets of this algorithm library are large language models. Using large language models as an example, we introduce the specific types of models that can be evaluated.
- Base Model: Typically obtained through self-supervised training on massive textual data (e.g., OpenAI's GPT-3, Meta's LLaMA). These models usually have powerful text continuation capabilities.
- Chat Model: Often built upon the base model and refined through instruction fine-tuning or human preference alignment (e.g., OpenAI's ChatGPT, Shanghai AI Lab's InternLM-Chat). These models can understand human instructions and have strong conversational skills.
## Tool Architecture
![framework-en](https://github.com/open-compass/opencompass/assets/17680578/b4d4bf4b-a673-4efe-b522-9337d4f7391a)
- Model Layer: This encompasses the primary model categories involved in large model evaluations. OpenCompass focuses on base models and chat models for in-depth evaluations.
- Capability Layer: OpenCompass evaluates models based on general capabilities and special features. In terms of general capabilities, models are evaluated on language, knowledge, understanding, reasoning, safety, and other dimensions. In terms of special capabilities, evaluations are based on long texts, code, tools, and knowledge enhancement.
- Method Layer: OpenCompass uses both objective and subjective evaluation methods. Objective evaluations can quickly assess a model's capability in tasks with definite answers (like multiple choice, fill in the blanks, closed-ended questions), while subjective evaluations measure user satisfaction with the model's replies. OpenCompass uses both model-assisted subjective evaluations and human feedback-driven subjective evaluations.
- Tool Layer: OpenCompass offers extensive functionalities for automated, efficient evaluations of large language models. This includes distributed evaluation techniques, prompt engineering, integration with evaluation databases, leaderboard publishing, report generation, and many more features.
## Capability Dimensions
### Design Philosophy
To accurately, comprehensively, and systematically assess the capabilities of large language models, OpenCompass takes a general AI perspective, integrating cutting-edge academic advancements and industrial best practices to propose an evaluation system tailored for real-world applications. OpenCompass's capability dimensions cover both general capabilities and special features.
### General Capabilities
General capabilities encompass examination, knowledge, language, understanding, reasoning, and safety, forming a comprehensive evaluation system across these six dimensions.
#### Examination Capability
This dimension aims to provide evaluation support from the perspective of human development, borrowing the classification logic of pedagogy. Centered on compulsory education, higher education, and vocational training, it forms a comprehensive approach to evaluating academic capability.
#### Knowledge Capability
Knowledge capability gauges the model's grasp of various types of knowledge, including but not limited to general world knowledge and domain-specific expertise. The expectation is that the model can answer a wide range of knowledge-based questions accurately and comprehensively.
#### Reasoning Capability
Reasoning is a crucial dimension for general AI. This evaluates the model's reasoning skills, including but not limited to mathematical computation, logical reasoning, causal inference, code generation and modification, and more.
#### Understanding Capability
This dimension evaluates the model's comprehension of text, including:
- Rhetorical techniques understanding and analysis: Grasping various rhetorical techniques used in text and analyzing and interpreting them.
- Text content summarization: Summarizing and extracting information from given content.
- Content creation: Open-ended or semi-open-ended content creation based on given themes or requirements.
#### Language Capability
This dimension evaluates the model's prior language knowledge, which includes but is not limited to:
- Word recognition and generation: Understanding language at the word level and tasks like word recognition, classification, definition, and generation.
- Grammar understanding and correction: Grasping grammar within the text and identifying and correcting grammatical errors.
- Cross-language translation: Translating given source language into target languages, assessing multilingual capabilities of current large models.
#### Safety Capability
In conjunction with the technical features of large language models, OpenCompass assesses the legality, compliance, and safety of model outputs, aiding the development of safe and responsible large models. This capability includes but is not limited to:
- Fairness
- Legality
- Harmlessness
- Ethical considerations
- Privacy protection
## Evaluation Methods
OpenCompass adopts a combination of objective and subjective evaluations. For capability dimensions and scenarios with definite answers, a comprehensive assessment of model capabilities is conducted using a well-constructed test set. For open-ended or semi-open-ended questions and model safety issues, a combination of objective and subjective evaluation methods is used.
### Objective Evaluation
For objective questions with standard answers, we can compare the discrepancy between the model's output and the standard answer using quantitative indicators. Given the high degree of freedom in the outputs of large language models, it is essential during evaluation to standardize and design their inputs and outputs so as to minimize the influence of noisy outputs and ensure a more comprehensive and objective assessment.
To better elicit the model's abilities in the evaluation domain and guide the model to output answers following specific templates, OpenCompass employs prompt engineering and in-context learning for objective evaluations.
In practice, we usually adopt the following two methods to evaluate model outputs:
- **Discriminative Evaluation**: This approach combines questions with candidate answers, calculates the model's perplexity on all combinations, and selects the answer with the lowest perplexity as the model's final output.
- **Generative Evaluation**: Used for generative tasks like language translation, code generation, logical analysis, etc. The question is used as the model's original input, leaving the answer area blank for the model to fill in. Post-processing of the output is often required to ensure it meets dataset requirements.
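As a concrete illustration, below is a minimal sketch of how these two settings typically appear in a dataset's `infer_cfg`; the field names `{question}`, `{A}`, and `{B}` are illustrative placeholders rather than fields of any specific dataset.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer, GenInferencer

# Discriminative evaluation: build one prompt per candidate answer and
# pick the candidate whose prompt has the lowest perplexity.
ppl_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template={
            'A': 'Question: {question}\nAnswer: {A}',
            'B': 'Question: {question}\nAnswer: {B}',
        },
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=PPLInferencer),
)

# Generative evaluation: let the model complete the answer freely;
# the output is usually post-processed before metric calculation.
gen_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template='Question: {question}\nAnswer: ',
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)
```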
### Subjective Evaluation (Upcoming)
Language expression is lively and varied, and many scenarios and capabilities can't be judged solely by objective indicators. For evaluations like model safety and language capabilities, subjective evaluations based on human feelings better reflect the model's actual capabilities and align more with real-world applications.
OpenCompass's subjective evaluation approach relies on test subjects' personal judgments to assess chat-capable large language models. In practice, we pre-construct a set of subjective test questions based on model capabilities, present the replies of different models to the same question to the subjects, and collect their subjective scores. Given the high cost of subjective testing, this approach also uses high-performing large language models to simulate human subjective scoring. Actual evaluations combine real human expert subjective evaluations with model-based subjective scores.
In conducting subjective evaluations, OpenCompass uses both the **Single Model Reply Satisfaction Statistics** method and the **Multiple Model Satisfaction Comparison** method.
# Metric Calculation
In the evaluation phase, we typically select the corresponding evaluation metric strategy based on the characteristics of the dataset itself. The main criterion is the **type of standard answer**, generally including the following types:
- **Choice**: Common in classification tasks, true/false questions, and multiple-choice questions. This type of dataset currently makes up the largest proportion, with datasets such as MMLU, CEval, etc. Accuracy is usually used as the evaluation standard (`ACCEvaluator`).
- **Phrase**: Common in Q&A and reading comprehension tasks. This type mainly includes the CLUE_CMRC, CLUE_DRCD, and DROP datasets. Match rate is usually used as the evaluation standard (`EMEvaluator`).
- **Sentence**: Common in translation and pseudocode/command-line generation tasks, mainly including the Flores, Summscreen, Govrepcrs, and Iwslt2017 datasets. BLEU (Bilingual Evaluation Understudy) is usually used as the evaluation standard (`BleuEvaluator`).
- **Paragraph**: Common in text summarization tasks; commonly used datasets include Lcsts, TruthfulQA, Xsum, etc. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually used as the evaluation standard (`RougeEvaluator`).
- **Code**: Common in code generation tasks; commonly used datasets include Humaneval, MBPP, etc. Execution pass rate and `pass@k` are usually used as the evaluation standard. At present, OpenCompass supports `MBPPEvaluator` and `HumanEvalEvaluator`.
There is also a type of **scoring-based** evaluation task without standard answers, such as judging whether a model's output is toxic, which can directly use a related API service for scoring. At present, `ToxicEvaluator` is supported, and the realtoxicityprompts dataset currently uses this evaluation method.
## Supported Evaluation Metrics
Currently, the commonly used Evaluators in OpenCompass are mainly located in the [`opencompass/openicl/icl_evaluator`](https://github.com/open-compass/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder, and some dataset-specific metrics are placed under [`opencompass/datasets`](https://github.com/open-compass/opencompass/tree/main/opencompass/datasets). Below is a summary:
| Evaluation Strategy | Evaluation Metrics | Common Postprocessing Method | Datasets |
| --------------------- | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
| `ACCEvaluator` | Accuracy | `first_capital_postprocess` | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
| `EMEvaluator` | Match Rate | None, dataset-specific | drop, CLUE_CMRC, CLUE_DRCD |
| `BleuEvaluator` | BLEU | None, `flores` | flores, iwslt2017, summscreen, govrepcrs |
| `RougeEvaluator` | ROUGE | None, dataset-specific | truthfulqa, Xsum, XLSum |
| `JiebaRougeEvaluator` | ROUGE | None, dataset-specific | lcsts |
| `HumanEvalEvaluator`  | pass@k               | `humaneval_postprocess`      | humaneval                                                             |
| `MBPPEvaluator` | Execution Pass Rate | None | mbpp |
| `ToxicEvaluator` | PerspectiveAPI | None | realtoxicityprompts |
| `AGIEvalEvaluator` | Accuracy | None | agieval |
| `AUCROCEvaluator` | AUC-ROC | None | jigsawmultilingual, civilcomments |
| `MATHEvaluator` | Accuracy | `math_postprocess` | math |
| `MccEvaluator` | Matthews Correlation | None | -- |
| `SquadEvaluator` | F1-scores | None | -- |
## How to Configure
The evaluation standard configuration is generally placed in the dataset configuration file, and the final `xxdataset_eval_cfg` is passed to `dataset.eval_cfg` as an instantiation parameter.
Below is the definition of `govrepcrs_eval_cfg`; for the complete configuration, refer to [configs/datasets/govrepcrs](https://github.com/open-compass/opencompass/tree/main/configs/datasets/govrepcrs).
```python
from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import GovRepcrsDataset
from opencompass.utils.text_postprocessors import general_cn_postprocess
govrepcrs_reader_cfg = dict(.......)
govrepcrs_infer_cfg = dict(.......)
# Configuration of evaluation metrics
govrepcrs_eval_cfg = dict(
evaluator=dict(type=BleuEvaluator), # Use the common translator evaluator BleuEvaluator
pred_role='BOT', # Accept 'BOT' role output
pred_postprocessor=dict(type=general_cn_postprocess), # Postprocessing of prediction results
dataset_postprocessor=dict(type=general_cn_postprocess)) # Postprocessing of dataset standard answers
govrepcrs_datasets = [
dict(
type=GovRepcrsDataset, # Dataset class name
path='./data/govrep/', # Dataset path
abbr='GovRepcrs', # Dataset alias
reader_cfg=govrepcrs_reader_cfg, # Dataset reading configuration file, configure its reading split, column, etc.
infer_cfg=govrepcrs_infer_cfg, # Dataset inference configuration file, mainly related to prompt
eval_cfg=govrepcrs_eval_cfg) # Dataset result evaluation configuration file, evaluation standard, and preprocessing and postprocessing.
]
```
# Prepare Models
To support the evaluation of new models in OpenCompass, there are several ways:
1. HuggingFace-based models
2. API-based models
3. Custom models
## HuggingFace-based Models
In OpenCompass, we support constructing evaluation models directly from HuggingFace's
`AutoModel.from_pretrained` and `AutoModelForCausalLM.from_pretrained` interfaces. If the model to be
evaluated follows the typical generation interface of HuggingFace models, there is no need to write code. You
can simply specify the relevant configurations in the configuration file.
Here is an example configuration file for a HuggingFace-based model:
```python
# Use `HuggingFace` to evaluate models supported by AutoModel.
# Use `HuggingFaceCausalLM` to evaluate models supported by AutoModelForCausalLM.
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
# Parameters for `HuggingFaceCausalLM` initialization.
path='huggyllama/llama-7b',
tokenizer_path='huggyllama/llama-7b',
tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
max_seq_len=2048,
batch_padding=False,
# Common parameters shared by various models, not specific to `HuggingFaceCausalLM` initialization.
abbr='llama-7b', # Model abbreviation used for result display.
max_out_len=100, # Maximum number of generated tokens.
batch_size=16, # The size of a batch during inference.
run_cfg=dict(num_gpus=1), # Run configuration to specify resource requirements.
)
]
```
Explanation of some of the parameters:
- `batch_padding=False`: If set to False, each sample in a batch will be inferred individually. If set to True,
a batch of samples will be padded and inferred together. For some models, such padding may lead to
unexpected results. If the model being evaluated supports sample padding, you can set this parameter to True
to speed up inference.
- `padding_side='left'`: Perform padding on the left side. Not all models support padding, and padding on the
right side may interfere with the model's output.
- `truncation_side='left'`: Perform truncation on the left side. The input prompt for evaluation usually
consists of both the in-context examples prompt and the input prompt. If the right side of the input prompt
is truncated, it may cause the input of the generation model to be inconsistent with the expected format.
Therefore, if necessary, truncation should be performed on the left side.
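As a small illustration (plain HuggingFace `transformers` usage rather than OpenCompass-specific code), the two options above correspond to standard tokenizer settings:
```python
from transformers import AutoTokenizer

# padding_side='left' prepends pad tokens, so generation continues right after the prompt;
# truncation_side='left' drops the oldest in-context examples first and keeps the question intact.
tokenizer = AutoTokenizer.from_pretrained(
    'huggyllama/llama-7b',
    padding_side='left',
    truncation_side='left',
)
```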
During evaluation, OpenCompass will instantiate the evaluation model based on the `type` and the
initialization parameters specified in the configuration file. Other parameters are used for inference,
summarization, and other processes related to the model. For example, in the above configuration file, we will
instantiate the model as follows during evaluation:
```python
model = HuggingFaceCausalLM(
path='huggyllama/llama-7b',
tokenizer_path='huggyllama/llama-7b',
tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
max_seq_len=2048,
)
```
## API-based Models
Currently, OpenCompass supports API-based model inference for the following:
- OpenAI (`opencompass.models.OpenAI`)
- ChatGLM from ZhiPu AI (`opencompass.models.ZhiPuAI`)
- ABAB-Chat from MiniMax (`opencompass.models.MiniMax`)
- Spark from XunFei (`opencompass.models.XunFei`)
Let's take the OpenAI configuration file as an example to see how API-based models are used in the
configuration file.
```python
from opencompass.models import OpenAI
models = [
dict(
type=OpenAI, # Using the OpenAI model
# Parameters for `OpenAI` initialization
path='gpt-4', # Specify the model type
key='YOUR_OPENAI_KEY', # OpenAI API Key
max_seq_len=2048, # The max input number of tokens
# Common parameters shared by various models, not specific to `OpenAI` initialization.
abbr='GPT-4', # Model abbreviation used for result display.
max_out_len=512, # Maximum number of generated tokens.
batch_size=1, # The size of a batch during inference.
run_cfg=dict(num_gpus=0), # Resource requirements (no GPU needed)
),
]
```
We have provided several example configurations for API-based models. Please refer to:
```bash
configs
├── eval_zhipu.py
├── eval_xunfei.py
└── eval_minimax.py
```
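Assuming the API key fields in the chosen configuration have been filled in, these configs are launched the same way as any other evaluation, for example:
```bash
python run.py configs/eval_zhipu.py
```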
## Custom Models
If the above methods do not support your model evaluation requirements, you can refer to
[Supporting New Models](../advanced_guides/new_model.md) to add support for new models in OpenCompass.
# Results Summary
After the evaluation is complete, the results need to be printed on the screen or saved. This process is controlled by the summarizer.
```{note}
If the summarizer appears in the overall config, all the evaluation results will be output according to the following logic.
If the summarizer does not appear in the overall config, the evaluation results will be output in the order they appear in the `dataset` config.
```
## Example
A typical summarizer configuration file is as follows:
```python
summarizer = dict(
dataset_abbrs = [
'race',
'race-high',
'race-middle',
],
summary_groups=[
{'name': 'race', 'subsets': ['race-high', 'race-middle']},
]
)
```
The output is:
```text
dataset version metric mode internlm-7b-hf
----------- --------- ------------- ------ ----------------
race - naive_average ppl 76.23
race-high 0c332f accuracy ppl 74.53
race-middle 0c332f accuracy ppl 77.92
```
The summarizer reads the evaluation scores from the `{work_dir}/results/` directory, using the `models` and `datasets` in the config as the full set, and displays them in the order of the `summarizer.dataset_abbrs` list. Moreover, the summarizer tries to compute aggregated metrics using `summarizer.summary_groups`. The metric named `name` is generated only if all values in `subsets` exist; if some scores are missing, the aggregated metric will also be missing. If a score cannot be fetched by either of the above methods, the summarizer puts `-` in the corresponding cell of the table.
In addition, the output consists of multiple columns:
- The `dataset` column corresponds to the `summarizer.dataset_abbrs` configuration.
- The `version` column is the hash value of the dataset configuration, which takes into account the dataset's evaluation method, prompts, output length limit, etc. Users can use this column to verify whether two evaluation results are comparable.
- The `metric` column indicates the evaluation method of this metric. For details, see [metrics](./metrics.md).
- The `mode` column indicates how the inference result is obtained; possible values are `ppl` / `gen`. For items in `summarizer.summary_groups`, if all `subsets` were obtained in the same way, its value matches theirs; otherwise it is `mixed`.
- The subsequent columns represent different models.
## Field Description
The fields of summarizer are explained as follows:
- `dataset_abbrs`: (list, optional) The list of items to display. If omitted, all evaluation results will be output.
- `summary_groups`: (list, optional) Configuration for aggregated metrics.
The fields in `summary_groups` are:
- `name`: (str) Name of the aggregated metric.
- `subsets`: (list) Names of the metrics to be aggregated. Note that an entry can be not only an original `dataset_abbr` but also the name of another aggregated metric.
- `weights`: (list, optional) Weights of the metrics being aggregated. If omitted, the default is to use unweighted averaging.
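For instance, following the field descriptions above, a hypothetical weighted group (with `weights` aligned positionally with `subsets`) might look like:
```python
summary_groups = [
    # Hypothetical example: 'race-high' contributes twice as much as 'race-middle'.
    {'name': 'race-weighted', 'subsets': ['race-high', 'race-middle'], 'weights': [2, 1]},
]
```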
Please note that we have stored the summary groups of datasets like MMLU, C-Eval, etc., under the `configs/summarizers/groups` path. It is recommended to use them first where possible.
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.8"
formats:
- epub
sphinx:
configuration: docs/zh_cn/conf.py
python:
install:
- requirements: requirements/docs.txt
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.header-logo {
background-image: url("../image/logo.svg");
background-size: 275px 80px;
height: 80px;
width: 275px;
}
@media screen and (min-width: 1100px) {
.header-logo {
top: -25px;
}
}
pre {
white-space: pre;
}
@media screen and (min-width: 2000px) {
.pytorch-content-left {
width: 1200px;
margin-left: 30px;
}
article.pytorch-article {
max-width: 1200px;
}
.pytorch-breadcrumbs-wrapper {
width: 1200px;
}
.pytorch-right-menu.scrolling-fixed {
position: fixed;
top: 45px;
left: 1580px;
}
}
article.pytorch-article section code {
padding: .2em .4em;
background-color: #f3f4f7;
border-radius: 5px;
}
/* Disable the change in tables */
article.pytorch-article section table code {
padding: unset;
background-color: unset;
border-radius: unset;
}
table.autosummary td {
width: 50%
}
img.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
article.pytorch-article p.rubric {
font-weight: bold;
}
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 27.3.1, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<svg version="1.1" id="图层_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
viewBox="0 0 210 36" style="enable-background:new 0 0 210 36;" xml:space="preserve">
<style type="text/css">
.st0{fill:#5878B4;}
.st1{fill:#36569B;}
.st2{fill:#1B3882;}
</style>
<g id="_x33_">
<g>
<path class="st0" d="M16.5,22.6l-6.4,3.1l5.3-0.2L16.5,22.6z M12.3,33.6l1.1-2.9l-5.3,0.2L12.3,33.6z M21.6,33.3l6.4-3.1l-5.3,0.2
L21.6,33.3z M25.8,22.4l-1.1,2.9l5.3-0.2L25.8,22.4z M31.5,26.2l-7.1,0.2l-1.7-1.1l1.5-4L22.2,20L19,21.5l-1.5,3.9l-2.7,1.3
l-7.1,0.2l-3.2,1.5l2.1,1.4l7.1-0.2l0,0l1.7,1.1l-1.5,4L16,36l3.2-1.5l1.5-3.9l0,0l2.6-1.2l0,0l7.2-0.2l3.2-1.5L31.5,26.2z
M20.2,28.7c-1,0.5-2.3,0.5-3,0.1c-0.6-0.4-0.4-1.2,0.6-1.6c1-0.5,2.3-0.5,3-0.1C21.5,27.5,21.2,28.2,20.2,28.7z"/>
</g>
</g>
<g id="_x32_">
<g>
<path class="st1" d="M33.5,19.8l-1.3-6.5l-1.5,1.9L33.5,19.8z M27.5,5.1l-4.2-2.7L26,7L27.5,5.1z M20.7,5.7l1.3,6.5l1.5-1.9
L20.7,5.7z M26.8,20.4l4.2,2.7l-2.7-4.6L26.8,20.4z M34,22.3l-3.6-6.2l0,0l-0.5-2.7l2-2.6l-0.6-3.2l-2.1-1.4l-2,2.6l-1.7-1.1
l-3.7-6.3L19.6,0l0.6,3.2l3.7,6.3l0,0l0.5,2.6l0,0l-2,2.6l0.6,3.2l2.1,1.4l1.9-2.5l1.7,1.1l3.7,6.3l2.1,1.4L34,22.3z M27.5,14.6
c-0.6-0.4-1.3-1.6-1.5-2.6c-0.2-1,0.2-1.5,0.8-1.1c0.6,0.4,1.3,1.6,1.5,2.6C28.5,14.6,28.1,15.1,27.5,14.6z"/>
</g>
</g>
<g id="_x31_">
<g>
<path class="st2" d="M12,2.8L5.6,5.9l3.8,1.7L12,2.8z M1.1,14.4l1.3,6.5l2.6-4.8L1.1,14.4z M9.1,24l6.4-3.1l-3.8-1.7L9.1,24z
M20,12.4l-1.3-6.5l-2.6,4.8L20,12.4z M20.4,14.9l-5.1-2.3l0,0l-0.5-2.7l3.5-6.5l-0.6-3.2l-3.2,1.5L11,8.1L8.3,9.4l0,0L3.2,7.1
L0,8.6l0.6,3.2l5.2,2.3l0.5,2.7v0l-3.5,6.6l0.6,3.2l3.2-1.5l3.5-6.5l2.6-1.2l0,0l5.2,2.4l3.2-1.5L20.4,14.9z M10.9,15.2
c-1,0.5-1.9,0-2.1-1c-0.2-1,0.4-2.2,1.4-2.7c1-0.5,1.9,0,2.1,1C12.5,13.5,11.9,14.7,10.9,15.2z"/>
</g>
</g>
<path id="字" class="st2" d="M49.5,26.5c-2.5,0-4.4-0.7-5.7-2c-1.8-1.6-2.6-4-2.6-7.1c0-3.2,0.9-5.5,2.6-7.1c1.3-1.3,3.2-2,5.7-2
c2.5,0,4.4,0.7,5.7,2c1.7,1.6,2.6,4,2.6,7.1c0,3.1-0.9,5.5-2.6,7.1C53.8,25.8,51.9,26.5,49.5,26.5z M52.9,21.8
c0.8-1.1,1.3-2.6,1.3-4.5c0-1.9-0.4-3.4-1.3-4.5c-0.8-1.1-2-1.6-3.4-1.6c-1.4,0-2.6,0.5-3.4,1.6c-0.9,1.1-1.3,2.6-1.3,4.5
c0,1.9,0.4,3.4,1.3,4.5c0.9,1.1,2,1.6,3.4,1.6C50.9,23.4,52,22.9,52.9,21.8z M70.9,14.6c1,1.1,1.5,2.7,1.5,4.9c0,2.2-0.5,4-1.5,5.1
c-1,1.2-2.3,1.8-3.9,1.8c-1,0-1.9-0.3-2.5-0.8c-0.4-0.3-0.7-0.7-1.1-1.2V31h-3.3V13.2h3.2v1.9c0.4-0.6,0.7-1,1.1-1.3
c0.7-0.6,1.6-0.9,2.6-0.9C68.6,12.9,69.9,13.5,70.9,14.6z M69,19.6c0-1-0.2-1.9-0.7-2.6c-0.4-0.8-1.2-1.1-2.2-1.1
c-1.2,0-2,0.6-2.5,1.7c-0.2,0.6-0.4,1.4-0.4,2.3c0,1.5,0.4,2.5,1.2,3.1c0.5,0.4,1,0.5,1.7,0.5c0.9,0,1.6-0.4,2.1-1.1
C68.8,21.8,69,20.8,69,19.6z M85.8,22.2c-0.1,0.8-0.5,1.5-1.2,2.3c-1.1,1.2-2.6,1.9-4.6,1.9c-1.6,0-3.1-0.5-4.3-1.6
c-1.2-1-1.9-2.8-1.9-5.1c0-2.2,0.6-3.9,1.7-5.1c1.1-1.2,2.6-1.8,4.4-1.8c1.1,0,2,0.2,2.9,0.6c0.9,0.4,1.6,1,2.1,1.9
c0.5,0.8,0.8,1.6,1,2.6c0.1,0.6,0.1,1.4,0.1,2.5h-8.7c0,1.3,0.4,2.2,1.2,2.7c0.5,0.3,1,0.5,1.7,0.5c0.7,0,1.2-0.2,1.7-0.6
c0.2-0.2,0.4-0.5,0.6-0.9H85.8z M82.5,18.3c-0.1-0.9-0.3-1.6-0.8-2c-0.5-0.5-1.1-0.7-1.8-0.7c-0.8,0-1.4,0.2-1.8,0.7
c-0.4,0.5-0.7,1.1-0.8,2H82.5z M94.3,15.7c-1.1,0-1.9,0.5-2.3,1.4c-0.2,0.5-0.3,1.2-0.3,1.9V26h-3.3V13.2h3.2v1.9
c0.4-0.7,0.8-1.1,1.2-1.4c0.7-0.5,1.6-0.8,2.6-0.8c1.3,0,2.4,0.3,3.2,1c0.8,0.7,1.3,1.8,1.3,3.4V26h-3.4v-7.8c0-0.7-0.1-1.2-0.3-1.5
C95.8,16,95.2,15.7,94.3,15.7z M115.4,24.7c-1.3,1.2-2.9,1.8-4.9,1.8c-2.5,0-4.4-0.8-5.9-2.4c-1.4-1.6-2.1-3.8-2.1-6.6
c0-3,0.8-5.3,2.4-7c1.4-1.4,3.2-2.1,5.4-2.1c2.9,0,5,1,6.4,2.9c0.7,1.1,1.1,2.1,1.2,3.2h-3.6c-0.2-0.8-0.5-1.5-0.9-1.9
c-0.7-0.8-1.6-1.1-2.9-1.1c-1.3,0-2.3,0.5-3.1,1.6c-0.8,1.1-1.1,2.6-1.1,4.5s0.4,3.4,1.2,4.4c0.8,1,1.8,1.4,3.1,1.4
c1.3,0,2.2-0.4,2.9-1.2c0.4-0.4,0.7-1.1,0.9-2h3.6C117.5,22,116.7,23.5,115.4,24.7z M130.9,14.8c1.1,1.4,1.6,2.9,1.6,4.8
c0,1.9-0.5,3.5-1.6,4.8c-1.1,1.3-2.7,2-4.9,2c-2.2,0-3.8-0.7-4.9-2c-1.1-1.3-1.6-2.9-1.6-4.8c0-1.8,0.5-3.4,1.6-4.8
c1.1-1.4,2.7-2,4.9-2C128.2,12.8,129.9,13.5,130.9,14.8z M126,15.6c-1,0-1.7,0.3-2.3,1c-0.5,0.7-0.8,1.7-0.8,3c0,1.3,0.3,2.3,0.8,3
c0.5,0.7,1.3,1,2.3,1c1,0,1.7-0.3,2.3-1c0.5-0.7,0.8-1.7,0.8-3c0-1.3-0.3-2.3-0.8-3C127.7,16,127,15.6,126,15.6z M142.1,16.7
c-0.3-0.6-0.8-0.9-1.7-0.9c-1,0-1.6,0.3-1.9,0.9c-0.2,0.4-0.3,0.9-0.3,1.6V26h-3.4V13.2h3.2v1.9c0.4-0.7,0.8-1.1,1.2-1.4
c0.6-0.5,1.5-0.8,2.5-0.8c1,0,1.8,0.2,2.4,0.6c0.5,0.4,0.9,0.9,1.1,1.5c0.4-0.8,1-1.3,1.6-1.7c0.7-0.4,1.5-0.5,2.3-0.5
c0.6,0,1.1,0.1,1.7,0.3c0.5,0.2,1,0.6,1.5,1.1c0.4,0.4,0.6,1,0.7,1.6c0.1,0.4,0.1,1.1,0.1,1.9l0,8.1h-3.4v-8.1
c0-0.5-0.1-0.9-0.2-1.2c-0.3-0.6-0.8-0.9-1.6-0.9c-0.9,0-1.6,0.4-1.9,1.1c-0.2,0.4-0.3,0.9-0.3,1.5V26h-3.4v-7.6
C142.4,17.6,142.3,17.1,142.1,16.7z M167,14.6c1,1.1,1.5,2.7,1.5,4.9c0,2.2-0.5,4-1.5,5.1c-1,1.2-2.3,1.8-3.9,1.8
c-1,0-1.9-0.3-2.5-0.8c-0.4-0.3-0.7-0.7-1.1-1.2V31h-3.3V13.2h3.2v1.9c0.4-0.6,0.7-1,1.1-1.3c0.7-0.6,1.6-0.9,2.6-0.9
C164.7,12.9,166,13.5,167,14.6z M165.1,19.6c0-1-0.2-1.9-0.7-2.6c-0.4-0.8-1.2-1.1-2.2-1.1c-1.2,0-2,0.6-2.5,1.7
c-0.2,0.6-0.4,1.4-0.4,2.3c0,1.5,0.4,2.5,1.2,3.1c0.5,0.4,1,0.5,1.7,0.5c0.9,0,1.6-0.4,2.1-1.1C164.9,21.8,165.1,20.8,165.1,19.6z
M171.5,14.6c0.9-1.1,2.4-1.7,4.5-1.7c1.4,0,2.6,0.3,3.7,0.8c1.1,0.6,1.6,1.6,1.6,3.1v5.9c0,0.4,0,0.9,0,1.5c0,0.4,0.1,0.7,0.2,0.9
c0.1,0.2,0.3,0.3,0.5,0.4V26h-3.6c-0.1-0.3-0.2-0.5-0.2-0.7c0-0.2-0.1-0.5-0.1-0.8c-0.5,0.5-1,0.9-1.6,1.3c-0.7,0.4-1.5,0.6-2.4,0.6
c-1.2,0-2.1-0.3-2.9-1c-0.8-0.7-1.1-1.6-1.1-2.8c0-1.6,0.6-2.7,1.8-3.4c0.7-0.4,1.6-0.7,2.9-0.8l1.1-0.1c0.6-0.1,1.1-0.2,1.3-0.3
c0.5-0.2,0.7-0.5,0.7-0.9c0-0.5-0.2-0.9-0.6-1.1c-0.4-0.2-0.9-0.3-1.6-0.3c-0.8,0-1.3,0.2-1.7,0.6c-0.2,0.3-0.4,0.7-0.5,1.2h-3.2
C170.6,16.2,170.9,15.3,171.5,14.6z M173.9,23.6c0.3,0.3,0.7,0.4,1.1,0.4c0.7,0,1.4-0.2,2-0.6c0.6-0.4,0.9-1.2,0.9-2.3v-1.2
c-0.2,0.1-0.4,0.2-0.6,0.3c-0.2,0.1-0.5,0.2-0.9,0.2l-0.8,0.1c-0.7,0.1-1.2,0.3-1.5,0.5c-0.5,0.3-0.8,0.8-0.8,1.4
C173.5,22.9,173.6,23.3,173.9,23.6z M193.1,13.8c1,0.6,1.6,1.7,1.7,3.3h-3.3c0-0.4-0.2-0.8-0.4-1c-0.4-0.5-1-0.7-1.9-0.7
c-0.7,0-1.2,0.1-1.6,0.3c-0.3,0.2-0.5,0.5-0.5,0.8c0,0.4,0.2,0.7,0.5,0.8c0.3,0.2,1.5,0.5,3.5,0.9c1.3,0.3,2.3,0.8,3,1.4
c0.7,0.6,1,1.4,1,2.4c0,1.3-0.5,2.3-1.4,3.1c-0.9,0.8-2.4,1.2-4.4,1.2c-2,0-3.5-0.4-4.5-1.3c-1-0.9-1.4-1.9-1.4-3.2h3.4
c0.1,0.6,0.2,1,0.5,1.3c0.4,0.4,1.2,0.7,2.3,0.7c0.7,0,1.2-0.1,1.6-0.3c0.4-0.2,0.6-0.5,0.6-0.9c0-0.4-0.2-0.7-0.5-0.9
c-0.3-0.2-1.5-0.5-3.5-1c-1.4-0.4-2.5-0.8-3.1-1.3c-0.6-0.5-0.9-1.3-0.9-2.3c0-1.2,0.5-2.2,1.4-3c0.9-0.9,2.2-1.3,3.9-1.3
C190.8,12.9,192.1,13.2,193.1,13.8z M206.5,13.8c1,0.6,1.6,1.7,1.7,3.3h-3.3c0-0.4-0.2-0.8-0.4-1c-0.4-0.5-1-0.7-1.9-0.7
c-0.7,0-1.2,0.1-1.6,0.3c-0.3,0.2-0.5,0.5-0.5,0.8c0,0.4,0.2,0.7,0.5,0.8c0.3,0.2,1.5,0.5,3.5,0.9c1.3,0.3,2.3,0.8,3,1.4
c0.7,0.6,1,1.4,1,2.4c0,1.3-0.5,2.3-1.4,3.1c-0.9,0.8-2.4,1.2-4.4,1.2c-2,0-3.5-0.4-4.5-1.3c-1-0.9-1.4-1.9-1.4-3.2h3.4
c0.1,0.6,0.2,1,0.5,1.3c0.4,0.4,1.2,0.7,2.3,0.7c0.7,0,1.2-0.1,1.6-0.3c0.4-0.2,0.6-0.5,0.6-0.9c0-0.4-0.2-0.7-0.5-0.9
c-0.3-0.2-1.5-0.5-3.5-1c-1.4-0.4-2.5-0.8-3.1-1.3c-0.6-0.5-0.9-1.3-0.9-2.3c0-1.2,0.5-2.2,1.4-3c0.9-0.9,2.2-1.3,3.9-1.3
C204.2,12.9,205.5,13.2,206.5,13.8z"/>
</svg>
<?xml version="1.0" encoding="UTF-8"?>
<svg id="_图层_2" data-name="图层 2" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 34.59 36">
<defs>
<style>
.cls-1 {
fill: #36569b;
}
.cls-2 {
fill: #1b3882;
}
.cls-3 {
fill: #5878b4;
}
</style>
</defs>
<g id="_图层_1-2" data-name="图层 1">
<g>
<g id="_3" data-name="3">
<path class="cls-3" d="m16.53,22.65l-6.37,3.07,5.27-.16,1.1-2.91Zm-4.19,10.95l1.12-2.91-5.27.17,4.15,2.74Zm9.3-.29l6.37-3.07-5.27.16-1.1,2.91Zm4.19-10.95l-1.12,2.91,5.27-.17-4.15-2.74Zm5.72,3.81l-7.08.23-1.73-1.14,1.5-3.95-2.06-1.36-3.16,1.53-1.48,3.89-2.67,1.29-7.14.23-3.16,1.53,2.07,1.36,7.13-.23h0s1.69,1.11,1.69,1.11l-1.51,3.98,2.06,1.36,3.16-1.53,1.5-3.95h0s2.56-1.24,2.56-1.24h0s7.23-.24,7.23-.24l3.16-1.53-2.06-1.36Zm-11.29,2.56c-.99.48-2.31.52-2.96.1-.65-.42-.37-1.15.62-1.63.99-.48,2.31-.52,2.96-.1.65.42.37,1.15-.62,1.63Z"/>
</g>
<g id="_2" data-name="2">
<path class="cls-1" d="m33.5,19.84l-1.26-6.51-1.46,1.88,2.72,4.63Zm-6.05-14.69l-4.16-2.74,2.71,4.64,1.45-1.89Zm-6.73.58l1.26,6.51,1.46-1.88-2.72-4.63Zm6.05,14.69l4.16,2.74-2.71-4.64-1.45,1.89Zm7.19,1.91l-3.63-6.2h0s-.53-2.74-.53-2.74l1.96-2.56-.63-3.23-2.07-1.36-1.96,2.56-1.69-1.11-3.71-6.33-2.07-1.36.63,3.23,3.68,6.28h0s.51,2.62.51,2.62h0s-1.99,2.6-1.99,2.6l.63,3.23,2.06,1.36,1.95-2.54,1.73,1.14,3.69,6.29,2.07,1.36-.63-3.23Zm-6.47-7.7c-.65-.42-1.33-1.59-1.52-2.6-.2-1.01.17-1.49.81-1.06.65.42,1.33,1.59,1.52,2.6.2,1.01-.17,1.49-.81,1.06Z"/>
</g>
<g id="_1" data-name="1">
<path class="cls-2" d="m11.96,2.82l-6.37,3.07,3.81,1.74,2.55-4.81ZM1.07,14.37l1.26,6.53,2.56-4.8-3.82-1.73Zm7.99,9.59l6.37-3.07-3.81-1.74-2.55,4.81Zm10.89-11.55l-1.26-6.53-2.56,4.8,3.82,1.73Zm.45,2.53l-5.13-2.32h0s-.53-2.71-.53-2.71l3.47-6.53-.63-3.24-3.16,1.53-3.42,6.43-2.67,1.29h0s-5.17-2.34-5.17-2.34l-3.16,1.53.63,3.24,5.17,2.33.51,2.65h0s-3.49,6.57-3.49,6.57l.63,3.24,3.16-1.53,3.46-6.52,2.56-1.24h0s5.24,2.37,5.24,2.37l3.16-1.53-.63-3.24Zm-9.52.24c-.99.48-1.95.04-2.14-.97-.2-1.01.44-2.22,1.43-2.69.99-.48,1.95-.04,2.14.97.2,1.01-.44,2.22-1.43,2.7Z"/>
</g>
</g>
</g>
</svg>
var collapsedSections = ['数据集统计'];
$(document).ready(function () {
$('.dataset').DataTable({
"stateSave": false,
"lengthChange": false,
"pageLength": 20,
"order": [],
"language": {
"info": "显示 _START_ 至 _END_ 条目(总计 _TOTAL_ )",
"infoFiltered": "(筛选自 _MAX_ 条目)",
"search": "搜索:",
"zeroRecords": "没有找到任何条目",
"paginate": {
"next": "下一页",
"previous": "上一页"
},
}
});
});
{% extends "layout.html" %}
{% block body %}
<h1>Page Not Found</h1>
<p>
The page you are looking for cannot be found.
</p>
<p>
If you just switched documentation versions, it is likely that the page you were on is moved. You can look for it in
the content table left, or go to <a href="{{ pathto(root_doc) }}">the homepage</a>.
</p>
<!-- <p>
If you cannot find documentation you want, please <a
href="">open an issue</a> to tell us!
</p> -->
{% endblock %}