# General Math Evaluation Guidance
## Introduction
Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.
## Dataset Format
The math evaluation dataset should be in either JSON Lines (.jsonl) or CSV format. Each problem should contain at least:
- A problem statement
- A solution/answer (typically in LaTeX format with the final answer in \\boxed{})
Example JSONL format:
```json
{"problem": "Find the value of x if 2x + 3 = 7", "solution": "Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"}
```
Example CSV format:
```csv
problem,solution
"Find the value of x if 2x + 3 = 7","Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"
```
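If you are assembling such a dataset yourself, the following sketch (with illustrative problems and a hypothetical output file name) writes records in the expected JSONL format:
```python
import json

# Hypothetical examples; each record needs a problem statement and a solution
# whose final answer is wrapped in \boxed{}.
problems = [
    {
        "problem": "Find the value of x if 2x + 3 = 7",
        "solution": "2x + 3 = 7\n2x = 4\nx = 2\nTherefore, \\boxed{2}",
    },
    {
        "problem": "Compute 3/4 + 1/4",
        "solution": "3/4 + 1/4 = 4/4 = 1\nTherefore, \\boxed{1}",
    },
]

with open("my_math_dataset.jsonl", "w", encoding="utf-8") as f:
    for item in problems:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```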
## Configuration
To evaluate mathematical reasoning, you'll need to set up three main components:
1. Dataset Reader Configuration
```python
math_reader_cfg = dict(
    input_columns=['problem'],  # Column name for the question
    output_column='solution'    # Column name for the answer
)
```
2. Inference Configuration
```python
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)
```
3. Evaluation Configuration
```python
math_eval_cfg = dict(
    evaluator=dict(type=MATHVerifyEvaluator),
)
```
## Using CustomDataset
Here's how to set up a complete configuration for math evaluation:
```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',       # Dataset abbreviation
        path='path/to/your/dataset',  # Path to your dataset file
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]
```
## MATHVerifyEvaluator
The MATHVerifyEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
The MATHVerifyEvaluator:
1. Extracts answers from both predictions and references using LaTeX extraction
2. Handles various LaTeX formats and environments
3. Verifies mathematical equivalence between predicted and reference answers
4. Provides detailed evaluation results including:
- Accuracy score
- Detailed comparison between predictions and references
- Parse results of both predicted and reference answers
The evaluator supports:
- Basic arithmetic operations
- Fractions and decimals
- Algebraic expressions
- Trigonometric functions
- Roots and exponents
- Mathematical symbols and operators
Example evaluation output:
```python
{
    'accuracy': 85.0,  # Percentage of correct answers
    'details': [
        {
            'predictions': 'x = 2',  # Parsed prediction
            'references': 'x = 2',   # Parsed reference
            'correct': True          # Whether they match
        },
        # ... more results
    ]
}
```
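If you want to check how the underlying equivalence verification behaves on individual answers, the `math_verify` library that the evaluator builds on exposes `parse` and `verify` helpers. A minimal sketch (assuming the `math_verify` package is installed; exact parsing behavior depends on its configuration):
```python
# Standalone check with the math_verify library that MATHVerifyEvaluator builds on.
from math_verify import parse, verify

gold = parse("\\boxed{\\frac{1}{2}}")  # reference answer
pred = parse("The answer is 0.5")       # model output

print(verify(gold, pred))  # expected True: 1/2 and 0.5 are treated as equivalent
```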
## Complete Example
Here's a complete example of how to set up math evaluation:
```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHVerifyEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
# Dataset reader configuration
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
# Inference configuration
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
math_eval_cfg = dict(
evaluator=dict(type=MATHVerifyEvaluator),
)
# Dataset configuration
math_datasets = [
dict(
type=CustomDataset,
abbr='my-math-dataset',
path='path/to/your/dataset.jsonl', # or .csv
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=math_eval_cfg,
)
]
# Model configuration
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='your-model-name',
path='your/model/path',
# ... other model configurations
)
]
# Output directory
work_dir = './outputs/math_eval'
```
# Needle In A Haystack Evaluation
## Introduction to the Needle In A Haystack Test
The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method where key information is randomly inserted into long texts to form the prompt for large language models (LLMs). This test aims to assess whether LLMs can extract critical information from long texts, thereby evaluating their fundamental ability to comprehend and process long-context documents.
## Task Overview
Within the `OpenCompass` framework, under `NeedleBench`, we designed a series of progressively challenging evaluation tasks to comprehensively assess LLMs' long-text information extraction and reasoning capabilities. For a complete description, please refer to our [technical report](https://arxiv.org/abs/2407.11963).
- **Single-Needle Retrieval Task (S-RT)**: Evaluates the LLM's ability to retrieve a single piece of key information from a long text, testing precise recall of specific details within extensive narratives. This corresponds to the **original Needle In A Haystack test** setup.
- **Multi-Needle Retrieval Task (M-RT)**: Explores the LLM's ability to retrieve multiple relevant pieces of information from long texts, simulating complex queries over comprehensive documents.
- **Multi-Needle Reasoning Task (M-RS)**: Assesses LLMs' abilities to integrate multiple key pieces of information extracted from long texts for reasoning, requiring a comprehensive understanding of content.
- **Ancestral Trace Challenge (ATC)**: Tests LLMs' capabilities in handling multi-layer logical challenges within realistic long-text contexts through "kinship trace needles." In the ATC task, no irrelevant (haystack) texts are added; every piece of text is critical, and models must reason through all details for accurate answers.
> **Note:** NeedleBench (v2) includes several optimizations and adjustments in dataset construction and task details. For a detailed comparison between the old and new versions, as well as a summary of updates, please refer to [opencompass/configs/datasets/needlebench_v2/readme.md](https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/needlebench_v2/readme.md).
## Evaluation Steps
> Note: In the latest `OpenCompass` codebase, the NeedleBench dataset is automatically loaded from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench), with no need for manual download or configuration.
### `OpenCompass` Environment Setup
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```
### Dataset Configuration
We have pre-configured various long-context settings (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench_v2`, and you can flexibly define your parameters by adjusting the configuration files.
### Evaluation Example
#### Evaluating the `Qwen2.5-7B` Model Deployed with `vLLM`
To evaluate the `Qwen2.5-7B` model deployed with `vLLM` on all tasks under NeedleBench-128K, use the following commands. They leverage pre-defined model and dataset configuration files, so no additional configuration is needed:
##### Local Evaluation
If evaluating locally, the command will use all available GPUs. You can control GPU visibility using `CUDA_VISIBLE_DEVICES`:
```bash
# Local evaluation
python run.py --datasets needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer
```
##### Evaluation on Slurm Cluster
For Slurm environments, you can add options like `--slurm -p partition_name -q reserved --max-num-workers 16`:
```bash
# Slurm evaluation
python run.py --datasets needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating Specific Subsets
If you only want to test the original Needle In A Haystack task (e.g., single-needle 128k), adjust the dataset parameter:
```bash
python run.py --datasets needlebench_v2_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
To evaluate only Chinese versions, specify the subset dataset after `/`:
```bash
python run.py --datasets needlebench_v2_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
Ensure `vLLM` is installed beforehand:
```bash
# Install vLLM with CUDA 12.4.
# For other CUDA versions, please refer to the official documentation:
# https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html
pip install vllm
```
#### Evaluating Other `Huggingface` Models
For other models, it is recommended to write your own config file (such as `examples/eval_needlebench_v2.py`) to adjust `max_seq_len` and `max_out_len`, so that the model can process the full context.
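A hedged sketch of what such a config might look like is shown below; the dataset and model import paths are assumptions, so substitute the actual config modules under `opencompass/configs` from your installation:
```python
from mmengine.config import read_base

with read_base():
    # Hypothetical import paths -- replace with the actual NeedleBench and
    # model config modules shipped under opencompass/configs.
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_128k import \
        needlebench_datasets
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models

datasets = needlebench_datasets

# Make sure the model can consume the full 128K context and produce long answers.
for model in models:
    model['max_seq_len'] = 132 * 1024
    model['max_out_len'] = 4096

work_dir = './outputs/needlebench_v2_128k'
```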
You can then run evaluation with:
```bash
python run.py examples/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```
No need to manually specify `--datasets`, `--models`, or `--summarizer` again.
### Visualization
NeedleBench's latest version has built-in visualization integrated into the summarizer. You can find corresponding visualizations in the `plots` directory under the output folder without needing additional scripts.
### Citation
If you use NeedleBench, please cite us:
```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
year={2025},
eprint={2407.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.11963},
}
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished={\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{LLMTest_NeedleInAHaystack,
title={LLMTest Needle In A Haystack - Pressure Testing LLMs},
author={gkamradt},
year={2023},
howpublished={\url{https://github.com/gkamradt/LLMTest_NeedleInAHaystack}}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei L\"u and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
# Add a dataset
Although OpenCompass already includes most commonly used datasets, you need to follow the steps below if you want to support a new one:
1. Add a dataset script `mydataset.py` to the `opencompass/datasets` folder. This script should include:
- The dataset and its loading method. Define a `MyDataset` class that implements the data loading method `load` as a static method. This method should return data of type `datasets.Dataset`. We use the Hugging Face dataset as the unified interface for datasets to avoid introducing additional logic. Here's an example:
```python
import datasets

from .base import BaseDataset


class MyDataset(BaseDataset):

    @staticmethod
    def load(**kwargs) -> datasets.Dataset:
        pass
```
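For reference, here is a hedged sketch of what a concrete `load` might look like for a local JSONL file (the file name `test.jsonl` and the use of the `json` loader are assumptions for illustration):
```python
# Illustrative only: load a local JSONL file into a datasets.Dataset.
import os

import datasets

from .base import BaseDataset


class MyDataset(BaseDataset):

    @staticmethod
    def load(path: str, **kwargs) -> datasets.Dataset:
        # Assumed layout: <path>/test.jsonl with one JSON record per line.
        data_file = os.path.join(path, 'test.jsonl')
        dataset = datasets.load_dataset('json', data_files=data_file, split='train')
        return dataset
```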
- (Optional) If the existing evaluators in OpenCompass do not meet your needs, you need to define a `MyDatasetEvaluator` class that implements the scoring method `score`. This method should take `predictions` and `references` as input and return the desired dictionary. Since a dataset may have multiple metrics, the method should return a dictionary containing the metrics and their corresponding scores. Here's an example:
```python
from typing import List

from opencompass.openicl.icl_evaluator import BaseEvaluator


class MyDatasetEvaluator(BaseEvaluator):

    def score(self, predictions: List, references: List) -> dict:
        pass
```
- (Optional) If the existing postprocessors in OpenCompass do not meet your needs, you need to define the `mydataset_postprocess` method. This method takes an input string and returns the corresponding postprocessed result string. Here's an example:
```python
def mydataset_postprocess(text: str) -> str:
    pass
```
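For illustration, here is a hedged sketch of a concrete postprocessor that pulls the final `\boxed{}` answer out of a model response (the extraction pattern is an assumption for this example, not an OpenCompass built-in):
```python
import re


def mydataset_postprocess(text: str) -> str:
    # Illustrative: return the content of the last \boxed{...} if present,
    # otherwise fall back to the last non-empty line of the response.
    matches = re.findall(r'\\boxed\{([^{}]*)\}', text)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ''
```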
2. After defining the dataset loading, data postprocessing, and evaluator methods, you need to add the following configurations to the configuration file:
```python
from opencompass.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess
mydataset_eval_cfg = dict(
    evaluator=dict(type=MyDatasetEvaluator),
    pred_postprocessor=dict(type=mydataset_postprocess))

mydataset_datasets = [
    dict(
        type=MyDataset,
        ...,
        reader_cfg=...,
        infer_cfg=...,
        eval_cfg=mydataset_eval_cfg)
]
```
- To make your dataset accessible to other users, you need to specify the download channels for it in the configuration file. Specifically, first fill in a dataset name of your own choosing in the `path` field of the `mydataset_datasets` configuration; this name will be mapped to the actual download path in the `opencompass/utils/datasets_info.py` file. Here's an example:
```python
mmlu_datasets = [
    dict(
        ...,
        path='opencompass/mmlu',
        ...,
    )
]
```
- Next, you need to create a dictionary key in `opencompass/utils/datasets_info.py` with the same name as the one you provided above. If you have already hosted the dataset on HuggingFace or Modelscope, please add a dictionary key to the `DATASETS_MAPPING` dictionary and fill in the HuggingFace or Modelscope dataset address in the `hf_id` or `ms_id` key, respectively. You can also specify a default local address. Here's an example:
```python
"opencompass/mmlu": {
    "ms_id": "opencompass/mmlu",
    "hf_id": "opencompass/mmlu",
    "local": "./data/mmlu/",
}
```
- If you wish for the provided dataset to be directly accessible from the OpenCompass OSS repository when used by others, you need to submit the dataset files in the Pull Request phase. We will then transfer the dataset to the OSS on your behalf and create a new dictionary key in the `DATASET_URL`.
- To let users choose among data sources, you need to extend the `load` method in the dataset script `mydataset.py`. Specifically, implement switching among different download sources based on the environment variable `DATASET_SOURCE`. Note that if `DATASET_SOURCE` is not set, the dataset defaults to being downloaded from the OSS repository. Here's an example from `opencompass/datasets/cmmlu.py` (a fuller illustrative sketch follows the excerpt):
```python
def load(path: str, name: str, **kwargs):
    ...
    if environ.get('DATASET_SOURCE') == 'ModelScope':
        ...
    else:
        ...
    return dataset
```
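Below is a hedged, slightly fuller sketch of such source switching (the `'HF'` branch and the local JSONL fallback are illustrative assumptions; check `datasets_info.py` and the existing dataset scripts for the exact conventions used in OpenCompass):
```python
import os

from datasets import Dataset, load_dataset

from .base import BaseDataset


class MyDataset(BaseDataset):

    @staticmethod
    def load(path: str, **kwargs) -> Dataset:
        source = os.environ.get('DATASET_SOURCE')
        if source == 'ModelScope':
            from modelscope import MsDataset
            dataset = MsDataset.load(path, **kwargs)
        elif source == 'HF':
            # Illustrative: download directly from the Hugging Face Hub.
            dataset = load_dataset(path, **kwargs)
        else:
            # Default: read the local copy downloaded from the OSS repository.
            dataset = load_dataset('json',
                                   data_files=os.path.join(path, 'test.jsonl'),
                                   split='train')
        return dataset
```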
3. After completing the dataset script and config file, you need to register the information of your new dataset in the file `dataset-index.yml` at the main directory, so that it can be added to the dataset statistics list on the OpenCompass website.
- The keys that need to be filled in include `name`: the name of your dataset, `category`: the category of your dataset, `paper`: the URL of the paper or project, and `configpath`: the path to the dataset config file. Here's an example:
```
- mydataset:
    name: MyDataset
    category: Understanding
    paper: https://arxiv.org/pdf/xxxxxxx
    configpath: opencompass/configs/datasets/MyDataset
```
Detailed dataset configuration files and other required configuration files can be referred to in the [Configuration Files](../user_guides/config.md) tutorial. For guides on launching tasks, please refer to the [Quick Start](../get_started/quick_start.md) tutorial.
# Add a Model
Currently, we support HF models, some model APIs, and some third-party models.
## Adding API Models
To add a new API-based model, create a new file named `mymodel_api.py` under the `opencompass/models` directory. In this file, inherit from `BaseAPIModel` and implement the `generate` method for inference and the `get_token_len` method to calculate the token length. Once the model is defined, you can modify the corresponding configuration file.
```python
from typing import Dict, List, Optional

from ..base_api import BaseAPIModel


class MyModelAPI(BaseAPIModel):

    is_api: bool = True

    def __init__(self,
                 path: str,
                 max_seq_len: int = 2048,
                 query_per_second: int = 1,
                 retry: int = 2,
                 meta_template: Optional[Dict] = None,
                 **kwargs):
        super().__init__(path=path,
                         max_seq_len=max_seq_len,
                         meta_template=meta_template,
                         query_per_second=query_per_second,
                         retry=retry)
        ...

    def generate(
        self,
        inputs,
        max_out_len: int = 512,
        temperature: float = 0.7,
    ) -> List[str]:
        """Generate results given a list of inputs."""
        pass

    def get_token_len(self, prompt: str) -> int:
        """Get lengths of the tokenized string."""
        pass
```
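Once the class is implemented and exported (e.g. from `opencompass/models/__init__.py`), it can be referenced from a config like any other model. The entry below is a hedged sketch; the `abbr`, `path`, and batching values are illustrative:
```python
# Hypothetical config entry for the API model defined above.
from opencompass.models import MyModelAPI  # assumes MyModelAPI is exported there

models = [
    dict(
        type=MyModelAPI,
        abbr='my-model-api',
        path='my-model-name',   # identifier passed to your API backend
        max_seq_len=2048,
        query_per_second=1,
        retry=2,
        batch_size=8,
    ),
]
```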
## Adding Third-Party Models
To add a new third-party model, create a new file named `mymodel.py` under the `opencompass/models` directory. In this file, inherit from `BaseModel` and implement the `generate` method for generative inference, the `get_ppl` method for discriminative inference, and the `get_token_len` method to calculate the token length. Once the model is defined, you can modify the corresponding configuration file.
```python
from typing import Dict, List, Optional

from ..base import BaseModel


class MyModel(BaseModel):

    def __init__(self,
                 pkg_root: str,
                 ckpt_path: str,
                 tokenizer_only: bool = False,
                 meta_template: Optional[Dict] = None,
                 **kwargs):
        ...

    def get_token_len(self, prompt: str) -> int:
        """Get lengths of the tokenized strings."""
        pass

    def generate(self, inputs: List[str], max_out_len: int) -> List[str]:
        """Generate results given a list of inputs."""
        pass

    def get_ppl(self,
                inputs: List[str],
                mask_length: Optional[List[int]] = None) -> List[float]:
        """Get perplexity scores given a list of inputs."""
        pass
```
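As with the API case, the new class is then referenced from a model config. A hedged sketch (the paths and GPU counts are illustrative):
```python
# Hypothetical config entry for the third-party model defined above.
from opencompass.models import MyModel  # assumes MyModel is exported there

models = [
    dict(
        type=MyModel,
        abbr='my-model',
        pkg_root='/path/to/third_party_pkg',  # illustrative paths
        ckpt_path='/path/to/checkpoint',
        max_out_len=512,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    ),
]
```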
# Using Large Models as JudgeLLM for Objective Evaluation
## Introduction
Traditional objective evaluations often rely on comparison against standard answers. In practice, however, a model's predictions may vary because of differences in instruction-following ability or imperfections in post-processing functions, so answers can be extracted incorrectly and compared against the standard answers in a misleading way, yielding inaccurate evaluation results. To address this, we adopt a process similar to subjective evaluation and introduce a JudgeLLM after prediction to assess the consistency between model responses and standard answers ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
Currently, all models supported by the OpenCompass repository can be used directly as JudgeLLMs. Additionally, we are planning to support dedicated JudgeLLMs.
## Currently Supported Objective Evaluation Datasets
1. MATH ([https://github.com/hendrycks/math](https://github.com/hendrycks/math))
## Custom JudgeLLM Objective Dataset Evaluation
OpenCompass currently supports most datasets that use `GenInferencer` for inference. The specific process for custom JudgeLLM objective evaluation includes:
1. Building evaluation configurations using API models or open-source models for inference of question answers.
2. Employing a selected evaluation model (JudgeLLM) to assess the outputs of the model.
### Step One: Building Evaluation Configurations, Using MATH as an Example
Below is the Config for evaluating the MATH dataset with JudgeLLM, with the evaluation model being *Llama3-8b-instruct* and the JudgeLLM being *Llama3-70b-instruct*. For more detailed config settings, please refer to `examples/eval_math_llm_judge.py`. The following is a brief version of the annotations to help users understand the meaning of the configuration file.
```python
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
from mmengine.config import read_base

with read_base():
    from .models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model  # noqa: F401, F403
    from .models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model  # noqa: F401, F403
    from .datasets.math.math_llm_judge import math_datasets  # noqa: F401, F403

from opencompass.datasets import math_judement_preprocess
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import AllObjSummarizer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate

# ------------- Prompt Settings ----------------------------------------
# Evaluation template. Modify it as needed; the JudgeLLM typically responds
# with [Yes] or [No]. For the MATH dataset, the template is as follows:
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
[Yes]
Expression 1: 3/2
Expression 2: 1.5
[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
[Yes]
Expression 1: 3245/5
Expression 2: 649
[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2:
[No]
(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
Expression 2: {prediction}
"""

# ------------- Inference Phase ----------------------------------------
# Models to be evaluated
models = [*hf_llama3_8b_instruct_model]
# Judge models
judge_models = hf_llama3_70b_instruct_model

eng_datasets = [*math_datasets]
chn_datasets = []
datasets = eng_datasets + chn_datasets

for d in eng_datasets:
    d['eval_cfg'] = dict(
        evaluator=dict(
            type=LMEvaluator,
            # If you need to preprocess model predictions before judging,
            # you can specify a pred_postprocessor function here
            pred_postprocessor=dict(type=math_judement_preprocess),
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[
                    dict(
                        role='HUMAN',
                        prompt=eng_obj_prompt,
                    ),
                ]),
            ),
        ),
        pred_role='BOT',
    )

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=40000),
    runner=dict(
        type=LocalRunner,
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)

# ------------- Evaluation Configuration --------------------------------
eval = dict(
    partitioner=dict(
        type=SubjectiveSizePartitioner,
        max_task_size=80000,
        mode='singlescore',
        models=models,
        judge_models=judge_models,
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=16,
        task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=AllObjSummarizer)

# Output folder
work_dir = 'outputs/obj_all/'
```
### Step Two: Launch Evaluation and Output Results
```shell
python run.py eval_math_llm_judge.py
```
This will initiate two rounds of evaluation. The first round involves model inference to obtain predicted answers to questions, and the second round involves JudgeLLM evaluating the consistency between the predicted answers and the standard answers, and scoring them.
- The results of model predictions will be saved in `output/.../timestamp/predictions/xxmodel/xxx.json`
- The JudgeLLM's evaluation responses will be saved in `output/.../timestamp/results/xxmodel/xxx.json`
- The evaluation report will be output to `output/.../timestamp/summary/timestamp/xxx.csv`
## Results
Using Llama3-8b-instruct as the evaluated model and Llama3-70b-instruct as the JudgeLLM, the MATH dataset was assessed with the following results:
| Model | JudgeLLM Evaluation | Naive Evaluation |
| ------------------- | ------------------- | ---------------- |
| llama-3-8b-instruct | 27.7 | 27.8 |
# Evaluation Results Persistence
## Introduction
Normally, the evaluation results of OpenCompass are saved to your work directory. In some cases, however, users may need to share results or quickly browse existing public evaluation results. Therefore, we provide an interface that can quickly transfer evaluation results to an external public data station and, on top of that, supports uploading, overwriting, and reading results.
## Quick Start
### Uploading
By adding an argument to the evaluation command or adding a configuration item in the eval script, the evaluation results can be stored in the path you specify. Here are the examples:
(Approach 1) Add the argument to the command and specify your public path address.
```bash
opencompass ... -sp '/your_path'
```
(Approach 2) Add the configuration in the eval script.
```python
station_path = '/your_path'
```
### Overwriting
Before uploading, the above storage method first checks, based on the `abbr` attribute in the model and dataset configurations, whether the same task result already exists in the data station. If results already exist, the upload is skipped. If you need to update these results, add the `--station-overwrite` option to the command, for example:
```bash
opencompass ... -sp '/your_path' --station-overwrite
```
### Reading
You can directly read existing results from the data station to avoid duplicate evaluation tasks. The read results will directly participate in the 'summarize' step. When using this configuration, only tasks whose results are not stored in the data station will be launched. Here is an example:
```bash
opencompass ... -sp '/your_path' --read-from-station
```
### Command Combination
1. Only upload the results under your latest working directory to the data station, without running tasks whose results are missing:
```bash
opencompass ... -sp '/your_path' -r latest -m viz
```
## Storage Format of the Data Station
In the data station, the evaluation results are stored as `json` files for each `model-dataset` pair. The directory layout is `/your_path/dataset_name/model_name.json`. Each `json` file stores a dictionary with the results, including `predictions`, `results`, and `cfg`, for example:
```python
Result = {
    'predictions': List[Dict],
    'results': Dict,
    'cfg': Dict = {
        'models': Dict,
        'datasets': Dict,
        'judge_models': Dict  # only for subjective datasets
    }
}
```
Among these three keys, `predictions` records the model's prediction for each item in the dataset, `results` records the model's overall score on the dataset, and `cfg` records the detailed configurations of the model and the dataset for this evaluation task.
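For instance, a minimal sketch of reading one stored result back from the data station (the dataset and model abbreviations in the path are placeholders):
```python
import json

station_path = '/your_path'  # same path passed via -sp / station_path
result_file = f'{station_path}/my-dataset-abbr/my-model-abbr.json'  # placeholder abbrs

with open(result_file, 'r', encoding='utf-8') as f:
    result = json.load(f)

print(result['results'])            # overall scores of the model on this dataset
print(len(result['predictions']))   # number of per-sample prediction records
```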
# Prompt Attack
We support prompt attack following the idea of [PromptBench](https://github.com/microsoft/promptbench). The main purpose is to evaluate the robustness of prompt instructions: when the instruction prompt of a task is attacked or modified, how well does the task still perform compared with the original prompt?
## Set up environment
Some components are necessary for prompt attack experiments, so we need to set up the environment first.
```shell
git clone https://github.com/microsoft/promptbench.git
pip install textattack==0.3.8
export PYTHONPATH=$PYTHONPATH:promptbench/
```
## How to attack
### Add a dataset config
We will use the GLUE-wnli dataset as an example; for most configuration settings, refer to [config.md](../user_guides/config.md) for help.
First, we need to support the basic dataset config. You can find existing config files in `configs` or add your own config according to [new-dataset](./new_dataset.md).
Take the following `infer_cfg` as an example. We need to define the prompt template, where `adv_prompt` is the placeholder for the prompt to be attacked in the experiment, and `sentence1` and `sentence2` are the input columns of this dataset. The attack will only modify the `adv_prompt` field.
Then, we use `AttackInferencer` with `original_prompt_list` and `adv_key` to tell the inferencer where to attack and which text to attack.
For more details, refer to the `configs/datasets/promptbench/promptbench_wnli_gen_50662f.py` config file.
```python
original_prompt_list = [
    'Are the following two sentences entailment or not_entailment? Answer me with "A. entailment" or "B. not_entailment", just one word. ',
    "Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'.",
    ...,
]

wnli_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(
                role="HUMAN",
                prompt="""{adv_prompt}
Sentence 1: {sentence1}
Sentence 2: {sentence2}
Answer:"""),
        ]),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(
        type=AttackInferencer,
        original_prompt_list=original_prompt_list,
        adv_key='adv_prompt'))
```
### Add an eval config
We should use `OpenICLAttackTask` for the attack task. `NaivePartitioner` should also be used: the attack experiment runs the whole dataset repeatedly, nearly hundreds of times, to search for the best attack, so we do not want to split the dataset.
```note
Please choose a small dataset (fewer than 1000 examples) for the attack. Because of the aforementioned repeated search, the time cost is otherwise enormous.
```
There are several other options in the `attack` config:
- `attack`: attack type; available options include `textfooler`, `textbugger`, `deepwordbug`, `bertattack`, `checklist`, `stresstest`;
- `query_budget`: upper bound on the number of queries, i.e., the total number of times the dataset is run;
- `prompt_topk`: number of top-k prompts to be attacked. In most cases, the original prompt list has more than 10 entries, and running the whole set is time-consuming.
```python
# Run the whole dataset at a time, i.e. use `NaivePartitioner` only
# Use `OpenICLAttackTask` to perform the attack experiment
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=8,
        task=dict(type=OpenICLAttackTask),
        retry=0),
)

attack = dict(
    attack='textfooler',
    query_budget=100,
    prompt_topk=2,
)
```
### Run the experiment
Please use `--mode infer` when running the attack experiment, and make sure the `PYTHONPATH` environment variable is set.
```shell
python run.py examples/eval_attack.py --mode infer
```
All results will be saved in the `attack` folder.
The output includes the accuracy of each original prompt and, for the top-k prompts, the attacked prompt together with the dropped accuracy, for instance:
```
Prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'., acc: 59.15%
Prompt: Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'., acc: 57.75%
Prompt: Analyze the two provided sentences and decide if their relationship is 'A. entailment' or 'B. not_entailment'., acc: 56.34%
Prompt: Identify whether the given pair of sentences demonstrates entailment or not_entailment. Answer with 'A. entailment' or 'B. not_entailment'., acc: 54.93%
...
Original prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'.
Attacked prompt: b"Assess the attach between the following sentences and sorted it as 'A. entailment' or 'B. not_entailment'."
Original acc: 59.15%, attacked acc: 40.85%, dropped acc: 18.31%
```
# Subjective Evaluation Guidance
## Introduction
Subjective evaluation aims to assess the model's performance in tasks that align with human preferences. The gold standard for such evaluation is human preference, but annotation is costly.
To explore the model's subjective capabilities, we employ a JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
Popular evaluation methods include:
- Compare Mode: comparing model responses pairwise to calculate their win rate
- Score Mode: scoring a single model's response ([Chatbot Arena](https://chat.lmsys.org/))
We support using GPT-4 (or other JudgeLLMs) for subjective evaluation of models based on the above methods.
## Currently Supported Subjective Evaluation Datasets
1. AlignBench Chinese Scoring Dataset (https://github.com/THUDM/AlignBench)
2. MTBench English Scoring Dataset, two-turn dialogue (https://github.com/lm-sys/FastChat)
3. MTBench101 English Scoring Dataset, multi-turn dialogue (https://github.com/mtbench101/mt-bench-101)
4. AlpacaEvalv2 English Compare Dataset (https://github.com/tatsu-lab/alpaca_eval)
5. ArenaHard English Compare Dataset, mainly focused on coding (https://github.com/lm-sys/arena-hard/tree/main)
6. Fofo English Scoring Dataset (https://github.com/SalesforceAIResearch/FoFo/)
7. Wildbench English Score and Compare Dataset (https://github.com/allenai/WildBench)
## Initiating Subjective Evaluation
Similar to existing objective evaluation methods, you can configure related settings in `examples/eval_subjective.py`.
### Basic Parameters: Specifying models, datasets, and judgemodels
Similar to objective evaluation, import the models and datasets that need to be evaluated, for example:
```python
with read_base():
    from .datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from .datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import subjective_datasets as alpacav2
    from .models.qwen.hf_qwen_7b import models
```
It is worth noting that the model setup parameters for subjective evaluation often differ from those for objective evaluation: subjective evaluation usually requires sampling-based inference (`do_sample`) rather than greedy decoding. You can modify the relevant parameters in the configuration file as needed, for example:
```python
models = [
    dict(
        type=HuggingFaceChatGLM3,
        abbr='chatglm3-6b-hf2',
        path='THUDM/chatglm3-6b',
        tokenizer_path='THUDM/chatglm3-6b',
        model_kwargs=dict(
            device_map='auto',
            trust_remote_code=True,
        ),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        generation_kwargs=dict(
            do_sample=True,
        ),
        meta_template=api_meta_template,
        max_out_len=2048,
        max_seq_len=4096,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```
The judge model is usually set to a powerful model such as GPT-4. You can directly enter your API key according to the configuration in the config file, or use a custom model as the judge model.
### Specifying Other Parameters
In addition to the basic parameters, you can also modify the `infer` and `eval` fields in the config to set a more appropriate partitioning method. Three partitioning methods are currently supported: `NaivePartitioner`, `SizePartitioner`, and `NumWorkerPartitioner`. You can also specify your own `work_dir` to save related files.
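For example, here is a hedged sketch of overriding the `infer` and `eval` fields (task sizes and worker counts are illustrative; `models` and `judge_models` are assumed to be defined earlier in the config, as in the basic-parameters example):
```python
from opencompass.partitioners import SizePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

# Inference: split large datasets into chunks by estimated task size.
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=10000),
    runner=dict(type=LocalRunner, max_num_workers=32,
                task=dict(type=OpenICLInferTask)),
)

# Evaluation: judge each model's outputs with the configured judge models.
eval = dict(
    partitioner=dict(
        type=SubjectiveSizePartitioner, max_task_size=10000,
        mode='singlescore', models=models, judge_models=judge_models,
    ),
    runner=dict(type=LocalRunner, max_num_workers=8,
                task=dict(type=SubjectiveEvalTask)),
)

work_dir = 'outputs/subjective/'
```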
## Subjective Evaluation with Custom Dataset
The specific process includes:
1. Data preparation
2. Model response generation
3. Evaluate the response with a JudgeLLM
4. Generate JudgeLLM's response and calculate the metric
### Step-1: Data Preparation
This step requires preparing the dataset file and implementing your own dataset class under `opencompass/datasets/subjective/`, returning the loaded data as a `list` of `dict`.
Actually, you can prepare the data in any format you like (csv, json, jsonl, etc.). However, to make it easier to get started, it is recommended to construct the data according to the format of the existing subjective datasets or according to the following JSON format.
We provide mini test sets for **Compare Mode** and **Score Mode** below:
```python
###COREV2
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },
    ...
]

###CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家,我让你给谁发送什么主题的邮件,你就帮我扩充好邮件正文,并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题,来斟酌用词,并使用合适的敬语。现在请给导师发送邮件,询问他是否可以下周三下午15:00进行科研同步会,大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
```
The JSON must include the following fields:
- 'question': Question description
- 'capability': The capability dimension of the question.
- 'others': Other needed information.
If you want to customize the prompt for each individual question, you can put the additional information into 'others' and use it when constructing the prompt.
### Step-2: Evaluation Configuration (Compare Mode)
Taking Alignbench as an example, `configs/datasets/subjective/alignbench/alignbench_judgeby_critiquellm.py`:
1. First, you need to set `subjective_reader_cfg` to receive the relevant fields returned from the custom Dataset class and specify the output fields when saving files.
2. Then, you need to specify the root path `data_path` of the dataset and the dataset filename `subjective_all_sets`. If there are multiple sub-files, you can add them to this list.
3. Specify `subjective_infer_cfg` and `subjective_eval_cfg` to configure the corresponding inference and evaluation prompts.
4. Specify additional information such as `mode` at the corresponding location. Note that the fields required for different subjective datasets may vary.
5. Define post-processing and score statistics. For example, the postprocessing function `alignbench_postprocess` located under `opencompass/opencompass/datasets/subjective/alignbench`.
### Step-3: Launch the Evaluation
```shell
python run.py config/eval_subjective_score.py -r
```
The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.
The response of JudgeLLM will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`.
The evaluation report will be output to `output/.../summary/timestamp/report.csv`.
## Multi-round Subjective Evaluation in OpenCompass
In OpenCompass, we also support subjective multi-turn dialogue evaluation. For instance, the evaluation of MT-Bench can be referred to in `configs/datasets/subjective/multiround`.
In the multi-turn dialogue evaluation, you need to organize the data format into the following dialogue structure:
```
"dialogue": [
    {
        "role": "user",
        "content": "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position? Where is the person you just overtook?"
    },
    {
        "role": "assistant",
        "content": ""
    },
    {
        "role": "user",
        "content": "If the \"second person\" is changed to \"last person\" in the above question, what would the answer be?"
    },
    {
        "role": "assistant",
        "content": ""
    }
],
```
It's important to note that because different question types in MTBench use different temperature settings, we need to divide the original data files into three subsets by temperature and run inference on them separately, setting a different temperature for each subset. For specific settings, please refer to `configs/datasets/subjective/multiround/mtbench_single_judge_diff_temp.py`.
# flake8: noqa
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import subprocess
import sys
import pytorch_sphinx_theme
from sphinx.builders.html import StandaloneHTMLBuilder
sys.path.insert(0, os.path.abspath('../../'))
# -- Project information -----------------------------------------------------
project = 'OpenCompass'
copyright = '2023, OpenCompass'
author = 'OpenCompass Authors'
# The full version, including alpha/beta/rc tags
version_file = '../../opencompass/__init__.py'
def get_version():
    with open(version_file, 'r') as f:
        exec(compile(f.read(), version_file, 'exec'))
    return locals()['__version__']
release = get_version()
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'myst_parser',
'sphinx_copybutton',
'sphinx_tabs.tabs',
'notfound.extension',
'sphinxcontrib.jquery',
'sphinx_design',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = {
'.rst': 'restructuredtext',
'.md': 'markdown',
}
language = 'en'
# The master toctree document.
root_doc = 'index'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'pytorch_sphinx_theme'
html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
# yapf: disable
html_theme_options = {
    'menu': [
        {
            'name': 'GitHub',
            'url': 'https://github.com/open-compass/opencompass'
        },
    ],
    # Specify the language of shared menu
    'menu_lang': 'en',
    # Disable the default edit on GitHub
    'default_edit_on_github': False,
}
# yapf: enable
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
'css/readthedocs.css'
]
html_js_files = [
'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
'js/custom.js'
]
html_context = {
'github_version': 'main',
}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'opencompassdoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(root_doc, 'opencompass.tex', 'OpenCompass Documentation', author,
'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [(root_doc, 'opencompass', 'OpenCompass Documentation', [author],
1)]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(root_doc, 'opencompass', 'OpenCompass Documentation', author,
'OpenCompass Authors', 'AGI evaluation toolbox and benchmark.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# set priority when building html
StandaloneHTMLBuilder.supported_image_types = [
'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
]
# -- Extension configuration -------------------------------------------------
# Ignore >>> when copying code
copybutton_prompt_text = r'>>> |\.\.\. '
copybutton_prompt_is_regexp = True
# Auto-generated header anchors
myst_heading_anchors = 3
# Enable "colon_fence" extension of myst.
myst_enable_extensions = ['colon_fence', 'dollarmath']
# Configuration for intersphinx
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
'numpy': ('https://numpy.org/doc/stable', None),
'torch': ('https://pytorch.org/docs/stable/', None),
'mmengine': ('https://mmengine.readthedocs.io/en/latest/', None),
'transformers':
('https://huggingface.co/docs/transformers/main/en/', None),
}
napoleon_custom_sections = [
# Custom sections for data elements.
('Meta fields', 'params_style'),
('Data fields', 'params_style'),
]
# Disable docstring inheritance
autodoc_inherit_docstrings = False
# Mock some imports during generate API docs.
autodoc_mock_imports = ['rich', 'attr', 'einops']
# Disable displaying type annotations, these can be very verbose
autodoc_typehints = 'none'
# The not found page
notfound_template = '404.html'
def builder_inited_handler(app):
    subprocess.run(['./statis.py'])


def setup(app):
    app.connect('builder-inited', builder_inited_handler)
[html writers]
table_style: colwidths-auto
# FAQ
## General
### What are the differences and connections between `ppl` and `gen`?
`ppl` stands for perplexity, an index used to evaluate a model's language modeling capabilities. In the context of OpenCompass, it generally refers to a method of answering multiple-choice questions: given a context, the model needs to choose the most appropriate option from multiple choices. In this case, we concatenate the n options with the context to form n sequences, then calculate the model's perplexity for these n sequences. We consider the option corresponding to the sequence with the lowest perplexity as the model's reasoning result for this question. This evaluation method is simple and direct in post-processing, with high certainty.
`gen` is an abbreviation for generate. In the context of OpenCompass, it refers to the model's continuation writing result given a context as the reasoning result for a question. Generally, the string obtained from continuation writing requires a heavier post-processing process to extract reliable answers and complete the evaluation.
In terms of usage, multiple-choice questions and some multiple-choice-like questions of the base model use `ppl`, while the base model's multiple-selection and non-multiple-choice questions use `gen`. All questions of the chat model use `gen`, as many commercial API models do not expose the `ppl` interface. However, there are exceptions: for example, when we want the base model to output the problem-solving process (e.g., "Let's think step by step"), we also use `gen`. The overall usage is shown in the following table:
| | ppl | gen |
| ---------- | -------------- | -------------------- |
| Base Model | Only MCQ Tasks | Tasks Other Than MCQ |
| Chat Model | None | All Tasks |
Similar to `ppl`, conditional log probability (`clp`) calculates the probability of the next token given a context. It is also only applicable to multiple-choice questions, and the range of probability calculation is limited to the tokens corresponding to the option numbers. The option corresponding to the token with the highest probability is considered the model's reasoning result. Compared to `ppl`, `clp` calculation is more efficient, requiring only one inference, whereas `ppl` requires n inferences. However, the drawback is that `clp` is subject to the tokenizer. For example, the presence or absence of space symbols before and after an option can change the tokenizer's encoding result, leading to unreliable test results. Therefore, `clp` is rarely used in OpenCompass.
### How does OpenCompass control the number of shots in few-shot evaluations?
In the dataset configuration file, there is a retriever field indicating how to recall samples from the dataset as context examples. The most commonly used is `FixKRetriever`, which means using a fixed k samples, hence k-shot. There is also `ZeroRetriever`, which means not using any samples, which in most cases implies 0-shot.
On the other hand, in-context samples can also be directly specified in the dataset template. In this case, `ZeroRetriever` is also used, but the evaluation is not 0-shot and needs to be determined based on the specific template. Refer to [prompt](../prompt/prompt_template.md) for more details
### How does OpenCompass allocate GPUs?
OpenCompass processes evaluation requests using the unit termed as "task". Each task is an independent combination of model(s) and dataset(s). The GPU resources needed for a task are determined entirely by the model being evaluated, specifically by the `num_gpus` parameter.
During evaluation, OpenCompass deploys multiple workers to execute tasks in parallel. These workers continuously try to secure GPU resources and run tasks until they succeed. As a result, OpenCompass always strives to leverage all available GPU resources to their maximum capacity.
For instance, if you're using OpenCompass on a local machine equipped with 8 GPUs, and each task demands 4 GPUs, then by default, OpenCompass will employ all 8 GPUs to concurrently run 2 tasks. However, if you adjust the `--max-num-workers` setting to 1, then only one task will be processed at a time, utilizing just 4 GPUs.
### Why doesn't the GPU behavior of HuggingFace models align with my expectations?
This is a complex issue that needs to be explained from both the supply and demand sides:
The supply side refers to how many tasks are being run. A task is a combination of a model and a dataset, and it primarily depends on how many models and datasets need to be tested. Additionally, since OpenCompass splits a larger task into multiple smaller tasks, the number of data entries per sub-task (`--max-partition-size`) also affects the number of tasks. (The `--max-partition-size` is proportional to the actual number of data entries, but the relationship is not 1:1).
The demand side refers to how many workers are running. Since OpenCompass instantiates multiple models for inference simultaneously, we use `--hf-num-gpus` to specify how many GPUs each instance uses. Note that `--hf-num-gpus` is a parameter specific to HuggingFace models and setting this parameter for non-HuggingFace models will not have any effect. We also use `--max-num-workers` to indicate the maximum number of instances running at the same time. Lastly, due to issues like GPU memory and insufficient load, OpenCompass also supports running multiple instances on the same GPU, which is managed by the parameter `--max-num-workers-per-gpu`. Therefore, it can be generally assumed that we will use a total of `--hf-num-gpus` * `--max-num-workers` / `--max-num-workers-per-gpu` GPUs.
In summary, when tasks run slowly or the GPU load is low, we first need to check if the supply is sufficient. If not, consider reducing `--max-partition-size` to split the tasks into finer parts. Next, we need to check if the demand is sufficient. If not, consider increasing `--max-num-workers` and `--max-num-workers-per-gpu`. Generally, **we set `--hf-num-gpus` to the minimum value that meets the demand and do not adjust it further.**
### How do I control the number of GPUs that OpenCompass occupies?
Currently, there isn't a direct method to specify the number of GPUs OpenCompass can utilize. However, the following are some indirect strategies:
**If evaluating locally:**
You can limit OpenCompass's GPU access by setting the `CUDA_VISIBLE_DEVICES` environment variable. For instance, using `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` will only expose the first four GPUs to OpenCompass, ensuring it uses no more than these four GPUs simultaneously.
**If using Slurm or DLC:**
Although OpenCompass doesn't have direct access to the resource pool, you can adjust the `--max-num-workers` parameter to restrict the number of evaluation tasks being submitted simultaneously. This will indirectly manage the number of GPUs that OpenCompass employs. For instance, if each task requires 4 GPUs, and you wish to allocate a total of 8 GPUs, then you should set `--max-num-workers` to 2.
### `libGL.so.1` not found
opencv-python depends on some dynamic libraries that are not present in the environment. The simplest solution is to uninstall opencv-python and then install opencv-python-headless.
```bash
pip uninstall opencv-python
pip install opencv-python-headless
```
Alternatively, you can install the corresponding dependency libraries according to the error message
```bash
sudo apt-get update
sudo apt-get install -y libgl1 libglib2.0-0
```
## Network
### My tasks failed with error: `('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))` or `urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443)`
Because of HuggingFace's implementation, OpenCompass requires a network connection (especially to HuggingFace) the first time it loads some datasets and models. Additionally, it connects to HuggingFace each time it is launched. For a successful run, you may:
- Work behind a proxy by specifying the environment variables `http_proxy` and `https_proxy`;
- Use the cache files from other machines. You may first run the experiment on a machine that has access to the Internet, and then copy the cached files to the offline one. The cached files are located at `~/.cache/huggingface/` by default ([doc](https://huggingface.co/docs/datasets/cache#cache-directory)). When the cached files are ready, you can start the evaluation in offline mode:
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```
With this, no further network connection is needed for the evaluation. However, an error will still be raised if the files of any dataset or model are missing from the cache.
- Use mirror like [hf-mirror](https://hf-mirror.com/)
```bash
HF_ENDPOINT=https://hf-mirror.com python run.py ...
```
### My server cannot connect to the Internet, how can I use OpenCompass?
Use the cache files from other machines, as suggested in the answer to [Network-Q1](#my-tasks-failed-with-error-connection-aborted-connectionreseterror104-connection-reset-by-peer-or-urllib3exceptionsmaxretryerror-httpsconnectionpoolhostcdn-lfshuggingfaceco-port443).
### In evaluation phase, I'm running into an error saying that `FileNotFoundError: Couldn't find a module script at opencompass/accuracy.py. Module 'accuracy' doesn't exist on the Hugging Face Hub either.`
HuggingFace tries to load the metric (e.g. `accuracy`) as a module online, which can fail if the network is unreachable. Please refer to [Network-Q1](#my-tasks-failed-with-error-connection-aborted-connectionreseterror104-connection-reset-by-peer-or-urllib3exceptionsmaxretryerror-httpsconnectionpoolhostcdn-lfshuggingfaceco-port443) for guidelines to fix your network issue.
The issue has been fixed in the latest version of OpenCompass, so you might also consider pulling the latest version.
## Efficiency
### Why does OpenCompass partition each evaluation request into tasks?
Given the extensive evaluation time and the vast quantity of datasets, conducting a comprehensive linear evaluation on LLM models can be immensely time-consuming. To address this, OpenCompass divides the evaluation request into multiple independent "tasks". These tasks are then dispatched to various GPU groups or nodes, achieving full parallelism and maximizing the efficiency of computational resources.
### How does task partitioning work?
Each task in OpenCompass represents a combination of specific model(s) and portions of the dataset awaiting evaluation. OpenCompass offers a variety of task partitioning strategies, each tailored for different scenarios. During the inference stage, the prevalent partitioning method seeks to balance task size, or computational cost. This cost is heuristically derived from the dataset size and the type of inference.
### Why does it take more time to evaluate LLM models on OpenCompass?
There is a tradeoff between the number of tasks and the time to load the model. For example, if we partition a request that evaluates a model against a dataset into 100 tasks, the model will be loaded 100 times in total. When resources are abundant, these 100 tasks can be executed in parallel, so the additional time spent on model loading can be ignored. However, if resources are limited, these 100 tasks will operate more sequentially, and repeated loadings can become a bottleneck in execution time.
Hence, if users find that the number of tasks greatly exceeds the available GPUs, we advise setting the `--max-partition-size` to a larger value.
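For example (the value is illustrative; see `--max-partition-size` in the Quick Start parameter list):
```bash
# Larger partitions -> fewer tasks -> fewer repeated model loads
python run.py configs/eval_demo.py --max-partition-size 40000
```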
## Model
### How to use the downloaded huggingface models?
If you have already downloaded the model checkpoints, you can specify the local path to the model. For example:
```bash
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path /path/to/model
```
## Dataset
### How to build a new dataset?
- For building new objective dataset: [new_dataset](../advanced_guides/new_dataset.md)
- For building new subjective dataset: [subjective_evaluation](../advanced_guides/subjective_evaluation.md)
# Installation
## Basic Installation
1. Prepare the OpenCompass runtime environment using Conda:
```bash
conda create --name opencompass python=3.10 -y
# conda create --name opencompass_lmdeploy python=3.10 -y
conda activate opencompass
```
If you want to customize the PyTorch version or related CUDA version, please refer to the [official documentation](https://pytorch.org/get-started/locally/) to set up the PyTorch environment. Note that OpenCompass requires `pytorch>=1.13`.
2. Install OpenCompass:
- pip Installation
```bash
# For support of most datasets and models
pip install -U opencompass
# Complete installation (supports more datasets)
# pip install "opencompass[full]"
# API Testing (e.g., OpenAI, Qwen)
# pip install "opencompass[api]"
```
- Building from Source Code: if you want to use the latest features of OpenCompass
```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```
## Other Installations
### Inference Backends
```bash
# Model inference backends. Since these backends often have dependency conflicts,
# we recommend using separate virtual environments to manage them.
pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"
```
- LMDeploy
You can check if the inference backend has been installed successfully with the following command. For more information, refer to the [official documentation](https://lmdeploy.readthedocs.io/en/latest/get_started.html)
```bash
lmdeploy chat internlm/internlm2_5-1_8b-chat --backend turbomind
```
- vLLM
You can check if the inference backend has been installed successfully with the following command. For more information, refer to the [official documentation](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
```bash
vllm serve facebook/opt-125m
```
### API
OpenCompass supports calling various commercial model APIs. You can install the required packages via pip, or refer to the [API dependencies](https://github.com/open-compass/opencompass/blob/main/requirements/api.txt) for the dependencies of specific API models.
```bash
pip install "opencompass[api]"
# pip install openai # GPT-3.5-Turbo / GPT-4-Turbo / GPT-4 / GPT-4o (API)
# pip install anthropic # Claude (API)
# pip install dashscope # Qwen (API)
# pip install volcengine-python-sdk # ByteDance Volcano Engine (API)
# ...
```
### Datasets
The basic installation supports most fundamental datasets. For certain datasets (e.g., Alpaca-eval, Longbench, etc.), additional dependencies need to be installed.
You can install these through pip or refer to the [additional dependencies](https://github.com/open-compass/opencompass/blob/main/requirements/extra.txt) for specific dependencies.
```bash
pip install "opencompass[full]"
```
For HumanEvalX / HumanEval+ / MBPP+, you need to manually clone the Git repository and install it.
```bash
git clone --recurse-submodules git@github.com:open-compass/human-eval.git
cd human-eval
pip install -e .
pip install -e evalplus
```
Some agent evaluations require installing numerous dependencies, which may conflict with existing runtime environments. We recommend creating separate conda environments to manage these.
```bash
# T-Eval
pip install lagent==0.1.2
# CIBench
pip install -r requirements/agent.txt
```
# Dataset Preparation
The datasets supported by OpenCompass mainly include three parts:
1. Huggingface datasets: The [Huggingface Datasets](https://huggingface.co/datasets) provide a large number of datasets, which will be **automatically downloaded** when running with this option.
2. ModelScope Datasets: [ModelScope OpenCompass Dataset](https://modelscope.cn/organization/opencompass) supports automatic downloading of datasets from ModelScope.
To enable this feature, set the environment variable: `export DATASET_SOURCE=ModelScope`. The available datasets include (sourced from OpenCompassData-core.zip):
```plain
humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ceval, math, LCSTS, Xsum, winogrande, openbookqa, AGIEval, gsm8k, nq, race, siqa, mbpp, mmlu, hellaswag, ARC, BBH, xstory_cloze, summedits, GAOKAO-BENCH, OCNLI, cmnli
```
3. Custom dataset: OpenCompass also provides some Chinese **self-built** datasets, which need to be **manually downloaded and extracted**.
Running the following commands to download and place the datasets in the `${OpenCompass}/data` directory completes the dataset preparation.
```bash
# Run in the OpenCompass directory
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```
If you need to use the more comprehensive dataset (~500M) provided by OpenCompass, you can download and `unzip` it using the following commands:
```bash
# For proxy and resumable downloads, try `aria2c -x16 -s16 -k1M "http://ghfast.top/https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-complete-20240207.zip" `
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-complete-20240207.zip
unzip OpenCompassData-complete-20240207.zip
cd ./data
find . -name "*.zip" -exec unzip "{}" \;
```
The list of datasets included in both `.zip` files can be found [here](https://github.com/open-compass/opencompass/releases/tag/0.2.2.rc1).
OpenCompass supports most of the datasets commonly used for performance comparison; please refer to `configs/datasets` for the specific list of supported datasets.
For next step, please read [Quick Start](./quick_start.md).
# Quick Start
![image](https://github.com/open-compass/opencompass/assets/22607038/d063cae0-3297-4fd2-921a-366e0a24890b)
## Overview
OpenCompass provides a streamlined workflow for evaluating a model, which consists of the following stages: **Configure** -> **Inference** -> **Evaluation** -> **Visualization**.
**Configure**: This is your starting point. Here, you'll set up the entire evaluation process, choosing the model(s) and dataset(s) to assess. You also have the option to select an evaluation strategy, the computation backend, and define how you'd like the results displayed.
**Inference & Evaluation**: OpenCompass efficiently manages the heavy lifting, conducting parallel inference and evaluation on your chosen model(s) and dataset(s). The **Inference** phase is all about producing outputs from your datasets, whereas the **Evaluation** phase measures how well these outputs align with the gold standard answers. While this procedure is broken down into multiple "tasks" that run concurrently for greater efficiency, be aware that working with limited computational resources might introduce some unexpected overheads, resulting in generally slower evaluation. To understand this issue and know how to solve it, check out [FAQ: Efficiency](faq.md#efficiency).
**Visualization**: Once the evaluation is done, OpenCompass collates the results into an easy-to-read table and saves them as both CSV and TXT files. If you need real-time updates, you can activate lark reporting and get immediate status reports in your Lark clients.
Coming up, we'll walk you through the basics of OpenCompass, showcasing evaluations of pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winograd_wsc) benchmark tasks. Their configuration files can be found at [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).
Before running this experiment, please make sure you have installed OpenCompass locally; the experiment should run successfully on a single _GTX-1660-6G_ GPU.
For larger parameterized models like Llama-7B, refer to other examples provided in the [configs directory](https://github.com/open-compass/opencompass/tree/main/configs).
## Configuring an Evaluation Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.
`````{tabs}
````{tab} Command Line (Custom HF Model)
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, for the `facebook/opt-125m` model, you can evaluate it with the following command:
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-type base \
--hf-path facebook/opt-125m
```
Note that in this way, OpenCompass only evaluates one model at a time, while other ways can evaluate multiple models at once.
```{caution}
`--hf-num-gpus` does not stand for the actual number of GPUs to use in evaluation, but the minimum required number of GPUs for this model. [More](faq.md#how-does-opencompass-allocate-gpus)
```
:::{dropdown} More detailed example
:animate: fade-in-slide-down
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-type base \ # HuggingFace model type, base or chat
--hf-path facebook/opt-125m \ # HuggingFace model path
--tokenizer-path facebook/opt-125m \ # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \ # Arguments to construct the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of tokens to generate
--min-out-len 100 \ # Minimum number of tokens to generate
--batch-size 64 \ # Batch size
--hf-num-gpus 1 # Number of GPUs required to run the model
```
```{seealso}
For all HuggingFace related parameters supported by `run.py`, please read [Launching Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task).
```
:::
````
````{tab} Command Line
Users can combine the models and datasets they want to test using `--models` and `--datasets`.
```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```
The models and datasets are pre-stored in the form of configuration files in `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations using `tools/list_configs.py`.
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
:::{dropdown} More about `list_configs`
:animate: fade-in-slide-down
Running `python tools/list_configs.py llama mmlu` gives the output like:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| ... | ... |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| ... | ... |
+-------------------+---------------------------------------------------+
```
Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that its prompts or evaluation methods are different.
:::
:::{dropdown} Model not on the list?
:animate: fade-in-slide-down
If you want to evaluate other models, please check out the "Command Line (Custom HF Model)" tab for the way to construct a custom HF model without a configuration file, or "Configuration File" tab to learn the general way to prepare your model configurations.
:::
````
````{tab} Configuration File
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. The configuration file is organized in Python format and must include the `datasets` and `models` fields.
The test configuration for this time is [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py). This configuration introduces the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#inheritance-mechanism) and combines the `datasets` and `models` fields in the required format.
```python
from mmengine.config import read_base
with read_base():
from .datasets.siqa.siqa_gen import siqa_datasets
from .datasets.winograd.winograd_ppl import winograd_datasets
from .models.opt.hf_opt_125m import opt125m
from .models.opt.hf_opt_350m import opt350m
datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```
When running tasks, we just need to pass the path of the configuration file to `run.py`:
```bash
python run.py configs/eval_demo.py
```
:::{dropdown} More about `models`
:animate: fade-in-slide-down
OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](https://github.com/open-compass/opencompass/blob/main/configs/models/opt/hf_opt_350m.py) (`configs/models/opt/hf_opt_350m.py`):
```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceBaseModel`
from opencompass.models import HuggingFaceBaseModel
models = [
# OPT-350M
dict(
type=HuggingFaceBaseModel,
# Initialization parameters for `HuggingFaceBaseModel`
path='facebook/opt-350m',
# Below are common parameters for all models, not specific to HuggingFaceBaseModel
abbr='opt-350m-hf', # Model abbreviation
max_out_len=1024, # Maximum number of generated tokens
batch_size=32, # Batch size
run_cfg=dict(num_gpus=1), # The required GPU numbers for this model
)
]
```
When using configurations, we can specify the relevant files through the command-line argument `--models` or import the model configurations into the `models` list in the configuration file using the inheritance mechanism.
```{seealso}
More information about model configuration can be found in [Prepare Models](../user_guides/models.md).
```
:::
:::{dropdown} More about `datasets`
:animate: fade-in-slide-down
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line or import related configurations in the configuration file via inheritance.
Below is a dataset-related configuration snippet from `configs/eval_demo.py`:
```python
from mmengine.config import read_base # Use mmengine.read_base() to read the base configuration
with read_base():
# Directly read the required dataset configurations from the preset dataset configurations
from .datasets.winograd.winograd_ppl import winograd_datasets # Read Winograd configuration, evaluated based on PPL (perplexity)
from .datasets.siqa.siqa_gen import siqa_datasets # Read SIQA configuration, evaluated based on generation
datasets = [*siqa_datasets, *winograd_datasets] # The final config needs to contain the required evaluation dataset list 'datasets'
```
Dataset configurations typically come in two types, 'ppl' and 'gen', indicating the evaluation method used: `ppl` refers to discriminative evaluation, while `gen` refers to generative evaluation.
Moreover, [configs/datasets/collections](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections) houses various dataset collections, making it convenient for comprehensive evaluations. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To replicate results, simply import that file, for example:
```bash
python run.py --models hf_llama_7b --datasets base_medium
```
```{seealso}
You can find more information from [Dataset Preparation](../user_guides/datasets.md).
```
:::
````
`````
```{warning}
OpenCompass usually assumes network is available. If you encounter network issues or wish to run OpenCompass in an offline environment, please refer to [FAQ - Network - Q1](./faq.md#network) for solutions.
```
The following sections will use the configuration-based method as an example to explain the other features.
## Launching Evaluation
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in `--debug` mode for the first run and check if there is any problem. In `--debug` mode, the tasks will be executed sequentially and output will be printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
If everything is fine, you should see "Starting inference process" on screen:
```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
Then you can press `ctrl+c` to interrupt the program, and run the following command in normal mode:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
In normal mode, the evaluation tasks will be executed in parallel in the background, and their output will be redirected to the output directory `outputs/demo/{TIMESTAMP}`. The progress bar on the frontend only indicates the number of completed tasks, regardless of their success or failure. **Any backend task failures will only trigger a warning message in the terminal.**
:::{dropdown} More parameters in `run.py`
:animate: fade-in-slide-down
Here are some parameters related to evaluation that can help you configure more efficient inference tasks based on your environment:
- `-w outputs/demo`: Work directory to save evaluation logs and results. In this case, the experiment result will be saved to `outputs/demo/{TIMESTAMP}`.
- `-r`: Reuse existing inference results and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused. (See the example after this list.)
- `--mode all`: Specify a specific stage of the task.
  - all: (Default) Perform a complete evaluation, including inference and evaluation.
  - infer: Perform inference on each dataset.
  - eval: Perform evaluation based on the inference results.
  - viz: Display evaluation results only.
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large, and using this parameter can split them into multiple sub-tasks to efficiently utilize resources. However, if the partition is too fine, the overall speed may be slower due to longer model loading times.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks. In a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks depends on the available GPU resources and may not be equal to this number.
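For example, a sketch combining these flags (the timestamp is illustrative):
```bash
# Reuse the inference results of a previous run and only perform the evaluation stage
python run.py configs/eval_demo.py -w outputs/demo -r 20230220_183030 --mode eval
# Passing -r without a timestamp reuses the latest results under the work directory
python run.py configs/eval_demo.py -w outputs/demo -r
```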
If you are not performing the evaluation on your local machine but using a Slurm cluster, you can specify the following parameters:
- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.
```{seealso}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task) for details.
```
:::
## Visualizing Evaluation Results
After the evaluation is complete, the evaluation results table will be printed as follows:
```text
dataset version metric mode opt350m opt125m
--------- --------- -------- ------ --------- ---------
siqa e78df3 accuracy gen 21.55 12.44
winograd b6c7ed accuracy ppl 51.23 49.82
```
All run outputs will be directed to the `outputs/demo/` directory with the following structure:
```text
outputs/demo/
├── 20200220_120000
├── 20230220_183030 # one experiment per folder
│ ├── configs # Dumped config files for record. Multiple configs may be kept if different experiments have been re-run on the same experiment folder
│ ├── logs # log files for both inference and evaluation stages
│ │ ├── eval
│ │ └── infer
│   ├── predictions # Prediction results for each task
│   ├── results # Evaluation results for each task
│   └── summary # Summarized evaluation results for a single experiment
├── ...
```
The summarization process can be further customized in the configuration, for example to report the averaged score of groups of benchmarks (MMLU, C-Eval, etc.); a minimal sketch is shown below.
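The group name `demo_average` below is illustrative rather than a preset; `dataset_abbrs` controls which rows appear in the final table, and each entry of `summary_groups` averages its `subsets` into one row.
```python
summarizer = dict(
    # Rows to display in the final table; group names appear as averaged scores
    dataset_abbrs=['siqa', 'winograd', 'demo_average'],
    summary_groups=[
        # Average the accuracy of the two demo datasets under one name
        dict(name='demo_average', subsets=['siqa', 'winograd']),
    ],
)
```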
More information about obtaining evaluation results can be found in [Results Summary](../user_guides/summarizer.md).
## Additional Tutorials
To learn more about using OpenCompass, explore the following tutorials:
- [Prepare Datasets](../user_guides/datasets.md)
- [Prepare Models](../user_guides/models.md)
- [Task Execution and Monitoring](../user_guides/experimentation.md)
- [Understand Prompts](../prompt/overview.md)
- [Results Summary](../user_guides/summarizer.md)
- [Learn about Config](../user_guides/config.md)
Welcome to OpenCompass' documentation!
==========================================
Getting started with OpenCompass
----------------------------------------
To help you quickly get familiar with OpenCompass, we recommend you walk through the following documents in order:
- First read the GetStarted_ section to set up the environment and run a mini experiment.
- Then learn its basic usage through the UserGuides_.
- If you want to tune the prompts, refer to the Prompt_.
- If you want to customize some modules, like adding a new dataset or model, we have provided the AdvancedGuides_.
- There are more handy tools, such as prompt viewer and lark bot reporter, all presented in Tools_.
We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
.. _GetStarted:
.. toctree::
:maxdepth: 1
:caption: Get Started
get_started/installation.md
get_started/quick_start.md
get_started/faq.md
.. _UserGuides:
.. toctree::
:maxdepth: 1
:caption: User Guides
user_guides/framework_overview.md
user_guides/config.md
user_guides/datasets.md
user_guides/models.md
user_guides/evaluation.md
user_guides/experimentation.md
user_guides/metrics.md
user_guides/deepseek_r1.md
user_guides/interns1.md
.. _Prompt:
.. toctree::
:maxdepth: 1
:caption: Prompt
prompt/overview.md
prompt/prompt_template.md
prompt/meta_template.md
prompt/chain_of_thought.md
.. _AdvancedGuides:
.. toctree::
:maxdepth: 1
:caption: Advanced Guides
advanced_guides/new_dataset.md
advanced_guides/custom_dataset.md
advanced_guides/new_model.md
advanced_guides/evaluation_lmdeploy.md
advanced_guides/accelerator_intro.md
advanced_guides/math_verify.md
advanced_guides/llm_judge.md
advanced_guides/code_eval.md
advanced_guides/code_eval_service.md
advanced_guides/subjective_evaluation.md
advanced_guides/persistence.md
.. _Tools:
.. toctree::
:maxdepth: 1
:caption: Tools
tools.md
.. _Dataset List:
.. toctree::
:maxdepth: 1
:caption: Dataset List
dataset_statistics.md
.. _Notes:
.. toctree::
:maxdepth: 1
:caption: Notes
notes/contribution_guide.md
notes/academic.md
Indexes & Tables
==================
* :ref:`genindex`
* :ref:`search`
# Guide to Reproducing CompassAcademic Leaderboard Results
To provide users with a quick and intuitive overview of the performance of mainstream open-source and commercial models on widely-used datasets, we maintain the [CompassAcademic Leaderboard](https://rank.opencompass.org.cn/leaderboard-llm-academic/?m=REALTIME) for LLMs on our official website, updating it typically every two weeks.
Given the continuous iteration of models and datasets, along with ongoing upgrades to OpenCompass, the configuration settings for the CompassAcademic leaderboard may evolve. Specifically, we adhere to the following update principles:
- Newly released models are promptly included, while models published six months to one year (or more) ago are removed from the leaderboard.
- New datasets are incorporated, while datasets nearing performance saturation are phased out.
- Existing evaluation results on the leaderboard are updated in sync with changes to the evaluation configuration.
To support rapid reproducibility, OpenCompass provides the real-time configuration files used in the academic leaderboard.
## CompassAcademic Leaderboard Reproduction
[eval_academic_leaderboard_REALTIME.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_academic_leaderboard_REALTIME.py) contains the configuration currently used for academic ranking evaluation. You can replicate the evaluation by following the steps below.
### 1: Model Configs
Firstly, modify the Model List code block in [eval_academic_leaderboard_REALTIME.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_academic_leaderboard_REALTIME.py) to include the model you wish to evaluate.
```python
# Models (add your models here)
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model
```
The original example imports an LMDeploy-based model configuration shipped with OpenCompass.
You can also build your new model configuration based on [this document](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/models.html).
An example of a configuration that calls the deployed service of Qwen3-235B-A22B based on OpenAISDK is as follows:
```python
from opencompass.models import OpenAISDK
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
qwen3_235b_a22b_model = dict(
abbr="qwen_3_235b_a22b_thinking", # Used to identify the model configuration
key="YOUR_SERVE_API_KEY",
openai_api_base="YOUR_SERVE_API_URL",
type=OpenAISDK, # The model configuration type; commonly used types include OpenAISDK, TurboMindModelwithChatTemplate, HuggingFacewithChatTemplate
path="Qwen/Qwen3-235B-A22B",
temperature=0.6,
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
),
query_per_second=1,
max_out_len=32000,
max_seq_len=32768,
batch_size=8,
retry=10,
extra_body={
'chat_template_kwargs': {'enable_thinking': True},
}, # Additional configurations of the model, such as the option in the Qwen3 series that controls whether thinking mode is enabled
pred_postprocessor=dict(type=extract_non_reasoning_content), # this pred_postprocessor extracts the non-reasoning content from models that output a think tag
)
models = [
qwen3_235b_a22b_model,
]
```
Here are the commonly used parameters for reference.
- `max_seq_len` = 65536 or 32768
- `max_out_len` = 64000 or 32000
- `temperature` = 0.6
- `top_p` = 0.95
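A hedged sketch of applying these values: this assumes `models` is the flat list assembled in the config, and note that where the sampling options live depends on the model type (top-level fields for API-style models such as `OpenAISDK`, backend-specific generation configs otherwise).
```python
# Apply the recommended context/output lengths to every model in the list
for model in models:
    model['max_seq_len'] = 32768
    model['max_out_len'] = 32000
```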
### 2: Verifier Configs
Complete your verifier model information in `judge_cfg`.
For detailed information about LLM verifiers, please refer to [this document](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/llm_judge.html).
At present, CompassAcademic uses [CompassVerifier-32B](https://huggingface.co/opencompass/CompassVerifier-32B). Here is a config example using OpenAISDK:
```python
judge_cfg = dict(
abbr='CompassVerifier',
type=OpenAISDK,
path='opencompass/CompassVerifier-32B',
key='YOUR_API_KEY',
openai_api_base='YOUR_API_BASE',
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]),
query_per_second=1,
batch_size=8,
temperature=0.001,
max_out_len=8192,
max_seq_len=32768,
mode='mid',
)
```
### 3: Execute evaluation
After completing the above configuration file, you can enter the following content in the CLI to start the evaluation:
```bash
opencompass examples/eval_academic_leaderboard_REALTIME.py
```
For more detailed CLI parameters, please refer to [this document](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/experimentation.html)
# Contributing to OpenCompass
- [Contributing to OpenCompass](#contributing-to-opencompass)
- [What is PR](#what-is-pr)
- [Basic Workflow](#basic-workflow)
- [Procedures in detail](#procedures-in-detail)
- [1. Get the most recent codebase](#1-get-the-most-recent-codebase)
- [2. Checkout a new branch from `main` branch](#2-checkout-a-new-branch-from-main-branch)
- [3. Commit your changes](#3-commit-your-changes)
- [4. Push your changes to the forked repository and create a PR](#4-push-your-changes-to-the-forked-repository-and-create-a-pr)
- [5. Discuss and review your code](#5-discuss-and-review-your-code)
- [6. Merge your branch to `main` branch and delete the branch](#6--merge-your-branch-to-main-branch-and-delete-the-branch)
- [Code style](#code-style)
- [Python](#python)
- [About Contributing Test Datasets](#about-contributing-test-datasets)
Thanks for your interest in contributing to OpenCompass! All kinds of contributions are welcome, including but not limited to the following.
- Fix typo or bugs
- Add documentation or translate the documentation into other languages
- Add new features and components
## What is PR
`PR` is the abbreviation of `Pull Request`. Here's the definition of `PR` in the [official document](https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) of Github.
```
Pull requests let you tell others about changes you have pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.
```
## Basic Workflow
1. Get the most recent codebase
2. Checkout a new branch from `main` branch.
3. Commit your changes ([Don't forget to use pre-commit hooks!](#3-commit-your-changes))
4. Push your changes and create a PR
5. Discuss and review your code
6. Merge your branch to `main` branch
## Procedures in detail
### 1. Get the most recent codebase
- When you work on your first PR
Fork the OpenCompass repository: click the **fork** button at the top right corner of Github page
![avatar](https://github.com/open-compass/opencompass/assets/22607038/851ed33d-02db-49c9-bf94-7c62eee89eb2)
Clone forked repository to local
```bash
git clone git@github.com:XXX/opencompass.git
```
Add source repository to upstream
```bash
git remote add upstream git@github.com:InternLM/opencompass.git
```
- After your first PR
Checkout the latest branch of the local repository and pull the latest branch of the source repository.
```bash
git checkout main
git pull upstream main
```
### 2. Checkout a new branch from `main` branch
```bash
git checkout main -b branchname
```
### 3. Commit your changes
- If you are a first-time contributor, please install and initialize pre-commit hooks from the repository root directory first.
```bash
pip install -U pre-commit
pre-commit install
```
- Commit your changes as usual. Pre-commit hooks will be triggered to stylize your code before each commit.
```bash
# coding
git add [files]
git commit -m 'messages'
```
```{note}
Sometimes your code may be changed by pre-commit hooks. In this case, please remember to re-stage the modified files and commit again.
```
### 4. Push your changes to the forked repository and create a PR
- Push the branch to your forked remote repository
```bash
git push origin branchname
```
- Create a PR
![avatar](https://github.com/open-compass/opencompass/assets/22607038/08feb221-b145-4ea8-8e20-05f143081604)
- Revise PR message template to describe your motivation and modifications made in this PR. You can also link the related issue to the PR manually in the PR message (For more information, checkout the [official guidance](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue)).
- You can also ask a specific person to review the changes you've proposed.
### 5. Discuss and review your code
- Modify your codes according to reviewers' suggestions and then push your changes.
### 6. Merge your branch to `main` branch and delete the branch
- After the PR is merged by the maintainer, you can delete the branch you created in your forked repository.
```bash
git branch -d branchname # delete local branch
git push origin --delete branchname # delete remote branch
```
## Code style
### Python
We adopt [PEP8](https://www.python.org/dev/peps/pep-0008/) as the preferred code style.
We use the following tools for linting and formatting:
- [flake8](https://github.com/PyCQA/flake8): A wrapper around some linter tools.
- [isort](https://github.com/timothycrosley/isort): A Python utility to sort imports.
- [yapf](https://github.com/google/yapf): A formatter for Python files.
- [codespell](https://github.com/codespell-project/codespell): A Python utility to fix common misspellings in text files.
- [mdformat](https://github.com/executablebooks/mdformat): Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
- [docformatter](https://github.com/myint/docformatter): A formatter to format docstring.
Style configurations of yapf and isort can be found in [setup.cfg](https://github.com/open-compass/opencompass/blob/main/setup.cfg).
## About Contributing Test Datasets
- Submitting Test Datasets
- Please implement logic for automatic dataset downloading in the code; or provide a method for obtaining the dataset in the PR. The OpenCompass maintainers will follow up accordingly. If the dataset is not yet public, please indicate so.
- Submitting Data Configuration Files
- Provide a README in the same directory as the data configuration. The README should include, but is not limited to:
- A brief description of the dataset
- The official link to the dataset
- Some test examples from the dataset
- Evaluation results of the dataset on relevant models
- Citation of the dataset
- (Optional) Summarizer of the dataset
- (Optional) If the testing process cannot be achieved simply by concatenating the dataset and model configuration files, a configuration file for conducting the test is also required.
- (Optional) If necessary, please add a description of the dataset in the relevant documentation sections. This is very necessary to help users understand the testing scheme. You can refer to the following types of documents in OpenCompass:
- [Circular Evaluation](../advanced_guides/circular_eval.md)
- [Code Evaluation](../advanced_guides/code_eval.md)
- [Contamination Assessment](../advanced_guides/contamination_eval.md)
# News
- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
- **\[2024.04.30\]** We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on an [external corpus](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Check out the [llm-compression](configs/eval_llm_compression.py) evaluation config now! 🔥🔥🔥
- **\[2024.04.29\]** We report the performance of several famous LLMs on the common benchmarks, welcome to [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥.
- **\[2024.04.26\]** We deprecated the multi-modality evaluation function in OpenCompass; the related implementation has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), welcome to use it! 🔥🔥🔥.
- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py) welcome to try!🔥🔥🔥.
- **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
- **\[2024.02.29\]** We supported MT-Bench, AlpacaEval and AlignBench, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html)
- **\[2024.01.30\]** We release OpenCompass 2.0. Click [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information !
- **\[2024.01.17\]** We supported the evaluation of [InternLM2](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_keyset.py) and [InternLM2-Chat](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_chat_keyset.py), InternLM2 showed extremely strong performance in these tests, welcome to try!
- **\[2024.01.17\]** We supported the needle in a haystack test with multiple needles, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html#id8).
- **\[2023.12.28\]** We have enabled seamless evaluation of all models developed using [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory), a powerful toolkit for comprehensive LLM development.
- **\[2023.12.22\]** We have released [T-Eval](https://github.com/open-compass/T-Eval), a step-by-step evaluation benchmark to gauge your LLMs on tool utilization. Welcome to our [Leaderboard](https://open-compass.github.io/T-Eval/leaderboard.html) for more details!
- **\[2023.12.10\]** We have released [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), a toolkit for evaluating vision-language models (VLMs), currently support 20+ VLMs and 7 multi-modal benchmarks (including MMBench series).
- **\[2023.12.10\]** We have supported Mistral AI's MoE LLM: **Mixtral-8x7B-32K**. Welcome to [MixtralKit](https://github.com/open-compass/MixtralKit) for more details about inference and evaluation.
- **\[2023.11.22\]** We have supported many API-based models, including **Baidu, ByteDance, Huawei, 360**. Welcome to the [Models](https://opencompass.readthedocs.io/en/latest/user_guides/models.html) section for more details.
- **\[2023.11.20\]** Thanks [helloyongyang](https://github.com/helloyongyang) for supporting the evaluation with [LightLLM](https://github.com/ModelTC/lightllm) as the backend. Welcome to [Evaluation With LightLLM](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lightllm.html) for more details.
- **\[2023.11.13\]** We are delighted to announce the release of OpenCompass v0.1.8. This version enables local loading of evaluation benchmarks, thereby eliminating the need for an internet connection. Please note that with this update, **you must re-download all evaluation datasets** to ensure accurate and up-to-date results.
- **\[2023.11.06\]** We have supported several API-based models, including **ChatGLM Pro@Zhipu, ABAB-Chat@MiniMax and Xunfei**. Welcome to the [Models](https://opencompass.readthedocs.io/en/latest/user_guides/models.html) section for more details.
- **\[2023.10.24\]** We release a new benchmark for evaluating LLMs’ capabilities of having multi-turn dialogues. Welcome to [BotChat](https://github.com/open-compass/BotChat) for more details.
- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM), welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md).
- **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.06\]** [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass.
- **\[2023.08.25\]** [**TigerBot**](https://github.com/TigerResearch/TigerBot) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.08.21\]** [**Lagent**](https://github.com/InternLM/lagent) has been released, which is a lightweight framework for building LLM-based agents. We are working with Lagent team to support the evaluation of general tool-use capability, stay tuned!
- **\[2023.08.18\]** We have supported evaluation for **multi-modality learning**, including **MMBench, SEED-Bench, COCO-Caption, Flickr-30K, OCR-VQA, ScienceQA** and so on. Leaderboard is on the road. Feel free to try multi-modality evaluation with OpenCompass!
- **\[2023.08.18\]** [Dataset card](https://opencompass.org.cn/dataset-detail/MMLU) is now online. New evaluation benchmarks are welcome to join OpenCompass!
- **\[2023.08.11\]** [Model comparison](https://opencompass.org.cn/model-compare/GPT-4,ChatGPT,LLaMA-2-70B,LLaMA-65B) is now online. We hope this feature offers deeper insights!
- **\[2023.08.11\]** We have supported [LEval](https://github.com/OpenLMLab/LEval).
- **\[2023.08.10\]** OpenCompass is compatible with [LMDeploy](https://github.com/InternLM/lmdeploy). Now you can follow this [instruction](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lmdeploy.html#) to evaluate the accelerated models provided by **TurboMind**.
- **\[2023.08.10\]** We have supported [Qwen-7B](https://github.com/QwenLM/Qwen-7B) and [XVERSE-13B](https://github.com/xverse-ai/XVERSE-13B) ! Go to our [leaderboard](https://opencompass.org.cn/leaderboard-llm) for more results! More models are welcome to join OpenCompass.
- **\[2023.08.09\]** Several new datasets (**CMMLU, TydiQA, SQuAD2.0, DROP**) are updated on our [leaderboard](https://opencompass.org.cn/leaderboard-llm)! More datasets are welcome to join OpenCompass.
- **\[2023.08.07\]** We have added a [script](tools/eval_mmbench.py) for users to evaluate the inference results of [MMBench](https://opencompass.org.cn/MMBench)-dev.
- **\[2023.08.05\]** We have supported [GPT-4](https://openai.com/gpt-4)! Go to our [leaderboard](https://opencompass.org.cn/leaderboard-llm) for more results! More models are welcome to join OpenCompass.
- **\[2023.07.27\]** We have supported [CMMLU](https://github.com/haonan-li/CMMLU)! More datasets are welcome to join OpenCompass.
# Chain of Thought
## Background
During reasoning, the CoT (Chain of Thought) method is an effective way to help LLMs deal with complex questions, for example math problems and relational inference. OpenCompass supports multiple types of CoT methods.
![image](https://github.com/open-compass/opencompass/assets/28834990/45d60e0e-02a1-49aa-b792-40a1f95f9b9e)
## 1. Zero Shot CoT
You can change the `PromptTemplate` of the dataset config by simply adding *Let's think step by step* to realize a Zero-Shot CoT prompt for your evaluation:
```python
qa_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template="Answer the question:\nQ: {question}?\nLet's think step by step:\n"
),
retriever=dict(type=ZeroRetriever)
)
```
## 2. Few Shot CoT
Few-shot CoT helps LLMs follow your instructions and produce better answers. For few-shot CoT, add your CoT template to `PromptTemplate` as in the following config to create a one-shot prompt:
```python
qa_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=
'''Question: Mark's basketball team scores 25 2 pointers, 8 3 pointers and 10 free throws. Their opponents score double the 2 pointers but half the 3 pointers and free throws. What's the total number of points scored by both teams added together?
Let's think step by step
Answer:
Mark's team scores 25 2 pointers, meaning they scored 25*2= 50 points in 2 pointers.
His team also scores 8 3 pointers, meaning they scored 8*3= 24 points in 3 pointers
They scored 10 free throws, and free throws count as one point so they scored 10*1=10 points in free throws.
All together his team scored 50+24+10= 84 points
Mark's opponents scored double his team's number of 2 pointers, meaning they scored 50*2=100 points in 2 pointers.
His opponents scored half his team's number of 3 pointers, meaning they scored 24/2= 12 points in 3 pointers.
They also scored half Mark's team's points in free throws, meaning they scored 10/2=5 points in free throws.
All together Mark's opponents scored 100+12+5=117 points
The total score for the game is both team's scores added together, so it is 84+117=201 points
The answer is 201
Question: {question}\nLet's think step by step:\n{answer}
'''),
retriever=dict(type=ZeroRetriever)
)
```
## 3. Self-Consistency
The SC (Self-Consistency) method was proposed in [this paper](https://arxiv.org/abs/2203.11171). It samples multiple reasoning paths for a question and takes a majority vote over the generated answers. This method achieves high accuracy on reasoning tasks, but may consume more time and resources during inference because of the multi-path sampling and voting strategy. In OpenCompass, you can easily implement the SC method by replacing `GenInferencer` with `SCInferencer` in the dataset configuration and setting the corresponding parameters, for example:
```python
# This SC gsm8k config can be found at: opencompass.configs.datasets.gsm8k.gsm8k_gen_a3e34a.py
gsm8k_infer_cfg = dict(
inferencer=dict(
type=SCInferencer, # Replace GenInferencer with SCInferencer.
generation_kwargs=dict(do_sample=True, temperature=0.7, top_k=40), # Set sampling parameters to make sure the model generates diverse outputs; currently this only works for models loaded from HuggingFace.
infer_type='SC',
sc_size = SAMPLE_SIZE
)
)
gsm8k_eval_cfg = dict(sc_size=SAMPLE_SIZE)
```
```{note}
By default, OpenCompass uses argmax (greedy) decoding for the next token. Therefore, if the sampling parameters are not specified, the model's outputs will be identical across runs, and sampling multiple reasoning paths will be ineffective.
```
Here, `SAMPLE_SIZE` is the number of reasoning paths in Self-Consistency; a higher value usually yields better performance. The following figure from the original SC paper demonstrates the relation between the number of reasoning paths and performance on several reasoning tasks:
![image](https://github.com/open-compass/opencompass/assets/28834990/05c7d850-7076-43ca-b165-e6251f9b3001)
From the figure, it can be seen that in different reasoning tasks, performance tends to improve as the number of reasoning paths increases. However, for some tasks, increasing the number of reasoning paths may reach a limit, and further increasing the number of paths may not bring significant performance improvement. Therefore, it is necessary to conduct experiments and adjustments on specific tasks to find the optimal number of reasoning paths that best suit the task.
## 4. Tree-of-Thoughts
In contrast to the conventional CoT approach that considers only a single reasoning path, Tree-of-Thoughts (ToT) allows the language model to explore multiple diverse reasoning paths simultaneously. The model evaluates the reasoning process through self-assessment and makes global choices by conducting lookahead or backtracking when necessary. Specifically, this process is divided into the following four stages:
**1. Thought Decomposition**
Based on the nature of the problem, break down the problem into multiple intermediate steps. Each step can be a phrase, equation, or writing plan, depending on the nature of the problem.
**2. Thought Generation**
Assuming that solving the problem requires k steps, there are two methods to generate reasoning content:
- Independent sampling: For each state, the model independently extracts k reasoning contents from the CoT prompts, without relying on other reasoning contents.
- Sequential generation: Sequentially use "prompts" to guide the generation of reasoning content, where each reasoning content may depend on the previous one.
**3. Heuristic Evaluation**
Use heuristic methods to evaluate the contribution of each generated reasoning content to problem-solving. This self-evaluation is based on the model's self-feedback and involves designing prompts to have the model score multiple generated results.
**4. Search Algorithm Selection**
Based on the methods of generating and evaluating reasoning content, select an appropriate search algorithm. For example, you can use breadth-first search (BFS) or depth-first search (DFS) algorithms to systematically explore the thought tree, conducting lookahead and backtracking.
In OpenCompass, ToT parameters need to be set according to the requirements. Below is an example configuration for the 24-Point game from the [official paper](https://arxiv.org/pdf/2305.10601.pdf). Currently, ToT inference is supported only with Huggingface models:
```python
# This ToT Game24 config can be found at: opencompass/configs/datasets/game24/game24_gen_8dfde3.py.
from opencompass.datasets import (Game24Dataset, game24_postprocess,
Game24Evaluator, Game24PromptWrapper)
generation_kwargs = dict(temperature=0.7)
game24_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template='{input}'), # Directly pass the input content, as the Prompt needs to be specified in steps
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=ToTInferencer, # Replace GenInferencer with ToTInferencer
generation_kwargs=generation_kwargs,
method_generate='propose', # Method for generating reasoning content, can be independent sampling (sample) or sequential generation (propose)
method_evaluate='value', # Method for evaluating reasoning content, can be voting (vote) or scoring (value)
method_select='greedy', # Method for selecting reasoning content, can be greedy (greedy) or random (sample)
n_evaluate_sample=3,
n_select_sample=5,
task_wrapper=dict(type=Game24PromptWrapper) # This Wrapper class includes the prompts for each step and methods for generating and evaluating reasoning content, needs customization according to the task
))
```
If you want to use the ToT method on a custom dataset, you'll need to make additional configurations in the `opencompass.datasets.YourDataConfig.py` file to set up the `YourDataPromptWrapper` class. This is required for handling the thought generation and heuristic evaluation step within the ToT framework. For reasoning tasks similar to the game 24-Point, you can refer to the implementation in `opencompass/datasets/game24.py` for guidance.
# Meta Template
## Background
During the Supervised Fine-Tuning (SFT) of large language models (LLMs), we often inject predefined strings into the conversation according to actual requirements, in order to prompt the model to output content following certain guidelines. For example, in some `chat` model fine-tuning, we may add system-level instructions at the beginning of each dialogue, and establish a format to represent the conversation between the user and the model. In a conversation, the model may expect the text format to be as follows:
```bash
Meta instruction: You are now a helpful and harmless AI assistant.
HUMAN: Hi!<eoh>\n
Bot: Hello! How may I assist you?<eob>\n
```
During evaluation, we also need to enter questions according to the agreed format for the model to perform its best.
In addition, similar situations exist in API models. General API dialogue models allow users to pass in historical dialogues when calling, and some models also allow the input of SYSTEM level instructions. To better evaluate the ability of API models, we hope to make the data as close as possible to the multi-round dialogue template of the API model itself during the evaluation, rather than stuffing all the content into an instruction.
Therefore, we need to specify different parsing templates for different models. In OpenCompass, we call this set of parsing templates **Meta Template**. Meta Template is tied to the model's configuration and is combined with the dialogue template of the dataset during runtime to ultimately generate the most suitable prompt for the current model.
```python
# When specifying, just pass the meta_template field into the model
models = [
dict(
type='AnyModel',
meta_template = ..., # meta template
)
]
```
Next, we will introduce how to configure Meta Template on two types of models.
You are recommended to read [here](./prompt_template.md#dialogue-prompt) for the basic syntax of the dialogue template before reading this chapter.
```{note}
In some cases (such as testing a base model), we don't need to inject any instructions into the normal dialogue, in which case we can leave the meta template empty. In this case, the prompt received by the model is defined only by the dataset configuration and is a regular string. If the dataset configuration uses a dialogue template, speeches from different roles will be concatenated with \n.
```
## Application on Language Models
The following figure shows several situations where the data is built into a prompt through the prompt template and meta template from the dataset in the case of 2-shot learning. Readers can use this figure as a reference to help understand the following sections.
![](https://user-images.githubusercontent.com/22607038/251195073-85808807-6359-44df-8a19-9f5d00c591ec.png)
We will explain how to define the meta template with several examples.
Suppose that according to the dialogue template of the dataset, the following dialogue was produced:
```python
PromptList([
dict(role='HUMAN', prompt='1+1=?'),
dict(role='BOT', prompt='2'),
dict(role='HUMAN', prompt='2+2=?'),
dict(role='BOT', prompt='4'),
])
```
We want to pass this dialogue to a model that has already gone through SFT. By the model's convention, each role's utterance begins with `<Role Name>: ` and ends with a special token followed by \\n. Here is the complete string the model expects to receive:
```Plain
<HUMAN>: 1+1=?<eoh>
<BOT>: 2<eob>
<HUMAN>: 2+2=?<eoh>
<BOT>: 4<eob>
```
In the meta template, we only need to abstract the format of each round of dialogue into the following configuration:
```python
# model meta template
meta_template = dict(
round=[
dict(role='HUMAN', begin='<HUMAN>: ', end='<eoh>\n'),
dict(role='BOT', begin='<BOT>: ', end='<eob>\n'),
],
)
```
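To make the mechanics concrete, here is a simplified, self-contained sketch (not OpenCompass's actual implementation) of how the `round` entries of this meta template wrap each utterance of the `PromptList` above into the final string:
```python
# Simplified sketch: wrap each utterance with the begin/end of its role,
# as defined in the meta template's `round`.
dialogue = [
    dict(role='HUMAN', prompt='1+1=?'),
    dict(role='BOT', prompt='2'),
    dict(role='HUMAN', prompt='2+2=?'),
    dict(role='BOT', prompt='4'),
]

meta_template = dict(
    round=[
        dict(role='HUMAN', begin='<HUMAN>: ', end='<eoh>\n'),
        dict(role='BOT', begin='<BOT>: ', end='<eob>\n'),
    ],
)

role_cfgs = {item['role']: item for item in meta_template['round']}
prompt = ''.join(
    role_cfgs[turn['role']]['begin'] + turn['prompt'] + role_cfgs[turn['role']]['end']
    for turn in dialogue
)
print(prompt)
# <HUMAN>: 1+1=?<eoh>
# <BOT>: 2<eob>
# <HUMAN>: 2+2=?<eoh>
# <BOT>: 4<eob>
```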
______________________________________________________________________
Some datasets may introduce SYSTEM-level roles:
```python
PromptList([
dict(role='SYSTEM', fallback_role='HUMAN', prompt='Solve the following math questions'),
dict(role='HUMAN', prompt='1+1=?'),
dict(role='BOT', prompt='2'),
dict(role='HUMAN', prompt='2+2=?'),
dict(role='BOT', prompt='4'),
])
```
Assuming the model also accepts the SYSTEM role, and expects the input to be:
```
<SYSTEM>: Solve the following math questions<eosys>\n
<HUMAN>: 1+1=?<eoh>\n
<BOT>: 2<eob>\n
<HUMAN>: 2+2=?<eoh>\n
<BOT>: 4<eob>\n
```
We can put the definition of the SYSTEM role into `reserved_roles`. Roles in `reserved_roles` do not appear in regular rounds of conversation, but the dataset configuration's dialogue template can still reference them in its `begin` or `end`.
```python
# model meta template
meta_template = dict(
round=[
dict(role='HUMAN', begin='<HUMAN>: ', end='<eoh>\n'),
dict(role='BOT', begin='<BOT>: ', end='<eob>\n'),
],
reserved_roles=[dict(role='SYSTEM', begin='<SYSTEM>: ', end='<eosys>\n'),],
)
```
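As an illustration of how a dataset-side dialogue template might reference this reserved role, here is a hypothetical prompt template (the `{problem}` / `{solution}` fields are placeholders, not from any specific dataset):
```python
# Hypothetical dataset-side dialogue template (for illustration only):
# the reserved SYSTEM role is referenced in `begin`, even though it never
# appears in the meta template's `round`.
prompt_template = dict(
    begin=[
        dict(role='SYSTEM', fallback_role='HUMAN',
             prompt='Solve the following math questions'),
    ],
    round=[
        dict(role='HUMAN', prompt='{problem}'),
        dict(role='BOT', prompt='{solution}'),
    ],
)
```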
If the model does not accept the SYSTEM role, this item does not need to be configured, and evaluation will still run normally. In that case, the string received by the model becomes:
```
<HUMAN>: Solve the following math questions<eoh>\n
<HUMAN>: 1+1=?<eoh>\n
<BOT>: 2<eob>\n
<HUMAN>: 2+2=?<eoh>\n
<BOT>: 4<eob>\n
```
This is because in OpenCompass's predefined datasets, each `SYSTEM` utterance has `fallback_role='HUMAN'`; that is, if the `SYSTEM` role does not exist in the meta template, the speaker is switched to the `HUMAN` role.
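The fallback logic is roughly as follows (a minimal sketch, not OpenCompass internals):
```python
# Minimal sketch of role fallback: if an utterance's role is not defined in the
# meta template, its fallback_role is used instead.
known_roles = {'HUMAN', 'BOT'}  # roles defined in the meta template's `round`

speech = dict(role='SYSTEM', fallback_role='HUMAN',
              prompt='Solve the following math questions')

effective_role = (speech['role'] if speech['role'] in known_roles
                  else speech['fallback_role'])
print(effective_role)  # HUMAN -> rendered as "<HUMAN>: Solve the following math questions<eoh>\n"
```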
______________________________________________________________________
Some models may also need other strings embedded at the beginning or end of the conversation, such as system-level instructions:
```
Meta instruction: You are now a helpful and harmless AI assistant.
<SYSTEM>: Solve the following math questions<eosys>\n
<HUMAN>: 1+1=?<eoh>\n
<BOT>: 2<eob>\n
<HUMAN>: 2+2=?<eoh>\n
<BOT>: 4<eob>\n
end of conversation
```
In this case, we can provide these strings via the meta template's `begin` and `end` parameters.
```python
meta_template = dict(
round=[
dict(role='HUMAN', begin='<HUMAN>: ', end='<eoh>\n'),
dict(role='BOT', begin='<BOT>: ', end='<eob>\n'),
],
reserved_roles=[dict(role='SYSTEM', begin='<SYSTEM>: ', end='<eosys>\n'),],
begin="Meta instruction: You are now a helpful and harmless AI assistant.",
end="end of conversation",
)
```
______________________________________________________________________
In **generative** task evaluation, we do not feed the answer to the model directly. Instead, the prompt is truncated: the preceding text is kept, and the answer is left blank for the model to generate.
```
Meta instruction: You are now a helpful and harmless AI assistant.
<SYSTEM>: Solve the following math questions<eosys>\n
<HUMAN>: 1+1=?<eoh>\n
<BOT>: 2<eob>\n
<HUMAN>: 2+2=?<eoh>\n
<BOT>:
```
We only need to set the `generate` field in BOT's configuration to True, and OpenCompass will automatically leave the last utterance of BOT blank:
```python
# model meta template
meta_template = dict(
round=[
dict(role='HUMAN', begin='<HUMAN>: ', end='<eoh>\n'),
dict(role='BOT', begin='<BOT>: ', end='<eob>\n', generate=True),
],
reserved_roles=[dict(role='SYSTEM', begin='<SYSTEM>: ', end='<eosys>\n'),],
begin="Meta instruction: You are now a helpful and harmless AI assistant.",
end="end of conversation",
)
```
Note that `generate` only affects generative inference. When performing discriminative inference, the prompt received by the model is still complete.
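For clarity, here is a simplified sketch (not OpenCompass internals) of what this truncation looks like for a single generative sample:
```python
# Simplified sketch: with generate=True on BOT, the final BOT turn is rendered
# only up to its `begin`, leaving the answer to be produced by the model.
dialogue = [
    dict(role='HUMAN', prompt='2+2=?'),
    dict(role='BOT', prompt='4'),  # reference answer; withheld at inference time
]

parts = []
for i, turn in enumerate(dialogue):
    is_last = i == len(dialogue) - 1
    if turn['role'] == 'BOT' and is_last:
        parts.append('<BOT>: ')  # stop right after `begin`; the model fills in the rest
    elif turn['role'] == 'BOT':
        parts.append('<BOT>: ' + turn['prompt'] + '<eob>\n')
    else:
        parts.append('<HUMAN>: ' + turn['prompt'] + '<eoh>\n')

print(''.join(parts))
# <HUMAN>: 2+2=?<eoh>
# <BOT>:
```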
### Full Definition
```python
models = [
dict(meta_template = dict(
begin="Meta instruction: You are now a helpful and harmless AI assistant.",
round=[
dict(role='HUMAN', begin='HUMAN: ', end='<eoh>\n'), # begin and end can be a list of strings or integers.
dict(role='THOUGHTS', begin='THOUGHTS: ', end='<eot>\n', prompt='None'), # Here we can set the default prompt, which may be overridden by the specific dataset
dict(role='BOT', begin='BOT: ', generate=True, end='<eob>\n'),
],
end="end of conversion",
reserved_roles=[dict(role='SYSTEM', begin='SYSTEM: ', end='\n'),],
eos_token_id=10000,
),
)
]
```
The `meta_template` is a dictionary that can contain the following fields:
- `begin`, `end`: (str, optional) The beginning and ending of the prompt, typically some system-level instructions.
- `round`: (list) The template format of each round of dialogue. The content of the prompt for each round of dialogue is controlled by the dialogue template configured in the dataset.
- `reserved_roles`: (list, optional) Specify roles that do not appear in `round` but may be used in the dataset configuration, such as the `SYSTEM` role.
- `eos_token_id` (int, optional): The ID of the model's eos token. If not set, it defaults to the eos token id of the tokenizer. Its main role is to trim the model's output in generative tasks, so it should generally be set to the id of the first token of the `end` string of the item with `generate=True`.
The `round` of the `meta_template` specifies the format of each role's speech in a round of dialogue. It accepts a list of dictionaries, each dictionary's keys are as follows:
- `role` (str): The name of the role participating in the dialogue. It is only used to match roles between the meta template and the dataset configuration and does not appear in the actual prompt.
- `begin`, `end` (str): Specifies the fixed beginning or end when this role speaks.
- `prompt` (str): The role's prompt. It is allowed to leave it blank in the meta template, but in this case, it must be specified in the prompt of the dataset configuration.
- `generate` (bool): When specified as True, this role is the one the model plays. In generation tasks, the prompt received by the model will be cut off at the `begin` of this role, and the remaining content will be filled by the model.
## Application to API Models
The meta template for API models is similar to that of regular models, but its configuration is simpler. Depending on their needs, users can directly use one of the two configurations below to evaluate an API model in a multi-turn dialogue manner:
```python
# If the API model does not support system instructions
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True)
],
)
# If the API model supports system instructions
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True)
],
reserved_roles=[
dict(role='SYSTEM', api_role='SYSTEM'),
],
)
```
### Principle
Even though different API models accept different data structures, there are commonalities overall. Interfaces that accept dialogue history generally allow users to pass in prompts from the following three roles:
- User
- Robot
- System (optional)
In this regard, OpenCompass presets three `api_role` values for API models: `HUMAN`, `BOT`, and `SYSTEM`, and stipulates that, in addition to regular strings, API models also accept an intermediate dialogue format represented by a `PromptList`. The API model repackages this dialogue in a multi-turn format and sends it to the backend. To activate this feature, however, users need to map the `role`s in the dataset's prompt template to the corresponding `api_role`s in the meta template above. The following figure illustrates the relationship between the input accepted by API models, the prompt template, and the meta template.
![](https://user-images.githubusercontent.com/22607038/251195872-63aa7d30-045a-4837-84b5-11b09f07fb18.png)
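For a concrete picture of this repackaging, here is an illustrative sketch of how a `PromptList`-style dialogue might be converted into a typical chat-API payload once roles are mapped to `api_role` values. The exact request format depends on each API backend, and the role-name mapping below is an assumption for illustration, not OpenCompass code:
```python
# Illustrative sketch only: repackaging a mapped dialogue into an
# OpenAI-style `messages` list. The mapping and payload shape are assumptions.
API_ROLE_MAP = {'HUMAN': 'user', 'BOT': 'assistant', 'SYSTEM': 'system'}

dialogue = [
    dict(api_role='SYSTEM', prompt='Solve the following math questions'),
    dict(api_role='HUMAN', prompt='1+1=?'),
    dict(api_role='BOT', prompt='2'),
    dict(api_role='HUMAN', prompt='2+2=?'),
]

messages = [{'role': API_ROLE_MAP[turn['api_role']], 'content': turn['prompt']}
            for turn in dialogue]
# `messages` can now be sent to the backend as a multi-turn conversation.
```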
## Debugging
If you need to debug the prompt, it is recommended to use the `tools/prompt_viewer.py` script to preview the actual prompt received by the model after preparing the configuration file. Read [here](../tools.md#prompt-viewer) for more.
# Prompt Overview
The prompt is the input to a large language model (LLM), used to guide the model to generate text or to compute perplexity (PPL). The choice of prompt can significantly affect the accuracy of the evaluated model. The process of converting a dataset into a series of prompts is defined by templates.
In OpenCompass, we split the template into two parts: the data-side template and the model-side template. When evaluating a model, the data will pass through both the data-side template and the model-side template, ultimately transforming into the input required by the model.
The data-side template is referred to as [prompt_template](./prompt_template.md), which represents the process of converting the fields in the dataset into prompts.
The model-side template is referred to as [meta_template](./meta_template.md), which represents how the model transforms these prompts into its expected input.
We also offer some prompting examples regarding [Chain of Thought](./chain_of_thought.md).