<div align="center">
<img src="docs/en/_static/image/logo.svg" width="500px"/>
<br />
<br />
[![][github-release-shield]][github-release-link]
[![][github-releasedate-shield]][github-releasedate-link]
[![][github-contributors-shield]][github-contributors-link]<br>
[![][github-forks-shield]][github-forks-link]
[![][github-stars-shield]][github-stars-link]
[![][github-issues-shield]][github-issues-link]
[![][github-license-shield]][github-license-link]
<!-- [![PyPI](https://badge.fury.io/py/opencompass.svg)](https://pypi.org/project/opencompass/) -->
[🌐Website](https://opencompass.org.cn/) |
[📖CompassHub](https://hub.opencompass.org.cn/home) |
[📊CompassRank](https://rank.opencompass.org.cn/home) |
[📘Documentation](https://opencompass.readthedocs.io/en/latest/) |
[🛠️Installation](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) |
[🤔Reporting Issues](https://github.com/open-compass/opencompass/issues/new/choose)
English | [简体中文](README_zh-CN.md)
[![][github-trending-shield]][github-trending-url]
</div>
<p align="center">
👋 join us on <a href="https://discord.gg/KKwfEbFj7U" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=opencompass" target="_blank">WeChat</a>
</p>
> \[!IMPORTANT\]
>
> **Star Us**, You will receive all release notifications from GitHub without any delay ~ ⭐️
<details>
<summary><kbd>Star History</kbd></summary>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&theme=dark&type=Date">
<img width="100%" src="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&type=Date">
</picture>
</details>
## 🧭 Welcome to **OpenCompass**!
Just like a compass guides us on our journey, OpenCompass will guide you through the complex landscape of evaluating large language models. With its powerful algorithms and intuitive interface, OpenCompass makes it easy to assess the quality and effectiveness of your NLP models.
🚩🚩🚩 Explore opportunities at OpenCompass! We're currently **hiring full-time researchers/engineers and interns**. If you're passionate about LLM and OpenCompass, don't hesitate to reach out to us via [email](mailto:zhangsongyang@pjlab.org.cn). We'd love to hear from you!
🔥🔥🔥 We are delighted to announce that **OpenCompass has been recommended by Meta AI**. Click [Get Started](https://ai.meta.com/llama/get-started/#validation) of Llama for more information.
> **Attention**<br />
> Breaking Change Notice: In version 0.4.0, we are consolidating all AMOTIC configuration files (previously located in ./configs/datasets, ./configs/models, and ./configs/summarizers) into the opencompass package. Users are advised to update their configuration references to reflect this structural change.
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
- **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
- **\[2024.11.14\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [BABILong](https://arxiv.org/pdf/2406.10149). Have a look at the [demo](examples/eval_babilong.py) and give it a try! 🔥🔥🔥
- **\[2024.10.14\]** We now support the OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU). Feel free to give it a try! 🔥🔥🔥
- **\[2024.09.19\]** We now support [Qwen2.5](https://huggingface.co/Qwen) (0.5B to 72B) with multiple backends (HuggingFace/vLLM/LMDeploy). Feel free to give them a try! 🔥🔥🔥
- **\[2024.09.17\]** We now support OpenAI o1(`o1-mini-2024-09-12` and `o1-preview-2024-09-12`). Feel free to give them a try! 🔥🔥🔥
- **\[2024.09.05\]** We now support answer extraction through model post-processing to provide a more accurate representation of the model's capabilities. As part of this update, we have integrated [XFinder](https://github.com/IAAR-Shanghai/xFinder) as our first post-processing model. For more detailed information, please refer to the [documentation](opencompass/utils/postprocessors/xfinder/README.md), and give it a try! 🔥🔥🔥
- **\[2024.08.20\]** OpenCompass now supports the [SciCode](https://github.com/scicode-bench/SciCode): A Research Coding Benchmark Curated by Scientists. 🔥🔥🔥
- **\[2024.08.16\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [RULER](https://arxiv.org/pdf/2404.06654). RULER provides an evaluation of long-context including retrieval, multi-hop tracing, aggregation, and question answering through flexible configurations. Check out the [RULER](configs/datasets/ruler/README.md) evaluation config now! 🔥🔥🔥
- **\[2024.08.09\]** We have released the example data and configuration for the CompassBench-202408, welcome to [CompassBench](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/compassbench_intro.html) for more details. 🔥🔥🔥
- **\[2024.08.01\]** We supported the [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) models. Welcome to try! 🔥🔥🔥
- **\[2024.07.23\]** We now support [ModelScope](www.modelscope.cn) datasets; you can load them on demand without downloading all the data to your local disk. Welcome to try it! 🔥🔥🔥
- **\[2024.07.17\]** We are excited to announce the release of NeedleBench's [technical report](http://arxiv.org/abs/2407.11963). We invite you to visit our [support documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html) for detailed evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which has **outstanding reasoning capability**, a **1M context window**, and **stronger tool use**. You can try the models in [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports the popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch and through deployment APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md). 🔥🔥🔥
> [More](docs/en/notes/news.md)
## 📊 Leaderboard
We provide [OpenCompass Leaderboard](https://rank.opencompass.org.cn/home) for the community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address `opencompass@pjlab.org.cn`.
You can also refer to [CompassAcademic](configs/eval_academic_leaderboard_202412.py) to quickly reproduce the leaderboard results. The currently selected datasets include Knowledge Reasoning (MMLU-Pro/GPQA Diamond), Logical Reasoning (BBH), Mathematical Reasoning (MATH-500, AIME), Code Generation (LiveCodeBench, HumanEval), and Instruction Following (IFEval).
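As a sketch, the evaluation script referenced above can be launched directly with the `opencompass` entry point (the path below follows the news entry in this README; adjust it to your checkout):

```bash
# Reproduce the CompassAcademic leaderboard configuration
opencompass examples/eval_academic_leaderboard_202412.py
```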
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🛠️ Installation
Below are the steps for quick installation and datasets preparation.
### 💻 Environment Setup
We highly recommend using conda to manage your python environment.
- #### Create your virtual environment
```bash
conda create --name opencompass python=3.10 -y
conda activate opencompass
```
- #### Install OpenCompass via pip
```bash
pip install -U opencompass
## Full installation (with support for more datasets)
# pip install "opencompass[full]"
## Environment with model acceleration frameworks
## Manage different acceleration frameworks using virtual environments
## since they usually have dependency conflicts with each other.
# pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"
## API evaluation (e.g. OpenAI, Qwen)
# pip install "opencompass[api]"
```
- #### Install OpenCompass from source
If you want to use OpenCompass's latest features, or to develop new features, you can also build it from source:
```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# pip install -e ".[full]"
# pip install -e ".[vllm]"
```
### 📂 Data Preparation
You can choose one of the following methods to prepare datasets.
#### Offline Preparation
You can download and extract the datasets with the following commands:
```bash
# Download dataset to data/ folder
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```
#### Automatic Download from OpenCompass
Datasets can be downloaded automatically from the OpenCompass storage server. You can run the evaluation with the extra `--dry-run` flag to download these datasets.
Currently, the supported datasets are listed [here](https://github.com/open-compass/opencompass/blob/main/opencompass/utils/datasets_info.py#L259). More datasets will be uploaded soon.
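As a minimal sketch (reusing the demo model and dataset names from the evaluation examples later in this README), a download-only run might look like this:

```bash
# Download the required datasets without running the actual evaluation
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen --dry-run
```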
#### (Optional) Automatic Download with ModelScope
Alternatively, you can use [ModelScope](www.modelscope.cn) to load the datasets on demand.
Installation:
```bash
pip install modelscope[framework]
export DATASET_SOURCE=ModelScope
```
Then submit the evaluation task without downloading all the data to your local disk (see the example command after the list below). Available datasets include:
```bash
humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ceval, math, LCSTS, Xsum, winogrande, openbookqa, AGIEval, gsm8k, nq, race, siqa, mbpp, mmlu, hellaswag, ARC, BBH, xstory_cloze, summedits, GAOKAO-BENCH, OCNLI, cmnli
```
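As an illustrative sketch (the model and dataset names are reused from the demo commands later in this README), a ModelScope-sourced run could look like:

```bash
# Datasets are pulled from ModelScope on demand at evaluation time
export DATASET_SOURCE=ModelScope
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
```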
Some third-party features, like HumanEval and Llama, may require additional steps to work properly. For detailed steps, please refer to the [Installation Guide](https://opencompass.readthedocs.io/en/latest/get_started/installation.html).
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🏗️ Evaluation
After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared, you can start your first evaluation using OpenCompass!
- Your first evaluation with OpenCompass!
OpenCompass supports setting your configs via the CLI or a Python script. For simple evaluation settings we recommend the CLI; for more complex evaluations, the script approach is suggested.
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass examples/eval_chat_demo.py
```
You can find more script examples under the [examples](./examples) folder.
- API evaluation
By design, OpenCompass does not discriminate between open-source models and API models. You can evaluate both model types in the same way, or even in one setting.
```bash
export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# CLI
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass examples/eval_api_demo.py
# You can use o1_mini_2024_09_12/o1_preview_2024_09_12 for o1 models; max_completion_tokens=8192 is set by default.
```
- Accelerated Evaluation
Additionally, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the command below. Please ensure that you have installed the necessary packages for the chosen backend and that your model supports accelerated inference with it. For more information, see the documentation on inference acceleration backends [here](docs/en/advanced_guides/accelerator_intro.md). Below is an example using LMDeploy:
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
# Python scripts
opencompass examples/eval_lmdeploy_demo.py
```
- Supported Models
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
If the model is not on the list but is supported by the HuggingFace AutoModel class, you can still evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
If you want to use multiple GPUs to evaluate the model with data parallelism, you can use `--max-num-worker`.
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```
> \[!TIP\]
>
> `--hf-num-gpus` is used for model parallelism (HuggingFace format), while `--max-num-worker` is used for data parallelism.
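
For example, the two flags can be combined. The sketch below assumes four visible GPUs, with each model replica sharded across two of them; model and dataset names are reused from the commands above:

```bash
# Model parallelism (2 GPUs per replica) combined with data parallelism (2 workers)
CUDA_VISIBLE_DEVICES=0,1,2,3 opencompass --datasets demo_gsm8k_chat_gen \
    --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat \
    --hf-num-gpus 2 --max-num-worker 2
```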
> \[!TIP\]
>
> Configurations with `_ppl` are typically designed for base models.
> Configurations with `_gen` can be used for both base models and chat models.
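
As a rough illustration (the `mmlu_ppl` dataset name and the base-model path below are assumptions picked from the predefined configs; use `python tools/list_configs.py` to confirm what is available):

```bash
# A `_gen` config evaluated with a chat model
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# A `_ppl` config evaluated with a base model
opencompass --hf-type base --hf-path internlm/internlm2_5-1_8b --datasets mmlu_ppl
```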
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📣 OpenCompass 2.0
We are thrilled to introduce OpenCompass 2.0, an advanced suite featuring three key components: [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home).
![oc20](https://github.com/tonysy/opencompass/assets/7881589/90dbe1c0-c323-470a-991e-2b37ab5350b2)
**CompassRank** has been significantly enhanced into a leaderboard that now incorporates both open-source and proprietary benchmarks. This upgrade allows for a more comprehensive evaluation of models across the industry.
**CompassHub** presents a pioneering benchmark browser interface, designed to simplify and expedite the exploration and utilization of an extensive array of benchmarks for researchers and practitioners alike. To enhance the visibility of your own benchmark within the community, we warmly invite you to contribute it to CompassHub. You may initiate the submission process by clicking [here](https://hub.opencompass.org.cn/dataset-submit).
**CompassKit** is a powerful collection of evaluation toolkits specifically tailored for Large Language Models and Large Vision-language Models. It provides an extensive set of tools to assess and measure the performance of these complex models effectively. You are welcome to try our toolkits in your research and products.
## ✨ Introduction
![image](https://github.com/open-compass/opencompass/assets/22607038/f45fe125-4aed-4f8c-8fe8-df4efb41a8ea)
OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include:
- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.
- **Efficient distributed evaluation**: One line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
- **Diversified evaluation paradigms**: Support for zero-shot, few-shot, and chain-of-thought evaluations, combined with standard or dialogue-type prompt templates, to easily stimulate the maximum performance of various models.
- **Modular design with high extensibility**: Want to add new models or datasets, customize an advanced task division strategy, or even support a new cluster management system? Everything about OpenCompass can be easily expanded!
- **Experiment management and reporting mechanism**: Use config files to fully record each experiment, and support real-time reporting of results.
## 📖 Dataset Support
A statistical list of all datasets that can be used on this platform is available in the documentation on the OpenCompass website.
You can quickly find the dataset you need from the list through sorting, filtering, and searching functions.
Please refer to the dataset statistics chapter of the [official documentation](https://opencompass.org.cn/doc) for details.
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Model Support
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Open-source Models</b>
</td>
<td>
<b>API Models</b>
</td>
<!-- <td>
<b>Custom Models</b>
</td> -->
</tr>
<tr valign="top">
<td>
- [Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- [Baichuan](https://github.com/baichuan-inc)
- [BlueLM](https://github.com/vivo-ai-lab/BlueLM)
- [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B)
- [ChatGLM3](https://github.com/THUDM/ChatGLM3-6B)
- [Gemma](https://huggingface.co/google/gemma-7b)
- [InternLM](https://github.com/InternLM/InternLM)
- [LLaMA](https://github.com/facebookresearch/llama)
- [LLaMA3](https://github.com/meta-llama/llama3)
- [Qwen](https://github.com/QwenLM/Qwen)
- [TigerBot](https://github.com/TigerResearch/TigerBot)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [WizardLM](https://github.com/nlpxucan/WizardLM)
- [Yi](https://github.com/01-ai/Yi)
- ……
</td>
<td>
- OpenAI
- Gemini
- Claude
- ZhipuAI(ChatGLM)
- Baichuan
- ByteDance(YunQue)
- Huawei(PanGu)
- 360
- Baidu(ERNIEBot)
- MiniMax(ABAB-Chat)
- SenseTime(nova)
- Xunfei(Spark)
- ……
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🔜 Roadmap
- [x] Subjective Evaluation
- [x] Release CompassArena.
- [x] Subjective evaluation.
- [x] Long-context
- [x] Long-context evaluation with extensive datasets.
- [ ] Long-context leaderboard.
- [x] Coding
- [ ] Coding evaluation leaderboard.
- [x] Non-python language evaluation service.
- [x] Agent
- [ ] Support various agent frameworks.
- [x] Evaluation of tool use of the LLMs.
- [x] Robustness
- [x] Support various attack methods.
## 👷‍♂️ Contributing
We appreciate all contributions to improving OpenCompass. Please refer to the [contributing guideline](https://opencompass.readthedocs.io/en/latest/notes/contribution_guide.html) for the best practice.
<!-- Copy-paste in your Readme.md file -->
<!-- Made with [OSS Insight](https://ossinsight.io/) -->
<a href="https://github.com/open-compass/opencompass/graphs/contributors" target="_blank">
<table>
<tr>
<th colspan="2">
<br><img src="https://contrib.rocks/image?repo=open-compass/opencompass"><br><br>
</th>
</tr>
</table>
</a>
## 🤝 Acknowledgements
Some code in this project is cited and modified from [OpenICL](https://github.com/Shark-NLP/OpenICL).
Some datasets and prompt implementations are modified from [chain-of-thought-hub](https://github.com/FranxYao/chain-of-thought-hub) and [instruct-eval](https://github.com/declare-lab/instruct-eval).
## 🖊️ Citation
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
```
<p align="right"><a href="#top">🔝Back to top</a></p>
[github-contributors-link]: https://github.com/open-compass/opencompass/graphs/contributors
[github-contributors-shield]: https://img.shields.io/github/contributors/open-compass/opencompass?color=c4f042&labelColor=black&style=flat-square
[github-forks-link]: https://github.com/open-compass/opencompass/network/members
[github-forks-shield]: https://img.shields.io/github/forks/open-compass/opencompass?color=8ae8ff&labelColor=black&style=flat-square
[github-issues-link]: https://github.com/open-compass/opencompass/issues
[github-issues-shield]: https://img.shields.io/github/issues/open-compass/opencompass?color=ff80eb&labelColor=black&style=flat-square
[github-license-link]: https://github.com/open-compass/opencompass/blob/main/LICENSE
[github-license-shield]: https://img.shields.io/github/license/open-compass/opencompass?color=white&labelColor=black&style=flat-square
[github-release-link]: https://github.com/open-compass/opencompass/releases
[github-release-shield]: https://img.shields.io/github/v/release/open-compass/opencompass?color=369eff&labelColor=black&logo=github&style=flat-square
[github-releasedate-link]: https://github.com/open-compass/opencompass/releases
[github-releasedate-shield]: https://img.shields.io/github/release-date/open-compass/opencompass?labelColor=black&style=flat-square
[github-stars-link]: https://github.com/open-compass/opencompass/stargazers
[github-stars-shield]: https://img.shields.io/github/stars/open-compass/opencompass?color=ffcb47&labelColor=black&style=flat-square
[github-trending-shield]: https://trendshift.io/api/badge/repositories/6630
[github-trending-url]: https://trendshift.io/repositories/6630
<div align="center">
<img src="docs/zh_cn/_static/image/logo.svg" width="500px"/>
<br />
<br />
[![][github-release-shield]][github-release-link]
[![][github-releasedate-shield]][github-releasedate-link]
[![][github-contributors-shield]][github-contributors-link]<br>
[![][github-forks-shield]][github-forks-link]
[![][github-stars-shield]][github-stars-link]
[![][github-issues-shield]][github-issues-link]
[![][github-license-shield]][github-license-link]
<!-- [![PyPI](https://badge.fury.io/py/opencompass.svg)](https://pypi.org/project/opencompass/) -->
[🌐官方网站](https://opencompass.org.cn/) |
[📖数据集社区](https://hub.opencompass.org.cn/home) |
[📊性能榜单](https://rank.opencompass.org.cn/home) |
[📘文档教程](https://opencompass.readthedocs.io/zh_CN/latest/index.html) |
[🛠️安装](https://opencompass.readthedocs.io/zh_CN/latest/get_started/installation.html) |
[🤔报告问题](https://github.com/open-compass/opencompass/issues/new/choose)
[English](/README.md) | 简体中文
[![][github-trending-shield]][github-trending-url]
</div>
<p align="center">
👋 加入我们的 <a href="https://discord.gg/KKwfEbFj7U" target="_blank">Discord</a> 和 <a href="https://r.vansin.top/?r=opencompass" target="_blank">微信社区</a>
</p>
> \[!IMPORTANT\]
>
> **收藏项目**,你将能第一时间获取 OpenCompass 的最新动态~⭐️
<details>
<summary><kbd>Star History</kbd></summary>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&theme=dark&type=Date">
<img width="100%" src="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&type=Date">
</picture>
</details>
## 🧭 欢迎来到 **OpenCompass**!
就像指南针在我们的旅程中为我们导航一样,我们希望OpenCompass能够帮助你穿越评估大型语言模型的重重迷雾。OpenCompass提供丰富的算法和功能支持,期待OpenCompass能够帮助社区更便捷地对NLP模型的性能进行公平全面的评估。
🚩🚩🚩 欢迎加入 OpenCompass!我们目前**招聘全职研究人员/工程师和实习生**。如果您对 LLM 和 OpenCompass 充满热情,请随时通过[电子邮件](mailto:zhangsongyang@pjlab.org.cn)与我们联系。我们非常期待与您交流!
🔥🔥🔥 祝贺 **OpenCompass 作为大模型标准测试工具被Meta AI官方推荐**, 点击 Llama 的 [入门文档](https://ai.meta.com/llama/get-started/#validation) 获取更多信息。
> **注意**<br />
> 重要通知:从 v0.4.0 版本开始,所有位于 ./configs/datasets、./configs/models 和 ./configs/summarizers 目录下的 AMOTIC 配置文件将迁移至 opencompass 包中。请及时更新您的配置文件路径。
## 🚀 最新进展 <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.02.28\]** 我们为 `DeepSeek-R1` 系列模型添加了教程,请查看 [评估推理模型](docs/en/user_guides/deepseek_r1.md) 了解更多详情!🔥🔥🔥
- **\[2025.02.15\]** 我们新增了两个实用的评测工具:用于 LLM 作为评判器的 `GenericLLMEvaluator` 和用于数学推理评估的 `MATHEvaluator`。查看 [LLM评判器](docs/zh_cn/advanced_guides/llm_judge.md) 和 [数学能力评测](docs/zh_cn/advanced_guides/general_math.md) 文档了解更多详情!🔥🔥🔥
- **\[2025.01.16\]** 我们现已支持 [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) 模型,该模型在推理、知识类任务上取得同量级最优性能,欢迎尝试。
- **\[2024.12.17\]** 我们提供了12月CompassAcademic学术榜单评估脚本 [CompassAcademic](configs/eval_academic_leaderboard_202412.py),你可以通过简单地配置复现官方评测结果。
- **\[2024.10.14\]** 现已支持OpenAI多语言问答数据集[MMMLU](https://huggingface.co/datasets/openai/MMMLU),欢迎尝试! 🔥🔥🔥
- **\[2024.09.19\]** 现已支持[Qwen2.5](https://huggingface.co/Qwen)(0.5B to 72B) ,可以使用多种推理后端(huggingface/vllm/lmdeploy), 欢迎尝试! 🔥🔥🔥
- **\[2024.09.05\]** 现已支持OpenAI o1 模型(`o1-mini-2024-09-12` and `o1-preview-2024-09-12`), 欢迎尝试! 🔥🔥🔥
- **\[2024.09.05\]** OpenCompass 现在支持通过模型后处理来进行答案提取,以更准确地展示模型的能力。作为此次更新的一部分,我们集成了 [XFinder](https://github.com/IAAR-Shanghai/xFinder) 作为首个后处理模型。具体信息请参阅 [文档](opencompass/utils/postprocessors/xfinder/README.md),欢迎尝试! 🔥🔥🔥
- **\[2024.08.20\]** OpenCompass 现已支持 [SciCode](https://github.com/scicode-bench/SciCode): A Research Coding Benchmark Curated by Scientists。 🔥🔥🔥
- **\[2024.08.16\]** OpenCompass 现已支持全新的长上下文语言模型评估基准——[RULER](https://arxiv.org/pdf/2404.06654)。RULER 通过灵活的配置,提供了对长上下文包括检索、多跳追踪、聚合和问答等多种任务类型的评测,欢迎访问[RULER](configs/datasets/ruler/README.md)。🔥🔥🔥
- **\[2024.07.23\]** 我们支持了[Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)模型,欢迎试用!🔥🔥🔥
- **\[2024.07.23\]** 我们支持了[ModelScope](www.modelscope.cn)数据集,您可以按需加载,无需事先下载全部数据到本地,欢迎试用!🔥🔥🔥
- **\[2024.07.17\]** 我们发布了CompassBench-202407榜单的示例数据和评测规则,敬请访问 [CompassBench](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/compassbench_intro.html) 获取更多信息。 🔥🔥🔥
- **\[2024.07.17\]** 我们正式发布 NeedleBench 的[技术报告](http://arxiv.org/abs/2407.11963)。诚邀您访问我们的[帮助文档](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/needleinahaystack_eval.html)进行评估。🔥🔥🔥
- **\[2024.07.04\]** OpenCompass 现已支持 InternLM2.5,它拥有卓越的推理性能、有效支持百万字超长上下文,以及整体升级的工具调用能力,欢迎访问 [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) 和 [InternLM](https://github.com/InternLM/InternLM)。🔥🔥🔥
- **\[2024.06.20\]** OpenCompass 现已支持一键切换推理加速后端,助力评测过程更加高效。除了默认的 HuggingFace 推理后端外,还支持了常用的 [LMDeploy](https://github.com/InternLM/lmdeploy) 和 [vLLM](https://github.com/vllm-project/vllm),支持命令行一键切换和部署 API 加速服务两种方式,详细使用方法见[文档](docs/zh_cn/advanced_guides/accelerator_intro.md)。欢迎试用!🔥🔥🔥
> [更多](docs/zh_cn/notes/news.md)
## 📊 性能榜单
我们将陆续提供开源模型和 API 模型的具体性能榜单,请见 [OpenCompass Leaderboard](https://rank.opencompass.org.cn/home) 。如需加入评测,请提供模型仓库地址或标准的 API 接口至邮箱 `opencompass@pjlab.org.cn`.
你也可以参考[CompassAcademic](configs/eval_academic_leaderboard_202412.py),快速地复现榜单的结果,目前选取的数据集包括 综合知识推理 (MMLU-Pro/GPQA Diamond) ,逻辑推理 (BBH) ,数学推理 (MATH-500, AIME) ,代码生成 (LiveCodeBench, HumanEval) ,指令跟随 (IFEval) 。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🛠️ 安装指南
下面提供了快速安装和数据集准备的步骤。
### 💻 环境搭建
我们强烈建议使用 `conda` 来管理您的 Python 环境。
- #### 创建虚拟环境
```bash
conda create --name opencompass python=3.10 -y
conda activate opencompass
```
- #### 通过pip安装OpenCompass
```bash
# 支持绝大多数数据集及模型
pip install -U opencompass
# 完整安装(支持更多数据集)
# pip install "opencompass[full]"
# 模型推理后端,由于这些推理后端通常存在依赖冲突,建议使用不同的虚拟环境来管理它们。
# pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"
# API 测试(例如 OpenAI、Qwen)
# pip install "opencompass[api]"
```
- #### 基于源码安装OpenCompass
如果希望使用 OpenCompass 的最新功能,也可以从源代码构建它:
```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# pip install -e ".[full]"
# pip install -e ".[vllm]"
```
### 📂 数据准备
#### 提前离线下载
OpenCompass支持使用本地数据集进行评测,数据集的下载和解压可以通过以下命令完成:
```bash
# 下载数据集到 data/ 处
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```
#### 从 OpenCompass 自动下载
我们已经支持从OpenCompass存储服务器自动下载数据集。您可以通过额外的 `--dry-run` 参数来运行评估以下载这些数据集。
目前支持的数据集列表在[这里](https://github.com/open-compass/opencompass/blob/main/opencompass/utils/datasets_info.py#L259)。更多数据集将会很快上传。
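作为示意(此处沿用下文示例中的模型与数据集名称,仅供参考),只下载数据集而不执行评测的命令大致如下:

```bash
# 仅下载所需数据集,不执行实际评测
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen --dry-run
```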
#### (可选) 使用 ModelScope 自动下载
另外,您还可以使用[ModelScope](www.modelscope.cn)来加载数据集:
环境准备:
```bash
pip install modelscope
export DATASET_SOURCE=ModelScope
```
配置好环境后,无需下载全部数据,直接提交评测任务即可。目前支持的数据集有:
```bash
humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ceval, math, LCSTS, Xsum, winogrande, openbookqa, AGIEval, gsm8k, nq, race, siqa, mbpp, mmlu, hellaswag, ARC, BBH, xstory_cloze, summedits, GAOKAO-BENCH, OCNLI, cmnli
```
有部分第三方功能,如 Humaneval 以及 Llama,可能需要额外步骤才能正常运行,详细步骤请参考[安装指南](https://opencompass.readthedocs.io/zh_CN/latest/get_started/installation.html)
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🏗️ 评测
在确保按照上述步骤正确安装了 OpenCompass 并准备好了数据集之后,现在您可以开始使用 OpenCompass 进行首次评估!
- ### 首次评测
OpenCompass 支持通过命令行界面 (CLI) 或 Python 脚本来设置配置。对于简单的评估设置,我们推荐使用 CLI;而对于更复杂的评估,则建议使用脚本方式。你可以在examples文件夹下找到更多脚本示例。
```bash
# 命令行界面 (CLI)
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# Python 脚本
opencompass examples/eval_chat_demo.py
```
你可以在[examples](./examples) 文件夹下找到更多的脚本示例。
- ### API评测
OpenCompass 在设计上并不区分开源模型与 API 模型。您可以以相同的方式、甚至在同一设置中评估这两种类型的模型。
```bash
export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# 命令行界面 (CLI)
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
# Python 脚本
opencompass examples/eval_api_demo.py
# 现已支持 o1_mini_2024_09_12/o1_preview_2024_09_12 模型, 默认情况下 max_completion_tokens=8192.
```
- ### 推理后端
另外,如果您想使用除 HuggingFace 之外的推理后端来进行加速评估,比如 LMDeploy 或 vLLM,可以通过以下命令进行。请确保您已经为所选的后端安装了必要的软件包,并且您的模型支持该后端的加速推理。更多信息,请参阅关于推理加速后端的文档 [这里](docs/zh_cn/advanced_guides/accelerator_intro.md)。以下是使用 LMDeploy 的示例:
```bash
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
```
- ### 支持的模型
OpenCompass 预定义了许多模型和数据集的配置,你可以通过 [工具](./docs/zh_cn/tools.md#ListConfigs) 列出所有可用的模型和数据集配置。
```bash
# 列出所有配置
python tools/list_configs.py
# 列出所有跟 llama 及 mmlu 相关的配置
python tools/list_configs.py llama mmlu
```
如果模型不在列表中但支持 Huggingface AutoModel 类,您仍然可以使用 OpenCompass 对其进行评估。欢迎您贡献维护 OpenCompass 支持的模型和数据集列表。
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
如果您想使用多块 GPU 以数据并行的方式评测模型,可以使用 `--max-num-worker` 参数。
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```
> \[!TIP\]
>
> `--hf-num-gpus` 用于模型并行(HuggingFace 格式),`--max-num-worker` 用于数据并行。
> \[!TIP\]
>
> 带 `_ppl` 的配置通常面向基座模型。
> 带 `_gen` 的配置可同时用于基座模型和对话模型。
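
下面给出一个简单示意(其中 `mmlu_ppl` 数据集名称及基座模型路径为示例假设,可用 `python tools/list_configs.py` 确认实际可用配置):

```bash
# `_gen` 配置搭配对话模型
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# `_ppl` 配置搭配基座模型
opencompass --hf-type base --hf-path internlm/internlm2_5-1_8b --datasets mmlu_ppl
```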
通过命令行或配置文件,OpenCompass 还支持评测 API 或自定义模型,以及更多样化的评测策略。请阅读[快速开始](https://opencompass.readthedocs.io/zh_CN/latest/get_started/quick_start.html)了解如何运行一个评测任务。
更多教程请查看我们的[文档](https://opencompass.readthedocs.io/zh_CN/latest/index.html)
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📣 OpenCompass 2.0
我们很高兴发布 OpenCompass 司南 2.0 大模型评测体系,它主要由三大核心模块构建而成:[CompassKit](https://github.com/open-compass)、[CompassHub](https://hub.opencompass.org.cn/home) 以及 [CompassRank](https://rank.opencompass.org.cn/home)。
**CompassRank** 系统进行了重大革新与提升,现已成为一个兼容并蓄的排行榜体系,不仅囊括了开源基准测试项目,还包含了私有基准测试。此番升级极大地拓宽了对行业内各类模型进行全面而深入测评的可能性。
**CompassHub** 创新性地推出了一个基准测试资源导航平台,其设计初衷旨在简化和加快研究人员及行业从业者在多样化的基准测试库中进行搜索与利用的过程。为了让更多独具特色的基准测试成果得以在业内广泛传播和应用,我们热忱欢迎各位将自定义的基准数据贡献至CompassHub平台。只需轻点鼠标,通过访问[这里](https://hub.opencompass.org.cn/dataset-submit),即可启动提交流程。
**CompassKit** 是一系列专为大型语言模型和大型视觉-语言模型打造的强大评估工具合集,它所提供的全面评测工具集能够有效地对这些复杂模型的功能性能进行精准测量和科学评估。在此,我们诚挚邀请您在学术研究或产品研发过程中积极尝试运用我们的工具包,以助您取得更加丰硕的研究成果和产品优化效果。
## ✨ 介绍
![image](https://github.com/open-compass/opencompass/assets/22607038/30bcb2e2-3969-4ac5-9f29-ad3f4abb4f3b)
OpenCompass 是面向大模型评测的一站式平台。其主要特点如下:
- **开源可复现**:提供公平、公开、可复现的大模型评测方案
- **全面的能力维度**:五大维度设计,提供 70+ 个数据集约 40 万题的模型评测方案,全面评估模型能力
- **丰富的模型支持**:已支持 20+ HuggingFace 及 API 模型
- **分布式高效评测**:一行命令实现任务分割和分布式评测,数小时即可完成千亿模型全量评测
- **多样化评测范式**:支持零样本、小样本及思维链评测,结合标准型或对话型提示词模板,轻松激发各种模型最大性能
- **灵活化拓展**:想增加新模型或数据集?想要自定义更高级的任务分割策略,甚至接入新的集群管理系统?OpenCompass 的一切均可轻松扩展!
## 📖 数据集支持
我们已经在 OpenCompass 官网的文档中提供了所有可在本平台上使用的数据集的统计列表。
您可以通过排序、筛选和搜索等功能从列表中快速找到您需要的数据集。
详情请参阅 [官方文档](https://opencompass.org.cn/doc) 的数据集统计章节。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📖 模型支持
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>开源模型</b>
</td>
<td>
<b>API 模型</b>
</td>
<!-- <td>
<b>自定义模型</b>
</td> -->
</tr>
<tr valign="top">
<td>
- [Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- [Baichuan](https://github.com/baichuan-inc)
- [BlueLM](https://github.com/vivo-ai-lab/BlueLM)
- [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B)
- [ChatGLM3](https://github.com/THUDM/ChatGLM3-6B)
- [Gemma](https://huggingface.co/google/gemma-7b)
- [InternLM](https://github.com/InternLM/InternLM)
- [LLaMA](https://github.com/facebookresearch/llama)
- [LLaMA3](https://github.com/meta-llama/llama3)
- [Qwen](https://github.com/QwenLM/Qwen)
- [TigerBot](https://github.com/TigerResearch/TigerBot)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [WizardLM](https://github.com/nlpxucan/WizardLM)
- [Yi](https://github.com/01-ai/Yi)
- ……
</td>
<td>
- OpenAI
- Gemini
- Claude
- ZhipuAI(ChatGLM)
- Baichuan
- ByteDance(YunQue)
- Huawei(PanGu)
- 360
- Baidu(ERNIEBot)
- MiniMax(ABAB-Chat)
- SenseTime(nova)
- Xunfei(Spark)
- ……
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🔜 路线图
- [x] 主观评测
- [x] 发布主观评测榜单
- [x] 发布主观评测数据集
- [x] 长文本
- [x] 支持广泛的长文本评测集
- [ ] 发布长文本评测榜单
- [x] 代码能力
- [ ] 发布代码能力评测榜单
- [x] 提供非Python语言的评测服务
- [x] 智能体
- [ ] 支持丰富的智能体方案
- [x] 提供智能体评测榜单
- [x] 鲁棒性
- [x] 支持各类攻击方法
## 👷‍♂️ 贡献
我们感谢所有的贡献者为改进和提升 OpenCompass 所作出的努力。请参考[贡献指南](https://opencompass.readthedocs.io/zh_CN/latest/notes/contribution_guide.html)来了解参与项目贡献的相关指引。
<a href="https://github.com/open-compass/opencompass/graphs/contributors" target="_blank">
<table>
<tr>
<th colspan="2">
<br><img src="https://contrib.rocks/image?repo=open-compass/opencompass"><br><br>
</th>
</tr>
</table>
</a>
## 🤝 致谢
该项目部分的代码引用并修改自 [OpenICL](https://github.com/Shark-NLP/OpenICL)。
该项目部分的数据集和提示词实现修改自 [chain-of-thought-hub](https://github.com/FranxYao/chain-of-thought-hub) 和 [instruct-eval](https://github.com/declare-lab/instruct-eval)。
## 🖊️ 引用
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
```
<p align="right"><a href="#top">🔝返回顶部</a></p>
[github-contributors-link]: https://github.com/open-compass/opencompass/graphs/contributors
[github-contributors-shield]: https://img.shields.io/github/contributors/open-compass/opencompass?color=c4f042&labelColor=black&style=flat-square
[github-forks-link]: https://github.com/open-compass/opencompass/network/members
[github-forks-shield]: https://img.shields.io/github/forks/open-compass/opencompass?color=8ae8ff&labelColor=black&style=flat-square
[github-issues-link]: https://github.com/open-compass/opencompass/issues
[github-issues-shield]: https://img.shields.io/github/issues/open-compass/opencompass?color=ff80eb&labelColor=black&style=flat-square
[github-license-link]: https://github.com/open-compass/opencompass/blob/main/LICENSE
[github-license-shield]: https://img.shields.io/github/license/open-compass/opencompass?color=white&labelColor=black&style=flat-square
[github-release-link]: https://github.com/open-compass/opencompass/releases
[github-release-shield]: https://img.shields.io/github/v/release/open-compass/opencompass?color=369eff&labelColor=black&logo=github&style=flat-square
[github-releasedate-link]: https://github.com/open-compass/opencompass/releases
[github-releasedate-shield]: https://img.shields.io/github/release-date/open-compass/opencompass?labelColor=black&style=flat-square
[github-stars-link]: https://github.com/open-compass/opencompass/stargazers
[github-stars-shield]: https://img.shields.io/github/stars/open-compass/opencompass?color=ffcb47&labelColor=black&style=flat-square
[github-trending-shield]: https://trendshift.io/api/badge/repositories/6630
[github-trending-url]: https://trendshift.io/repositories/6630
- ifeval:
name: IFEval
category: Instruction Following
paper: https://arxiv.org/pdf/2311.07911
configpath: opencompass/configs/datasets/IFEval
- nphard:
name: NPHardEval
category: Reasoning
paper: https://arxiv.org/pdf/2312.14890v2
configpath: opencompass/configs/datasets/NPHardEval
- pmmeval:
name: PMMEval
category: Language
paper: https://arxiv.org/pdf/2411.09116v1
configpath: opencompass/configs/datasets/PMMEval
- theoremqa:
    name: TheoremQA
category: Reasoning
paper: https://arxiv.org/pdf/2305.12524
    configpath: opencompass/configs/datasets/TheoremQA
- agieval:
name: AGIEval
category: Examination
paper: https://arxiv.org/pdf/2304.06364
configpath: opencompass/configs/datasets/agieval
- babilong:
name: BABILong
category: Long Context
paper: https://arxiv.org/pdf/2406.10149
configpath: opencompass/configs/datasets/babilong
- bigcodebench:
name: BigCodeBench
category: Code
paper: https://arxiv.org/pdf/2406.15877
configpath: opencompass/configs/datasets/bigcodebench
- calm:
name: CaLM
category: Reasoning
paper: https://arxiv.org/pdf/2405.00622
configpath: opencompass/configs/datasets/calm
- infinitebench:
name: InfiniteBench (∞Bench)
category: Long Context
paper: https://aclanthology.org/2024.acl-long.814.pdf
configpath: opencompass/configs/datasets/infinitebench
- korbench:
name: KOR-Bench
category: Reasoning
paper: https://arxiv.org/pdf/2410.06526v1
configpath: opencompass/configs/datasets/korbench
- lawbench:
name: LawBench
category: Knowledge / Law
paper: https://arxiv.org/pdf/2309.16289
configpath: opencompass/configs/datasets/lawbench
- leval:
name: L-Eval
category: Long Context
paper: https://arxiv.org/pdf/2307.11088v1
configpath: opencompass/configs/datasets/leval
- livecodebench:
name: LiveCodeBench
category: Code
paper: https://arxiv.org/pdf/2403.07974
configpath: opencompass/configs/datasets/livecodebench
- livemathbench:
name: LiveMathBench
category: Math
paper: https://arxiv.org/pdf/2412.13147
configpath: opencompass/configs/datasets/livemathbench
- longbench:
name: LongBench
category: Long Context
paper: https://github.com/THUDM/LongBench
    configpath: opencompass/configs/datasets/longbench
- lveval:
name: LV-Eval
category: Long Context
paper: https://arxiv.org/pdf/2402.05136
configpath: opencompass/configs/datasets/lveval
- medbench:
name: MedBench
category: Knowledge / Medicine
paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10778138
configpath: opencompass/configs/datasets/MedBench
- musr:
name: MuSR
category: Reasoning
paper: https://arxiv.org/pdf/2310.16049
configpath: opencompass/configs/datasets/musr
- needlebench:
name: NeedleBench
category: Long Context
paper: https://arxiv.org/pdf/2407.11963
configpath: opencompass/configs/datasets/needlebench
- ruler:
name: RULER
category: Long Context
paper: https://arxiv.org/pdf/2404.06654
configpath: opencompass/configs/datasets/ruler
- alignment:
name: AlignBench
category: Subjective / Alignment
paper: https://arxiv.org/pdf/2311.18743
configpath: opencompass/configs/datasets/subjective/alignbench
- alpaca:
name: AlpacaEval
category: Subjective / Instruction Following
paper: https://github.com/tatsu-lab/alpaca_eval
    configpath: opencompass/configs/datasets/subjective/alpaca_eval
- arenahard:
name: Arena-Hard
category: Subjective / Chatbot
paper: https://lmsys.org/blog/2024-04-19-arena-hard/
configpath: opencompass/configs/datasets/subjective/arena_hard
- flames:
name: FLAMES
category: Subjective / Alignment
paper: https://arxiv.org/pdf/2311.06899
configpath: opencompass/configs/datasets/subjective/flames
- fofo:
name: FOFO
category: Subjective / Format Following
paper: https://arxiv.org/pdf/2402.18667
configpath: opencompass/configs/datasets/subjective/fofo
- followbench:
name: FollowBench
category: Subjective / Instruction Following
paper: https://arxiv.org/pdf/2310.20410
configpath: opencompass/configs/datasets/subjective/followbench
- hellobench:
name: HelloBench
category: Subjective / Long Context
paper: https://arxiv.org/pdf/2409.16191
configpath: opencompass/configs/datasets/subjective/hellobench
- judgerbench:
name: JudgerBench
category: Subjective / Long Context
paper: https://arxiv.org/pdf/2410.16256
configpath: opencompass/configs/datasets/subjective/judgerbench
- multiround:
name: MT-Bench-101
category: Subjective / Multi-Round
paper: https://arxiv.org/pdf/2402.14762
configpath: opencompass/configs/datasets/subjective/multiround
- wildbench:
name: WildBench
category: Subjective / Real Task
paper: https://arxiv.org/pdf/2406.04770
configpath: opencompass/configs/datasets/subjective/wildbench
- teval:
name: T-Eval
category: Tool Utilization
paper: https://arxiv.org/pdf/2312.14033
configpath: opencompass/configs/datasets/teval
- financeiq:
name: FinanceIQ
category: Knowledge / Finance
paper: https://github.com/Duxiaoman-DI/XuanYuan/tree/main/FinanceIQ
configpath: opencompass/configs/datasets/FinanceIQ
- gaokaobench:
name: GAOKAOBench
category: Examination
paper: https://arxiv.org/pdf/2305.12474
configpath: opencompass/configs/datasets/GaokaoBench
- lcbench:
name: LCBench
category: Code
paper: https://github.com/open-compass/CodeBench/
configpath: opencompass/configs/datasets/LCBench
- MMLUArabic:
name: ArabicMMLU
category: Language
paper: https://arxiv.org/pdf/2402.12840
configpath: opencompass/configs/datasets/MMLUArabic
- OpenFinData:
name: OpenFinData
category: Knowledge / Finance
paper: https://github.com/open-compass/OpenFinData
configpath: opencompass/configs/datasets/OpenFinData
- QuALITY:
name: QuALITY
category: Long Context
paper: https://arxiv.org/pdf/2112.08608
configpath: opencompass/configs/datasets/QuALITY
- advglue:
name: Adversarial GLUE
category: Safety
paper: https://openreview.net/pdf?id=GF9cSKI3A_q
configpath: opencompass/configs/datasets/adv_glue
- afqmcd:
name: CLUE / AFQMC
category: Language
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_afqmc
- aime2024:
name: AIME2024
category: Examination
paper: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
configpath: opencompass/configs/datasets/aime2024
- anli:
name: Adversarial NLI
category: Reasoning
paper: https://arxiv.org/pdf/1910.14599v2
configpath: opencompass/configs/datasets/anli
- anthropics_evals:
name: Anthropics Evals
category: Safety
paper: https://arxiv.org/pdf/2212.09251
configpath: opencompass/configs/datasets/anthropics_evals
- apps:
name: APPS
category: Code
paper: https://arxiv.org/pdf/2105.09938
configpath: opencompass/configs/datasets/apps
- arc:
name: ARC
category: Reasoning
paper: https://arxiv.org/pdf/1803.05457
configpath: [opencompass/configs/datasets/ARC_c, opencompass/configs/datasets/ARC_e]
- arc_prize_public_eval:
name: ARC Prize
category: ARC-AGI
paper: https://arcprize.org/guide#private
configpath: opencompass/configs/datasets/ARC_Prize_Public_Evaluation
- ax:
name: SuperGLUE / AX
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: [opencompass/configs/datasets/SuperGLUE_AX_b, opencompass/configs/datasets/SuperGLUE_AX_g]
- bbh:
name: BIG-Bench Hard
category: Reasoning
paper: https://arxiv.org/pdf/2210.09261
configpath: opencompass/configs/datasets/bbh
- BoolQ:
name: SuperGLUE / BoolQ
category: Knowledge
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_BoolQ
- c3:
name: CLUE / C3 (C³)
category: Understanding
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_C3
- cb:
name: SuperGLUE / CB
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_CB
- ceval:
name: C-EVAL
category: Examination
paper: https://arxiv.org/pdf/2305.08322v1
configpath: opencompass/configs/datasets/ceval
- charm:
name: CHARM
category: Reasoning
paper: https://arxiv.org/pdf/2403.14112
configpath: opencompass/configs/datasets/CHARM
- chembench:
name: ChemBench
category: Knowledge / Chemistry
paper: https://arxiv.org/pdf/2404.01475
configpath: opencompass/configs/datasets/ChemBench
- chid:
name: FewCLUE / CHID
category: Language
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_chid
- chinese_simpleqa:
name: Chinese SimpleQA
category: Knowledge
paper: https://arxiv.org/pdf/2411.07140
configpath: opencompass/configs/datasets/chinese_simpleqa
- cibench:
name: CIBench
category: Code
paper: https://www.arxiv.org/pdf/2407.10499
configpath: opencompass/configs/datasets/CIBench
- civilcomments:
name: CivilComments
category: Safety
paper: https://arxiv.org/pdf/1903.04561
configpath: opencompass/configs/datasets/civilcomments
- clozeTest_maxmin:
name: Cloze Test-max/min
category: Code
paper: https://arxiv.org/pdf/2102.04664
configpath: opencompass/configs/datasets/clozeTest_maxmin
- cluewsc:
name: FewCLUE / CLUEWSC
category: Language / WSC
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_cluewsc
- cmb:
name: CMB
category: Knowledge / Medicine
paper: https://arxiv.org/pdf/2308.08833
configpath: opencompass/configs/datasets/cmb
- cmmlu:
name: CMMLU
category: Understanding
paper: https://arxiv.org/pdf/2306.09212
configpath: opencompass/configs/datasets/cmmlu
- cmnli:
name: CLUE / CMNLI
category: Reasoning
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_cmnli
- cmo_fib:
name: cmo_fib
category: Examination
paper: ""
configpath: opencompass/configs/datasets/cmo_fib
- cmrc:
name: CLUE / CMRC
category: Understanding
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_CMRC
- commonsenseqa:
name: CommonSenseQA
category: Knowledge
paper: https://arxiv.org/pdf/1811.00937v2
configpath: opencompass/configs/datasets/commonsenseqa
- commonsenseqa_cn:
name: CommonSenseQA-CN
category: Knowledge
paper: ""
configpath: opencompass/configs/datasets/commonsenseqa_cn
- copa:
name: SuperGLUE / COPA
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_COPA
- crowspairs:
name: CrowsPairs
category: Safety
paper: https://arxiv.org/pdf/2010.00133
configpath: opencompass/configs/datasets/crowspairs
- crowspairs_cn:
name: CrowsPairs-CN
category: Safety
paper: ""
configpath: opencompass/configs/datasets/crowspairs_cn
- cvalues:
name: CVALUES
category: Safety
paper: http://xdp-expriment.oss-cn-zhangjiakou.aliyuncs.com/shanqi.xgh/release_github/CValues.pdf
configpath: opencompass/configs/datasets/cvalues
- drcd:
name: CLUE / DRCD
category: Understanding
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_DRCD
- drop:
name: DROP (DROP Simple Eval)
category: Understanding
paper: https://arxiv.org/pdf/1903.00161
configpath: opencompass/configs/datasets/drop
- ds1000:
name: DS-1000
category: Code
paper: https://arxiv.org/pdf/2211.11501
configpath: opencompass/configs/datasets/ds1000
- eprstmt:
name: FewCLUE / EPRSTMT
category: Understanding
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_eprstmt
- flores:
name: Flores
category: Language
paper: https://aclanthology.org/D19-1632.pdf
configpath: opencompass/configs/datasets/flores
- game24:
name: Game24
category: Math
paper: https://huggingface.co/datasets/nlile/24-game
configpath: opencompass/configs/datasets/game24
- govrepcrs:
name: Government Report Dataset
category: Long Context
paper: https://aclanthology.org/2021.naacl-main.112.pdf
configpath: opencompass/configs/datasets/govrepcrs
- gpqa:
name: GPQA
category: Knowledge
paper: https://arxiv.org/pdf/2311.12022v1
configpath: opencompass/configs/datasets/gpqa
- gsm8k:
name: GSM8K
category: Math
paper: https://arxiv.org/pdf/2110.14168v2
configpath: opencompass/configs/datasets/gsm8k
- gsm_hard:
name: GSM-Hard
category: Math
paper: https://proceedings.mlr.press/v202/gao23f/gao23f.pdf
configpath: opencompass/configs/datasets/gsm_hard
- hle:
name: HLE(Humanity's Last Exam)
category: Reasoning
paper: https://lastexam.ai/paper
configpath: opencompass/configs/datasets/HLE
- hellaswag:
name: HellaSwag
category: Reasoning
paper: https://arxiv.org/pdf/1905.07830
configpath: opencompass/configs/datasets/hellaswag
- humaneval:
name: HumanEval
category: Code
paper: https://arxiv.org/pdf/2107.03374v2
configpath: opencompass/configs/datasets/humaneval
- humaneval_cn:
name: HumanEval-CN
category: Code
paper: ""
configpath: opencompass/configs/datasets/humaneval_cn
- humaneval_multi:
name: Multi-HumanEval
category: Code
paper: https://arxiv.org/pdf/2210.14868
configpath: opencompass/configs/datasets/humaneval_multi
- humanevalx:
name: HumanEval-X
category: Code
paper: https://dl.acm.org/doi/pdf/10.1145/3580305.3599790
configpath: opencompass/configs/datasets/humanevalx
- hungarian_math:
name: Hungarian_Math
category: Math
paper: https://huggingface.co/datasets/keirp/hungarian_national_hs_finals_exam
configpath: opencompass/configs/datasets/hungarian_exam
- iwslt2017:
name: IWSLT2017
category: Language
paper: https://cris.fbk.eu/bitstream/11582/312796/1/iwslt17-overview.pdf
configpath: opencompass/configs/datasets/iwslt2017
- jigsawmultilingual:
name: JigsawMultilingual
category: Safety
paper: https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/data
configpath: opencompass/configs/datasets/jigsawmultilingual
- lambada:
name: LAMBADA
category: Understanding
paper: https://arxiv.org/pdf/1606.06031
configpath: opencompass/configs/datasets/lambada
- lcsts:
name: LCSTS
category: Understanding
paper: https://aclanthology.org/D15-1229.pdf
configpath: opencompass/configs/datasets/lcsts
- livestembench:
name: LiveStemBench
category: ""
paper: ""
configpath: opencompass/configs/datasets/livestembench
- llm_compression:
name: LLM Compression
category: Bits Per Character (BPC)
paper: https://arxiv.org/pdf/2404.09937
configpath: opencompass/configs/datasets/llm_compression
- math:
name: MATH
category: Math
paper: https://arxiv.org/pdf/2103.03874
configpath: opencompass/configs/datasets/math
- math401:
name: MATH 401
category: Math
paper: https://arxiv.org/pdf/2304.02015
configpath: opencompass/configs/datasets/math401
- mathbench:
name: MathBench
category: Math
paper: https://arxiv.org/pdf/2405.12209
configpath: opencompass/configs/datasets/mathbench
- mbpp:
name: MBPP
category: Code
paper: https://arxiv.org/pdf/2108.07732
configpath: opencompass/configs/datasets/mbpp
- mbpp_cn:
name: MBPP-CN
category: Code
paper: ""
configpath: opencompass/configs/datasets/mbpp_cn
- mbpp_plus:
name: MBPP-PLUS
category: Code
paper: ""
configpath: opencompass/configs/datasets/mbpp_plus
- mgsm:
name: MGSM
category: Language / Math
paper: https://arxiv.org/pdf/2210.03057
configpath: opencompass/configs/datasets/mgsm
- mmlu:
name: MMLU
category: Understanding
paper: https://arxiv.org/pdf/2009.03300
configpath: opencompass/configs/datasets/mmlu
- mmlu_cf:
name: MMLU-CF
category: Understanding
paper: https://arxiv.org/pdf/2412.15194
configpath: opencompass/configs/datasets/mmlu_cf
- mmlu_pro:
name: MMLU-Pro
category: Understanding
paper: https://arxiv.org/pdf/2406.01574
configpath: opencompass/configs/datasets/mmlu_pro
- mmmlu:
name: MMMLU
category: Language / Understanding
paper: https://huggingface.co/datasets/openai/MMMLU
configpath: opencompass/configs/datasets/mmmlu
- multirc:
name: SuperGLUE / MultiRC
category: Understanding
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_MultiRC
- narrativeqa:
name: NarrativeQA
category: Understanding
paper: https://github.com/google-deepmind/narrativeqa
configpath: opencompass/configs/datasets/narrativeqa
- natural_question:
name: NaturalQuestions
category: Knowledge
paper: https://github.com/google-research-datasets/natural-questions
configpath: opencompass/configs/datasets/nq
- natural_question_cn:
name: NaturalQuestions-CN
category: Knowledge
paper: ""
configpath: opencompass/configs/datasets/nq_cn
- obqa:
name: OpenBookQA
category: Knowledge
paper: https://arxiv.org/pdf/1809.02789v1
configpath: opencompass/configs/datasets/obqa
- piqa:
    name: PIQA
category: Knowledge / Physics
paper: https://arxiv.org/pdf/1911.11641v1
configpath: opencompass/configs/datasets/piqa
- py150:
name: py150
category: Code
paper: https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-line
configpath: opencompass/configs/datasets/py150
- qasper:
name: Qasper
category: Long Context
paper: https://arxiv.org/pdf/2105.03011
configpath: opencompass/configs/datasets/qasper
- qaspercut:
name: Qasper-Cut
category: Long Context
paper: ""
configpath: opencompass/configs/datasets/qaspercut
- race:
name: RACE
category: Examination
paper: https://arxiv.org/pdf/1704.04683
configpath: opencompass/configs/datasets/race
- realtoxicprompts:
name: RealToxicPrompts
category: Safety
paper: https://arxiv.org/pdf/2009.11462
configpath: opencompass/configs/datasets/realtoxicprompts
- record:
name: SuperGLUE / ReCoRD
category: Understanding
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_ReCoRD
- rte:
name: SuperGLUE / RTE
category: Reasoning
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_RTE
- ocnli:
name: CLUE / OCNLI
category: Reasoning
paper: https://arxiv.org/pdf/2004.05986
configpath: opencompass/configs/datasets/CLUE_ocnli
- rolebench:
name: RoleBench
category: Role Play
paper: https://arxiv.org/pdf/2310.00746
configpath: opencompass/configs/datasets/rolebench
- s3eval:
name: S3Eval
category: Long Context
paper: https://aclanthology.org/2024.naacl-long.69.pdf
configpath: opencompass/configs/datasets/s3eval
- scibench:
name: SciBench
category: Reasoning
paper: https://sxkdz.github.io/files/publications/ICML/SciBench/SciBench.pdf
configpath: opencompass/configs/datasets/scibench
- scicode:
name: SciCode
category: Code
paper: https://arxiv.org/pdf/2407.13168
configpath: opencompass/configs/datasets/scicode
- simpleqa:
name: SimpleQA
category: Knowledge
paper: https://arxiv.org/pdf/2411.04368
configpath: opencompass/configs/datasets/SimpleQA
- siqa:
name: SocialIQA
category: Reasoning
paper: https://arxiv.org/pdf/1904.09728
configpath: opencompass/configs/datasets/siqa
- squad20:
name: SQuAD2.0
category: Understanding
paper: https://arxiv.org/pdf/1806.03822
configpath: opencompass/configs/datasets/squad20
- storycloze:
name: StoryCloze
category: Reasoning
paper: https://aclanthology.org/2022.emnlp-main.616.pdf
configpath: opencompass/configs/datasets/storycloze
- strategyqa:
name: StrategyQA
category: Reasoning
paper: https://arxiv.org/pdf/2101.02235
configpath: opencompass/configs/datasets/strategyqa
- summedits:
name: SummEdits
category: Language
paper: https://aclanthology.org/2023.emnlp-main.600.pdf
configpath: opencompass/configs/datasets/summedits
- summscreen:
name: SummScreen
category: Understanding
paper: https://arxiv.org/pdf/2104.07091v1
configpath: opencompass/configs/datasets/summscreen
- svamp:
name: SVAMP
category: Math
paper: https://aclanthology.org/2021.naacl-main.168.pdf
configpath: opencompass/configs/datasets/SVAMP
- tabmwp:
name: TabMWP
category: Math / Table
paper: https://arxiv.org/pdf/2209.14610
configpath: opencompass/configs/datasets/TabMWP
- taco:
name: TACO
category: Code
paper: https://arxiv.org/pdf/2312.14852
configpath: opencompass/configs/datasets/taco
- tnews:
name: FewCLUE / TNEWS
category: Understanding
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_tnews
- bustm:
name: FewCLUE / BUSTM
category: Reasoning
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_bustm
- csl:
name: FewCLUE / CSL
category: Understanding
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_csl
- ocnli_fc:
name: FewCLUE / OCNLI-FC
category: Reasoning
paper: https://arxiv.org/pdf/2107.07498
configpath: opencompass/configs/datasets/FewCLUE_ocnli_fc
- triviaqa:
name: TriviaQA
category: Knowledge
paper: https://arxiv.org/pdf/1705.03551v2
configpath: opencompass/configs/datasets/triviaqa
- triviaqarc:
name: TriviaQA-RC
category: Knowledge / Understanding
paper: ""
configpath: opencompass/configs/datasets/triviaqarc
- truthfulqa:
name: TruthfulQA
category: Safety
paper: https://arxiv.org/pdf/2109.07958v2
configpath: opencompass/configs/datasets/truthfulqa
- tydiqa:
name: TyDi-QA
category: Language
paper: https://storage.googleapis.com/tydiqa/tydiqa.pdf
configpath: opencompass/configs/datasets/tydiqa
- wic:
name: SuperGLUE / WiC
category: Language
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_WiC
- wsc:
name: SuperGLUE / WSC
category: Language / WSC
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_WSC
- winogrande:
name: WinoGrande
category: Language / WSC
paper: https://arxiv.org/pdf/1907.10641v2
configpath: opencompass/configs/datasets/winogrande
- xcopa:
name: XCOPA
category: Language
paper: https://arxiv.org/pdf/2005.00333
configpath: opencompass/configs/datasets/XCOPA
- xiezhi:
name: Xiezhi
category: Knowledge
paper: https://arxiv.org/pdf/2306.05783
configpath: opencompass/configs/datasets/xiezhi
- xlsum:
name: XLSum
category: Understanding
paper: https://arxiv.org/pdf/2106.13822v1
configpath: opencompass/configs/datasets/XLSum
- xsum:
name: Xsum
category: Understanding
paper: https://arxiv.org/pdf/1808.08745
configpath: opencompass/configs/datasets/Xsum
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.8"
formats:
- epub
sphinx:
configuration: docs/en/conf.py
python:
install:
- requirements: requirements/docs.txt
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.header-logo {
background-image: url("../image/logo.svg");
background-size: 275px 80px;
height: 80px;
width: 275px;
}
@media screen and (min-width: 1100px) {
.header-logo {
top: -25px;
}
}
pre {
white-space: pre;
}
@media screen and (min-width: 2000px) {
.pytorch-content-left {
width: 1200px;
margin-left: 30px;
}
article.pytorch-article {
max-width: 1200px;
}
.pytorch-breadcrumbs-wrapper {
width: 1200px;
}
.pytorch-right-menu.scrolling-fixed {
position: fixed;
top: 45px;
left: 1580px;
}
}
article.pytorch-article section code {
padding: .2em .4em;
background-color: #f3f4f7;
border-radius: 5px;
}
/* Disable the change in tables */
article.pytorch-article section table code {
padding: unset;
background-color: unset;
border-radius: unset;
}
table.autosummary td {
width: 50%
}
img.align-center {
display: block;
margin-left: auto;
margin-right: auto;
}
article.pytorch-article p.rubric {
font-weight: bold;
}
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 27.3.1, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<svg version="1.1" id="图层_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
viewBox="0 0 210 36" style="enable-background:new 0 0 210 36;" xml:space="preserve">
<style type="text/css">
.st0{fill:#5878B4;}
.st1{fill:#36569B;}
.st2{fill:#1B3882;}
</style>
<g id="_x33_">
<g>
<path class="st0" d="M16.5,22.6l-6.4,3.1l5.3-0.2L16.5,22.6z M12.3,33.6l1.1-2.9l-5.3,0.2L12.3,33.6z M21.6,33.3l6.4-3.1l-5.3,0.2
L21.6,33.3z M25.8,22.4l-1.1,2.9l5.3-0.2L25.8,22.4z M31.5,26.2l-7.1,0.2l-1.7-1.1l1.5-4L22.2,20L19,21.5l-1.5,3.9l-2.7,1.3
l-7.1,0.2l-3.2,1.5l2.1,1.4l7.1-0.2l0,0l1.7,1.1l-1.5,4L16,36l3.2-1.5l1.5-3.9l0,0l2.6-1.2l0,0l7.2-0.2l3.2-1.5L31.5,26.2z
M20.2,28.7c-1,0.5-2.3,0.5-3,0.1c-0.6-0.4-0.4-1.2,0.6-1.6c1-0.5,2.3-0.5,3-0.1C21.5,27.5,21.2,28.2,20.2,28.7z"/>
</g>
</g>
<g id="_x32_">
<g>
<path class="st1" d="M33.5,19.8l-1.3-6.5l-1.5,1.9L33.5,19.8z M27.5,5.1l-4.2-2.7L26,7L27.5,5.1z M20.7,5.7l1.3,6.5l1.5-1.9
L20.7,5.7z M26.8,20.4l4.2,2.7l-2.7-4.6L26.8,20.4z M34,22.3l-3.6-6.2l0,0l-0.5-2.7l2-2.6l-0.6-3.2l-2.1-1.4l-2,2.6l-1.7-1.1
l-3.7-6.3L19.6,0l0.6,3.2l3.7,6.3l0,0l0.5,2.6l0,0l-2,2.6l0.6,3.2l2.1,1.4l1.9-2.5l1.7,1.1l3.7,6.3l2.1,1.4L34,22.3z M27.5,14.6
c-0.6-0.4-1.3-1.6-1.5-2.6c-0.2-1,0.2-1.5,0.8-1.1c0.6,0.4,1.3,1.6,1.5,2.6C28.5,14.6,28.1,15.1,27.5,14.6z"/>
</g>
</g>
<g id="_x31_">
<g>
<path class="st2" d="M12,2.8L5.6,5.9l3.8,1.7L12,2.8z M1.1,14.4l1.3,6.5l2.6-4.8L1.1,14.4z M9.1,24l6.4-3.1l-3.8-1.7L9.1,24z
M20,12.4l-1.3-6.5l-2.6,4.8L20,12.4z M20.4,14.9l-5.1-2.3l0,0l-0.5-2.7l3.5-6.5l-0.6-3.2l-3.2,1.5L11,8.1L8.3,9.4l0,0L3.2,7.1
L0,8.6l0.6,3.2l5.2,2.3l0.5,2.7v0l-3.5,6.6l0.6,3.2l3.2-1.5l3.5-6.5l2.6-1.2l0,0l5.2,2.4l3.2-1.5L20.4,14.9z M10.9,15.2
c-1,0.5-1.9,0-2.1-1c-0.2-1,0.4-2.2,1.4-2.7c1-0.5,1.9,0,2.1,1C12.5,13.5,11.9,14.7,10.9,15.2z"/>
</g>
</g>
<path id="字" class="st2" d="M49.5,26.5c-2.5,0-4.4-0.7-5.7-2c-1.8-1.6-2.6-4-2.6-7.1c0-3.2,0.9-5.5,2.6-7.1c1.3-1.3,3.2-2,5.7-2
c2.5,0,4.4,0.7,5.7,2c1.7,1.6,2.6,4,2.6,7.1c0,3.1-0.9,5.5-2.6,7.1C53.8,25.8,51.9,26.5,49.5,26.5z M52.9,21.8
c0.8-1.1,1.3-2.6,1.3-4.5c0-1.9-0.4-3.4-1.3-4.5c-0.8-1.1-2-1.6-3.4-1.6c-1.4,0-2.6,0.5-3.4,1.6c-0.9,1.1-1.3,2.6-1.3,4.5
c0,1.9,0.4,3.4,1.3,4.5c0.9,1.1,2,1.6,3.4,1.6C50.9,23.4,52,22.9,52.9,21.8z M70.9,14.6c1,1.1,1.5,2.7,1.5,4.9c0,2.2-0.5,4-1.5,5.1
c-1,1.2-2.3,1.8-3.9,1.8c-1,0-1.9-0.3-2.5-0.8c-0.4-0.3-0.7-0.7-1.1-1.2V31h-3.3V13.2h3.2v1.9c0.4-0.6,0.7-1,1.1-1.3
c0.7-0.6,1.6-0.9,2.6-0.9C68.6,12.9,69.9,13.5,70.9,14.6z M69,19.6c0-1-0.2-1.9-0.7-2.6c-0.4-0.8-1.2-1.1-2.2-1.1
c-1.2,0-2,0.6-2.5,1.7c-0.2,0.6-0.4,1.4-0.4,2.3c0,1.5,0.4,2.5,1.2,3.1c0.5,0.4,1,0.5,1.7,0.5c0.9,0,1.6-0.4,2.1-1.1
C68.8,21.8,69,20.8,69,19.6z M85.8,22.2c-0.1,0.8-0.5,1.5-1.2,2.3c-1.1,1.2-2.6,1.9-4.6,1.9c-1.6,0-3.1-0.5-4.3-1.6
c-1.2-1-1.9-2.8-1.9-5.1c0-2.2,0.6-3.9,1.7-5.1c1.1-1.2,2.6-1.8,4.4-1.8c1.1,0,2,0.2,2.9,0.6c0.9,0.4,1.6,1,2.1,1.9
c0.5,0.8,0.8,1.6,1,2.6c0.1,0.6,0.1,1.4,0.1,2.5h-8.7c0,1.3,0.4,2.2,1.2,2.7c0.5,0.3,1,0.5,1.7,0.5c0.7,0,1.2-0.2,1.7-0.6
c0.2-0.2,0.4-0.5,0.6-0.9H85.8z M82.5,18.3c-0.1-0.9-0.3-1.6-0.8-2c-0.5-0.5-1.1-0.7-1.8-0.7c-0.8,0-1.4,0.2-1.8,0.7
c-0.4,0.5-0.7,1.1-0.8,2H82.5z M94.3,15.7c-1.1,0-1.9,0.5-2.3,1.4c-0.2,0.5-0.3,1.2-0.3,1.9V26h-3.3V13.2h3.2v1.9
c0.4-0.7,0.8-1.1,1.2-1.4c0.7-0.5,1.6-0.8,2.6-0.8c1.3,0,2.4,0.3,3.2,1c0.8,0.7,1.3,1.8,1.3,3.4V26h-3.4v-7.8c0-0.7-0.1-1.2-0.3-1.5
C95.8,16,95.2,15.7,94.3,15.7z M115.4,24.7c-1.3,1.2-2.9,1.8-4.9,1.8c-2.5,0-4.4-0.8-5.9-2.4c-1.4-1.6-2.1-3.8-2.1-6.6
c0-3,0.8-5.3,2.4-7c1.4-1.4,3.2-2.1,5.4-2.1c2.9,0,5,1,6.4,2.9c0.7,1.1,1.1,2.1,1.2,3.2h-3.6c-0.2-0.8-0.5-1.5-0.9-1.9
c-0.7-0.8-1.6-1.1-2.9-1.1c-1.3,0-2.3,0.5-3.1,1.6c-0.8,1.1-1.1,2.6-1.1,4.5s0.4,3.4,1.2,4.4c0.8,1,1.8,1.4,3.1,1.4
c1.3,0,2.2-0.4,2.9-1.2c0.4-0.4,0.7-1.1,0.9-2h3.6C117.5,22,116.7,23.5,115.4,24.7z M130.9,14.8c1.1,1.4,1.6,2.9,1.6,4.8
c0,1.9-0.5,3.5-1.6,4.8c-1.1,1.3-2.7,2-4.9,2c-2.2,0-3.8-0.7-4.9-2c-1.1-1.3-1.6-2.9-1.6-4.8c0-1.8,0.5-3.4,1.6-4.8
c1.1-1.4,2.7-2,4.9-2C128.2,12.8,129.9,13.5,130.9,14.8z M126,15.6c-1,0-1.7,0.3-2.3,1c-0.5,0.7-0.8,1.7-0.8,3c0,1.3,0.3,2.3,0.8,3
c0.5,0.7,1.3,1,2.3,1c1,0,1.7-0.3,2.3-1c0.5-0.7,0.8-1.7,0.8-3c0-1.3-0.3-2.3-0.8-3C127.7,16,127,15.6,126,15.6z M142.1,16.7
c-0.3-0.6-0.8-0.9-1.7-0.9c-1,0-1.6,0.3-1.9,0.9c-0.2,0.4-0.3,0.9-0.3,1.6V26h-3.4V13.2h3.2v1.9c0.4-0.7,0.8-1.1,1.2-1.4
c0.6-0.5,1.5-0.8,2.5-0.8c1,0,1.8,0.2,2.4,0.6c0.5,0.4,0.9,0.9,1.1,1.5c0.4-0.8,1-1.3,1.6-1.7c0.7-0.4,1.5-0.5,2.3-0.5
c0.6,0,1.1,0.1,1.7,0.3c0.5,0.2,1,0.6,1.5,1.1c0.4,0.4,0.6,1,0.7,1.6c0.1,0.4,0.1,1.1,0.1,1.9l0,8.1h-3.4v-8.1
c0-0.5-0.1-0.9-0.2-1.2c-0.3-0.6-0.8-0.9-1.6-0.9c-0.9,0-1.6,0.4-1.9,1.1c-0.2,0.4-0.3,0.9-0.3,1.5V26h-3.4v-7.6
C142.4,17.6,142.3,17.1,142.1,16.7z M167,14.6c1,1.1,1.5,2.7,1.5,4.9c0,2.2-0.5,4-1.5,5.1c-1,1.2-2.3,1.8-3.9,1.8
c-1,0-1.9-0.3-2.5-0.8c-0.4-0.3-0.7-0.7-1.1-1.2V31h-3.3V13.2h3.2v1.9c0.4-0.6,0.7-1,1.1-1.3c0.7-0.6,1.6-0.9,2.6-0.9
C164.7,12.9,166,13.5,167,14.6z M165.1,19.6c0-1-0.2-1.9-0.7-2.6c-0.4-0.8-1.2-1.1-2.2-1.1c-1.2,0-2,0.6-2.5,1.7
c-0.2,0.6-0.4,1.4-0.4,2.3c0,1.5,0.4,2.5,1.2,3.1c0.5,0.4,1,0.5,1.7,0.5c0.9,0,1.6-0.4,2.1-1.1C164.9,21.8,165.1,20.8,165.1,19.6z
M171.5,14.6c0.9-1.1,2.4-1.7,4.5-1.7c1.4,0,2.6,0.3,3.7,0.8c1.1,0.6,1.6,1.6,1.6,3.1v5.9c0,0.4,0,0.9,0,1.5c0,0.4,0.1,0.7,0.2,0.9
c0.1,0.2,0.3,0.3,0.5,0.4V26h-3.6c-0.1-0.3-0.2-0.5-0.2-0.7c0-0.2-0.1-0.5-0.1-0.8c-0.5,0.5-1,0.9-1.6,1.3c-0.7,0.4-1.5,0.6-2.4,0.6
c-1.2,0-2.1-0.3-2.9-1c-0.8-0.7-1.1-1.6-1.1-2.8c0-1.6,0.6-2.7,1.8-3.4c0.7-0.4,1.6-0.7,2.9-0.8l1.1-0.1c0.6-0.1,1.1-0.2,1.3-0.3
c0.5-0.2,0.7-0.5,0.7-0.9c0-0.5-0.2-0.9-0.6-1.1c-0.4-0.2-0.9-0.3-1.6-0.3c-0.8,0-1.3,0.2-1.7,0.6c-0.2,0.3-0.4,0.7-0.5,1.2h-3.2
C170.6,16.2,170.9,15.3,171.5,14.6z M173.9,23.6c0.3,0.3,0.7,0.4,1.1,0.4c0.7,0,1.4-0.2,2-0.6c0.6-0.4,0.9-1.2,0.9-2.3v-1.2
c-0.2,0.1-0.4,0.2-0.6,0.3c-0.2,0.1-0.5,0.2-0.9,0.2l-0.8,0.1c-0.7,0.1-1.2,0.3-1.5,0.5c-0.5,0.3-0.8,0.8-0.8,1.4
C173.5,22.9,173.6,23.3,173.9,23.6z M193.1,13.8c1,0.6,1.6,1.7,1.7,3.3h-3.3c0-0.4-0.2-0.8-0.4-1c-0.4-0.5-1-0.7-1.9-0.7
c-0.7,0-1.2,0.1-1.6,0.3c-0.3,0.2-0.5,0.5-0.5,0.8c0,0.4,0.2,0.7,0.5,0.8c0.3,0.2,1.5,0.5,3.5,0.9c1.3,0.3,2.3,0.8,3,1.4
c0.7,0.6,1,1.4,1,2.4c0,1.3-0.5,2.3-1.4,3.1c-0.9,0.8-2.4,1.2-4.4,1.2c-2,0-3.5-0.4-4.5-1.3c-1-0.9-1.4-1.9-1.4-3.2h3.4
c0.1,0.6,0.2,1,0.5,1.3c0.4,0.4,1.2,0.7,2.3,0.7c0.7,0,1.2-0.1,1.6-0.3c0.4-0.2,0.6-0.5,0.6-0.9c0-0.4-0.2-0.7-0.5-0.9
c-0.3-0.2-1.5-0.5-3.5-1c-1.4-0.4-2.5-0.8-3.1-1.3c-0.6-0.5-0.9-1.3-0.9-2.3c0-1.2,0.5-2.2,1.4-3c0.9-0.9,2.2-1.3,3.9-1.3
C190.8,12.9,192.1,13.2,193.1,13.8z M206.5,13.8c1,0.6,1.6,1.7,1.7,3.3h-3.3c0-0.4-0.2-0.8-0.4-1c-0.4-0.5-1-0.7-1.9-0.7
c-0.7,0-1.2,0.1-1.6,0.3c-0.3,0.2-0.5,0.5-0.5,0.8c0,0.4,0.2,0.7,0.5,0.8c0.3,0.2,1.5,0.5,3.5,0.9c1.3,0.3,2.3,0.8,3,1.4
c0.7,0.6,1,1.4,1,2.4c0,1.3-0.5,2.3-1.4,3.1c-0.9,0.8-2.4,1.2-4.4,1.2c-2,0-3.5-0.4-4.5-1.3c-1-0.9-1.4-1.9-1.4-3.2h3.4
c0.1,0.6,0.2,1,0.5,1.3c0.4,0.4,1.2,0.7,2.3,0.7c0.7,0,1.2-0.1,1.6-0.3c0.4-0.2,0.6-0.5,0.6-0.9c0-0.4-0.2-0.7-0.5-0.9
c-0.3-0.2-1.5-0.5-3.5-1c-1.4-0.4-2.5-0.8-3.1-1.3c-0.6-0.5-0.9-1.3-0.9-2.3c0-1.2,0.5-2.2,1.4-3c0.9-0.9,2.2-1.3,3.9-1.3
C204.2,12.9,205.5,13.2,206.5,13.8z"/>
</svg>
<?xml version="1.0" encoding="UTF-8"?>
<svg id="_图层_2" data-name="图层 2" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 34.59 36">
<defs>
<style>
.cls-1 {
fill: #36569b;
}
.cls-2 {
fill: #1b3882;
}
.cls-3 {
fill: #5878b4;
}
</style>
</defs>
<g id="_图层_1-2" data-name="图层 1">
<g>
<g id="_3" data-name="3">
<path class="cls-3" d="m16.53,22.65l-6.37,3.07,5.27-.16,1.1-2.91Zm-4.19,10.95l1.12-2.91-5.27.17,4.15,2.74Zm9.3-.29l6.37-3.07-5.27.16-1.1,2.91Zm4.19-10.95l-1.12,2.91,5.27-.17-4.15-2.74Zm5.72,3.81l-7.08.23-1.73-1.14,1.5-3.95-2.06-1.36-3.16,1.53-1.48,3.89-2.67,1.29-7.14.23-3.16,1.53,2.07,1.36,7.13-.23h0s1.69,1.11,1.69,1.11l-1.51,3.98,2.06,1.36,3.16-1.53,1.5-3.95h0s2.56-1.24,2.56-1.24h0s7.23-.24,7.23-.24l3.16-1.53-2.06-1.36Zm-11.29,2.56c-.99.48-2.31.52-2.96.1-.65-.42-.37-1.15.62-1.63.99-.48,2.31-.52,2.96-.1.65.42.37,1.15-.62,1.63Z"/>
</g>
<g id="_2" data-name="2">
<path class="cls-1" d="m33.5,19.84l-1.26-6.51-1.46,1.88,2.72,4.63Zm-6.05-14.69l-4.16-2.74,2.71,4.64,1.45-1.89Zm-6.73.58l1.26,6.51,1.46-1.88-2.72-4.63Zm6.05,14.69l4.16,2.74-2.71-4.64-1.45,1.89Zm7.19,1.91l-3.63-6.2h0s-.53-2.74-.53-2.74l1.96-2.56-.63-3.23-2.07-1.36-1.96,2.56-1.69-1.11-3.71-6.33-2.07-1.36.63,3.23,3.68,6.28h0s.51,2.62.51,2.62h0s-1.99,2.6-1.99,2.6l.63,3.23,2.06,1.36,1.95-2.54,1.73,1.14,3.69,6.29,2.07,1.36-.63-3.23Zm-6.47-7.7c-.65-.42-1.33-1.59-1.52-2.6-.2-1.01.17-1.49.81-1.06.65.42,1.33,1.59,1.52,2.6.2,1.01-.17,1.49-.81,1.06Z"/>
</g>
<g id="_1" data-name="1">
<path class="cls-2" d="m11.96,2.82l-6.37,3.07,3.81,1.74,2.55-4.81ZM1.07,14.37l1.26,6.53,2.56-4.8-3.82-1.73Zm7.99,9.59l6.37-3.07-3.81-1.74-2.55,4.81Zm10.89-11.55l-1.26-6.53-2.56,4.8,3.82,1.73Zm.45,2.53l-5.13-2.32h0s-.53-2.71-.53-2.71l3.47-6.53-.63-3.24-3.16,1.53-3.42,6.43-2.67,1.29h0s-5.17-2.34-5.17-2.34l-3.16,1.53.63,3.24,5.17,2.33.51,2.65h0s-3.49,6.57-3.49,6.57l.63,3.24,3.16-1.53,3.46-6.52,2.56-1.24h0s5.24,2.37,5.24,2.37l3.16-1.53-.63-3.24Zm-9.52.24c-.99.48-1.95.04-2.14-.97-.2-1.01.44-2.22,1.43-2.69.99-.48,1.95-.04,2.14.97.2,1.01-.44,2.22-1.43,2.7Z"/>
</g>
</g>
</g>
</svg>
var collapsedSections = ['Dataset Statistics'];
$(document).ready(function () {
$('.dataset').DataTable({
"stateSave": false,
"lengthChange": false,
"pageLength": 20,
"order": [],
"language": {
"info": "Show _START_ to _END_ Items(Totally _TOTAL_ )",
"infoFiltered": "(Filtered from _MAX_ Items)",
"search": "Search:",
"zeroRecords": "Item Not Found",
"paginate": {
"next": "Next",
"previous": "Previous"
},
}
});
});
{% extends "layout.html" %}
{% block body %}
<h1>Page Not Found</h1>
<p>
The page you are looking for cannot be found.
</p>
<p>
If you just switched documentation versions, the page you were on has likely been moved. You can look for it in
the table of contents on the left, or go to <a href="{{ pathto(root_doc) }}">the homepage</a>.
</p>
<!-- <p>
If you cannot find documentation you want, please <a
href="">open an issue</a> to tell us!
</p> -->
{% endblock %}
.. role:: hidden
:class: hidden-section
.. currentmodule:: {{ module }}
{{ name | underline}}
.. autoclass:: {{ name }}
:members:
..
autogenerated from _templates/autosummary/class.rst
note it does not have :inherited-members:
.. role:: hidden
:class: hidden-section
.. currentmodule:: {{ module }}
{{ name | underline}}
.. autoclass:: {{ name }}
:members:
:special-members: __call__
..
autogenerated from _templates/callable.rst
note it does not have :inherited-members:
# Accelerate Evaluation Inference with vLLM or LMDeploy
## Background
During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM or LMDeploy.
- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.
## Preparation for Acceleration
First, check whether the model you want to evaluate supports inference acceleration using vLLM or LMDeploy. Additionally, ensure you have installed vLLM or LMDeploy as per their official documentation. Below are the installation methods for reference:
### LMDeploy Installation Method
Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):
```bash
pip install lmdeploy
```
### vLLM Installation Method
Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
```bash
pip install vllm
```
## Accelerated Evaluation Using vLLM or LMDeploy
### Method 1: Using Command Line Parameters to Change the Inference Backend
OpenCompass offers one-click evaluation acceleration. During evaluation, it can automatically convert Huggingface transformers models to vLLM or LMDeploy models. Below is an example configuration for evaluating the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model:
```python
# eval_gsm8k.py
from mmengine.config import read_base
with read_base():
# Select a dataset list
from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
# Select an interested model
    from .models.hf_llama.hf_llama3_8b_instruct import models
```
Here, `hf_llama3_8b_instruct` specifies the original Huggingface model configuration, as shown below:
```python
from opencompass.models import HuggingFacewithChatTemplate
models = [
dict(
type=HuggingFacewithChatTemplate,
abbr='llama-3-8b-instruct-hf',
path='meta-llama/Meta-Llama-3-8B-Instruct',
max_out_len=1024,
batch_size=8,
run_cfg=dict(num_gpus=1),
stop_words=['<|end_of_text|>', '<|eot_id|>'],
)
]
```
To evaluate the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model, use:
```bash
python run.py config/eval_gsm8k.py
```
To accelerate the evaluation using vLLM or LMDeploy, you can use the following script:
```bash
python run.py config/eval_gsm8k.py -a vllm
```
or
```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```
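If you prefer to pin the accelerated backend in the configuration itself rather than pass `-a` on the command line, a vLLM-backed model can also be declared explicitly. Below is a minimal sketch assuming the `VLLMwithChatTemplate` wrapper exported by `opencompass.models`; the exact class name and arguments may differ across OpenCompass versions, so treat it as a starting point rather than a definitive recipe.
```python
from mmengine.config import read_base
from opencompass.models import VLLMwithChatTemplate  # assumed wrapper; check your installed version

with read_base():
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='llama-3-8b-instruct-vllm',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        model_kwargs=dict(tensor_parallel_size=1),  # forwarded to vllm.LLM
        max_out_len=1024,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )
]
```
With such a config, `python run.py config/eval_gsm8k.py` would use vLLM directly, without the `-a` flag.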
### Method 2: Accelerating Evaluation via Deployed Inference Acceleration Service API
OpenCompass also supports accelerating evaluation by deploying vLLM or LMDeploy inference acceleration service APIs. Follow these steps:
1. Install the openai package:
```bash
pip install openai
```
2. Deploy the inference acceleration service API for vLLM or LMDeploy. Below is an example for LMDeploy:
```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```
The parameters for starting the `api_server` can be checked with `lmdeploy serve api_server -h`, for example `--tp` for tensor parallelism, `--session-len` for the maximum context window length, and `--cache-max-entry-count` for adjusting the k/v cache memory usage ratio.
3. Once the service is successfully deployed, modify the evaluation script by changing the model configuration path to the service address, as shown below:
```python
from opencompass.models import OpenAISDK
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
models = [
dict(
abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
type=OpenAISDK,
key='EMPTY', # API key
openai_api_base='http://0.0.0.0:23333/v1', # Service address
path='Meta-Llama-3-8B-Instruct', # Model name for service request
tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', # The tokenizer name or path, if set to `None`, uses the default `gpt-4` tokenizer
rpm_verbose=True, # Whether to print request rate
meta_template=api_meta_template, # Service request template
query_per_second=1, # Service request rate
max_out_len=1024, # Maximum output length
max_seq_len=4096, # Maximum input length
temperature=0.01, # Generation temperature
batch_size=8, # Batch size
retry=3, # Number of retries
)
]
```
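Before launching the evaluation, it can help to confirm that the `path` above matches a model name actually served by the API. Below is a quick check against the OpenAI-compatible endpoint, a sketch assuming an `openai>=1.0` client and the LMDeploy server started in step 2:
```python
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='EMPTY')
# the `path` field in the model config must match one of the served model names
print([m.id for m in client.models.list().data])
```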
## Acceleration Effect and Performance Comparison
Below is a comparison of the acceleration effect and performance when using vLLM or LMDeploy on a single A800 GPU to evaluate the Llama-3-8B-Instruct model on the GSM8k dataset:
| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------------- | -------- | -------------------------------- | --------------------------------- |
| Huggingface | 74.22 | 24:26 | 1.0 |
| LMDeploy | 73.69 | 11:15 | 2.2 |
| vLLM              | 72.63    | 07:52                            | 3.1                               |
# CircularEval
## Background
For multiple-choice questions, a large language model (LLM) picking the correct option does not necessarily imply true understanding and reasoning; it could simply be a guess. To differentiate these scenarios and reduce the LLM's bias toward particular option positions, CircularEval can be used. A multiple-choice question is augmented by shuffling its options, and the question is considered correct under CircularEval only if the LLM answers all of the augmented variations correctly.
## Adding Your Own CircularEval Dataset
Generally, to evaluate a dataset using CircularEval, both its loading and evaluation methods need to be rewritten. Modifications are required in both the OpenCompass main library and configuration files. We will use C-Eval as an example for explanation.
OpenCompass main library:
```python
from opencompass.datasets.ceval import CEvalDataset
from opencompass.datasets.circular import CircularDatasetMeta
class CircularCEvalDataset(CEvalDataset, metaclass=CircularDatasetMeta):
# The overloaded dataset class
dataset_class = CEvalDataset
# Splits of the DatasetDict that need CircularEval. For CEvalDataset, which loads [dev, val, test], we only need 'val' and 'test' for CircularEval, not 'dev'
default_circular_splits = ['val', 'test']
# List of keys to be shuffled
default_option_keys = ['A', 'B', 'C', 'D']
# If the content of 'answer_key' is one of ['A', 'B', 'C', 'D'], representing the correct answer. This field indicates how to update the correct answer after shuffling options. Choose either this or default_answer_key_switch_method
default_answer_key = 'answer'
# If 'answer_key' content is not one of ['A', 'B', 'C', 'D'], a function can be used to specify the correct answer after shuffling options. Choose either this or default_answer_key
# def default_answer_key_switch_method(item, circular_pattern):
# # 'item' is the original data item
# # 'circular_pattern' is a tuple indicating the order after shuffling options, e.g., ('D', 'A', 'B', 'C') means the original option A is now D, and so on
# item['answer'] = circular_pattern['ABCD'.index(item['answer'])]
# return item
```
`CircularCEvalDataset` accepts the `circular_pattern` parameter with two values:
- `circular`: Indicates a single cycle. It is the default value. ABCD is expanded to ABCD, BCDA, CDAB, DABC, a total of 4 variations.
- `all_possible`: Indicates all permutations. ABCD is expanded to ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, ..., a total of 24 variations.
Additionally, we provide a `CircularEvaluator` to replace `AccEvaluator`. This Evaluator also accepts `circular_pattern`, and it should be consistent with the above. It produces the following metrics:
- `acc_{origin|circular|all_possible}`: Accuracy computed by treating each shuffled variant of a question as a separate question.
- `perf_{origin|circular|all_possible}`: Accuracy under the Circular logic, where a question counts as correct only if all of its shuffled variants are answered correctly.
- `more_{num}_{origin|circular|all_possible}`: Accuracy under the Circular logic, where a question counts as correct if at least `num` of its shuffled variants are answered correctly.
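To make the `circular_pattern` values and the `perf_*` metric concrete, here is a small self-contained sketch in plain Python (an illustration of the logic described above, not OpenCompass's internal implementation):
```python
from itertools import permutations

def expand_patterns(options=('A', 'B', 'C', 'D'), mode='circular'):
    """Return the option orders a question is expanded into."""
    if mode == 'circular':        # rotations only: 4 variants for ABCD
        return [options[i:] + options[:i] for i in range(len(options))]
    if mode == 'all_possible':    # every permutation: 24 variants for ABCD
        return list(permutations(options))
    raise ValueError(mode)

def perf_metric(per_question_results):
    """per_question_results: one list of booleans per question, one boolean per variant.
    A question only counts as correct if every variant is answered correctly."""
    return sum(all(variants) for variants in per_question_results) / len(per_question_results)

print(expand_patterns(mode='circular'))          # ABCD, BCDA, CDAB, DABC
print(perf_metric([[True, True, True, True],
                   [True, False, True, True]]))  # 0.5
```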
OpenCompass configuration file:
```python
from mmengine.config import read_base
from opencompass.datasets.circular import CircularCEvalDataset, CircularEvaluator
with read_base():
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
for d in ceval_datasets:
# Overloading the load method
d['type'] = CircularCEvalDataset
# Renaming for differentiation from non-circular evaluation versions
d['abbr'] = d['abbr'] + '-circular-4'
# Overloading the evaluation method
d['eval_cfg']['evaluator'] = {'type': CircularEvaluator}
# The dataset after the above operations looks like this:
# dict(
# type=CircularCEvalDataset,
# path='./data/ceval/formal_ceval', # Unchanged
# name='computer_network', # Unchanged
# abbr='ceval-computer_network-circular-4',
# reader_cfg=dict(...), # Unchanged
# infer_cfg=dict(...), # Unchanged
# eval_cfg=dict(evaluator=dict(type=CircularEvaluator), ...),
# )
```
Additionally, for better presentation of results in CircularEval, consider using the following summarizer:
```python
from mmengine.config import read_base
from opencompass.summarizers import CircularSummarizer
with read_base():
    from ...summarizers.groups.ceval import ceval_summary_groups
new_summary_groups = []
for item in ceval_summary_groups:
new_summary_groups.append(
{
'name': item['name'] + '-circular-4',
'subsets': [i + '-circular-4' for i in item['subsets']],
}
)
summarizer = dict(
type=CircularSummarizer,
# Select specific metrics to view
metric_types=['acc_origin', 'perf_circular'],
dataset_abbrs = [
'ceval-circular-4',
'ceval-humanities-circular-4',
'ceval-stem-circular-4',
'ceval-social-science-circular-4',
'ceval-other-circular-4',
],
summary_groups=new_summary_groups,
)
```
For more complex evaluation examples, refer to this sample code: https://github.com/open-compass/opencompass/tree/main/configs/eval_circular.py
# Code Evaluation Tutorial
This tutorial primarily focuses on evaluating a model's coding proficiency, using `humaneval` and `mbpp` as examples.
## pass@1
If you only need to generate a single response to evaluate the pass@1 performance, you can directly use [configs/datasets/humaneval/humaneval_gen_8e312c.py](https://github.com/open-compass/opencompass/blob/main/configs/datasets/humaneval/humaneval_gen_8e312c.py) and [configs/datasets/mbpp/deprecated_mbpp_gen_1e1056.py](https://github.com/open-compass/opencompass/blob/main/configs/datasets/mbpp/deprecated_mbpp_gen_1e1056.py), referring to the general [quick start tutorial](../get_started/quick_start.md).
For multilingual evaluation, please refer to the [Multilingual Code Evaluation Tutorial](./code_eval_service.md).
## pass@k
If you need to generate multiple responses for a single example to evaluate the pass@k performance, consider the following two situations. Here we take 10 responses as an example:
### Typical Situation
For most models that support the `num_return_sequences` parameter in HF's generation, we can use it directly to obtain multiple responses. Refer to the following configuration file:
```python
from mmengine.config import read_base
from opencompass.datasets import MBPPDatasetV2, MBPPPassKEvaluator
from opencompass.models import HuggingFaceCausalLM
with read_base():
from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
from .datasets.mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets
mbpp_datasets[0]['type'] = MBPPDatasetV2
mbpp_datasets[0]['eval_cfg']['evaluator']['type'] = MBPPPassKEvaluator
mbpp_datasets[0]['reader_cfg']['output_column'] = 'test_column'
datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets
models = [
dict(
type=HuggingFaceCausalLM,
...,
generation_kwargs=dict(
num_return_sequences=10,
do_sample=True,
top_p=0.95,
temperature=0.8,
),
...,
)
]
```
For `mbpp`, the dataset loading and evaluation need to change, so we modify the `type`, `eval_cfg.evaluator.type`, and `reader_cfg.output_column` fields accordingly.
We also need randomness in the model responses, so the `generation_kwargs` parameter must be set; in particular, `num_return_sequences` controls how many responses are generated per example.
Note: `num_return_sequences` must be greater than or equal to k, as pass@k itself is a probability estimate.
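The requirement that the number of generated samples is at least k comes from the unbiased pass@k estimator introduced with HumanEval. A minimal sketch of that formula is shown here only for intuition; it is not necessarily the exact implementation used by the evaluators above:
```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n generated samples, c of them pass, with k <= n."""
    if n - c < k:  # not enough failing samples to fill a size-k subset -> estimate is 1
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3
```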
For a complete example, refer to the configuration file [configs/eval_code_passk.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_code_passk.py).
### For Models That Do Not Support Multiple Responses
This applies to some HF models whose generation APIs are limited or poorly designed. In this case, we repeat the dataset itself to obtain multiple responses per example. Refer to the following configuration:
```python
from mmengine.config import read_base
from opencompass.datasets import MBPPDatasetV2, MBPPPassKEvaluator
from opencompass.models import HuggingFaceCausalLM
with read_base():
from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
from .datasets.mbpp.deprecated_mbpp_gen_1e1056 import mbpp_datasets
humaneval_datasets[0]['abbr'] = 'openai_humaneval_pass10'
humaneval_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['abbr'] = 'mbpp_pass10'
mbpp_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['type'] = MBPPDatasetV2
mbpp_datasets[0]['eval_cfg']['evaluator']['type'] = MBPPPassKEvaluator
mbpp_datasets[0]['reader_cfg']['output_column'] = 'test_column'
datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets
models = [
dict(
type=HuggingFaceCausalLM,
...,
generation_kwargs=dict(
do_sample=True,
top_p=0.95,
temperature=0.8,
),
...,
)
]
```
Since the dataset's prompts are unchanged, we only need to adjust the corresponding fields to repeat the dataset.
You need to modify these fields:
- `num_repeats`: the number of times the dataset is repeated
- `abbr`: It is best to change the dataset abbreviation to reflect the number of repetitions, because the dataset size changes and a mismatch with the values cached in `.cache/dataset_size.json` could otherwise cause problems.
For `mbpp`, modify the `type`, `eval_cfg.evaluator.type`, `reader_cfg.output_column` fields as well.
We also need randomness in the model responses, so the `generation_kwargs` parameter must be set.
For a complete example, refer to the configuration file [configs/eval_code_passk_repeat_dataset.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_code_passk_repeat_dataset.py).
# Code Evaluation Docker Tutorial
To evaluate an LLM's coding ability safely, we need a separate evaluation environment so that erroneous generated code is never executed in the development environment, where it could cause damage. The code evaluation service currently used by OpenCompass is based on the [code-evaluator](https://github.com/open-compass/code-evaluator) project. The following sections introduce evaluation tutorials built around this code evaluation service.
1. humaneval-x
This is a multi-programming language dataset [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x).
You can download the dataset from this [download link](https://github.com/THUDM/CodeGeeX2/tree/main/benchmark/humanevalx). Please download the language file (××.jsonl.gz) that needs to be evaluated and place it in the `./data/humanevalx` folder.
The currently supported languages are `python`, `cpp`, `go`, `java`, `js`.
2. DS1000
This is a Python dataset covering multiple algorithm libraries: [ds1000](https://github.com/xlang-ai/DS-1000).
You can download the dataset from this [download link](https://github.com/xlang-ai/DS-1000/blob/main/ds1000_data.zip).
The currently supported algorithm libraries are `Pandas`, `Numpy`, `Tensorflow`, `Scipy`, `Sklearn`, `Pytorch`, `Matplotlib`.
## Launching the Code Evaluation Service
1. Ensure you have installed Docker, please refer to [Docker installation document](https://docs.docker.com/engine/install/).
2. Pull the source code of the code evaluation service project and build the Docker image.
Choose the dockerfile corresponding to the dataset you need, and replace `humanevalx` or `ds1000` in the command below.
```shell
git clone https://github.com/open-compass/code-evaluator.git
docker build -t code-eval-{your-dataset}:latest -f docker/{your-dataset}/Dockerfile .
```
3. Create a container with the following commands:
```shell
# Log output format
docker run -it -p 5000:5000 code-eval-{your-dataset}:latest python server.py
# Run the program in the background
# docker run -itd -p 5000:5000 code-eval-{your-dataset}:latest python server.py
# Using different ports
# docker run -itd -p 5001:5001 code-eval-{your-dataset}:latest python server.py --port 5001
```
**Note:**
- If you encounter a timeout during the evaluation of Go, please use the following command when creating the container.
```shell
docker run -it -p 5000:5000 -e GO111MODULE=on -e GOPROXY=https://goproxy.io code-eval-{your-dataset}:latest python server.py
```
4. To ensure you have access to the service, use the following commands to check the connection between the inference environment and the evaluation service. (If both inference and code evaluation run on the same host, skip this step.)
```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
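If `ping` or `telnet` is unavailable in your environment, the same reachability check can be done with a short Python snippet using only the standard library (replace the placeholder host and port with your service address):
```python
import socket

host, port = 'your_service_ip_address', 5000  # replace with your service address and port

try:
    with socket.create_connection((host, port), timeout=5):
        print(f'{host}:{port} is reachable')
except OSError as err:
    print(f'cannot reach {host}:{port}: {err}')
```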
## Local Code Evaluation
When the model inference and code evaluation services are running on the same host or within the same local area network, code inference and evaluation can be performed directly. **Note: DS1000 is currently not supported; please proceed with remote evaluation.**
### Configuration File
We provide [the configuration file](https://github.com/open-compass/opencompass/blob/main/configs/eval_codegeex2.py) for evaluating `codegeex2` on `humanevalx` as a reference.
The dataset and related post-processing configuration files can be found at this [link](https://github.com/open-compass/opencompass/tree/main/configs/datasets/humanevalx); pay attention to the `evaluator` field in `humanevalx_eval_cfg_dict`.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator
humanevalx_reader_cfg = dict(
input_columns=['prompt'], output_column='task_id', train_split='test')
humanevalx_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template='{prompt}'),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=1024))
humanevalx_eval_cfg_dict = {
lang : dict(
evaluator=dict(
type=HumanevalXEvaluator,
language=lang,
ip_address="localhost", # replace to your code_eval_server ip_address, port
port=5000), # refer to https://github.com/open-compass/code-evaluator to launch a server
pred_role='BOT')
for lang in ['python', 'cpp', 'go', 'java', 'js'] # do not support rust now
}
humanevalx_datasets = [
dict(
type=HumanevalXDataset,
abbr=f'humanevalx-{lang}',
language=lang,
path='./data/humanevalx',
reader_cfg=humanevalx_reader_cfg,
infer_cfg=humanevalx_infer_cfg,
eval_cfg=humanevalx_eval_cfg_dict[lang])
for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
### Task Launch
Refer to the [Quick Start](../get_started.html)
## Remote Code Evaluation
When the model inference and code evaluation services are located on different machines that cannot reach each other directly, model inference needs to be run first, and the code evaluation results are collected afterwards. The configuration file and inference process from the previous tutorial can be reused.
### Collect Inference Results (Only for Humanevalx)
OpenCompass provides the script `collect_code_preds.py` in its `tools` folder to process and collect the inference results; pass it the task launch configuration file and specify the working directory of the corresponding task.
The `-r` option works the same way as in `run.py`. More details can be found in the [documentation](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html#launching-evaluation).
```shell
python tools/collect_code_preds.py [config] [-r latest]
```
The collected results will be organized as follows under the `-r` folder:
```
workdir/humanevalx
├── codegeex2-6b
│   ├── humanevalx_cpp.json
│   ├── humanevalx_go.json
│   ├── humanevalx_java.json
│   ├── humanevalx_js.json
│   └── humanevalx_python.json
├── CodeLlama-13b
│   ├── ...
├── CodeLlama-13b-Instruct
│   ├── ...
├── CodeLlama-13b-Python
│   ├── ...
├── ...
```
For DS1000, you just need to obtain the corresponding prediction file generated by `opencompass`.
### Code Evaluation
Make sure your code evaluation service is started, and use `curl` to request:
#### The following only supports Humanevalx
```shell
curl -X POST -F 'file=@{result_absolute_path}' -F 'dataset={dataset/language}' {your_service_ip_address}:{your_service_port}/evaluate
```
For example:
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' localhost:5000/evaluate
```
Then we get a response like:
```
"{\"pass@1\": 37.19512195121951%}"
```
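If you prefer Python over `curl`, the same request can be sent with `requests`, assuming the service exposes the `/evaluate` endpoint exactly as in the `curl` example above (the file path and dataset name below are placeholders):
```python
import requests

url = 'http://localhost:5000/evaluate'  # {your_service_ip_address}:{your_service_port}/evaluate
with open('./examples/humanevalx/python.json', 'rb') as f:
    resp = requests.post(url, files={'file': f}, data={'dataset': 'humanevalx/python'})
print(resp.status_code, resp.text)  # e.g. 200 "{\"pass@1\": ...}"
```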
Additionally, we offer an extra option named `with_prompt` (defaults to `True`), since some models (like `WizardCoder`) generate complete code without requiring the prompt to be concatenated with the prediction. You may refer to the following command for evaluation.
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' -H 'with-prompt: False' localhost:5000/evaluate
```
#### The following only supports DS1000
Make sure the code evaluation service is started, then use `curl` to submit a request:
```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' localhost:5000/evaluate
```
DS1000 supports additional debug parameters. Be aware that a large amount of log output will be generated when this is turned on:
- `full`: Additional print out of the original prediction for each error sample, post-processing prediction, running program, and final error.
- `half`: Additional print out of the running program and final error for each error sample.
- `error`: Additional print out of the final error for each error sample.
```shell
curl -X POST -F 'file=@./internlm-chat-7b-hf-v11/ds1000_Numpy.json' -F 'debug=error' localhost:5000/evaluate
```
You can also modify the `num_workers` in the same way to control the degree of parallelism.
## Advanced Tutorial
Besides evaluating the datasets already supported, users might also need to:
### Support New Dataset
Please refer to the [tutorial on supporting new datasets](./new_dataset.md).
### Modify Post-Processing
1. For local evaluation, follow the post-processing section in the tutorial on supporting new datasets to modify the post-processing method.
2. For remote evaluation, please modify the post-processing part in the tool's `collect_code_preds.py`.
3. Some parts of post-processing could also be modified in the code evaluation service, more information will be available in the next section.
### Debugging Code Evaluation Service
When supporting new datasets or modifying post-processors, it is possible that modifications need to be made to the original code evaluation service. Please make changes based on the following steps:
1. Remove the installation of `code-evaluator` from the `Dockerfile`, and instead mount `code-evaluator` when starting the container:
```shell
docker run -it -p 5000:5000 -v /local/path/of/code-evaluator:/workspace/code-evaluator code-eval:latest bash
```
2. Install and start the code evaluation service locally. At this point, any necessary modifications can be made to the local copy of the `code-evaluator`.
```shell
cd code-evaluator && pip install -r requirements.txt
python server.py
```
# Data Contamination Assessment
**Data Contamination** refers to the phenomenon where data intended for downstream testing tasks appear in the training data of large language models (LLMs), resulting in artificially inflated performance metrics in downstream tasks (such as summarization, natural language inference, text classification), which do not accurately reflect the model's true generalization capabilities.
Since the source of data contamination lies in the training data used by LLMs, the most direct way to detect data contamination is to intersect the test data with the training data and report the extent of overlap between the two. The classic GPT-3 [paper](https://arxiv.org/pdf/2005.14165.pdf) reported on this in Table C.1.
However, today's open-source community often only publishes model parameters, not training datasets. In such cases, how to determine the presence and extent of data contamination remains unsolved. OpenCompass offers two possible solutions.
## Contamination Data Annotation Based on Self-Built Co-Distribution Data
Referencing the method mentioned in Section 5.2 of [Skywork](https://arxiv.org/pdf/2310.19341.pdf), we directly used the dataset [mock_gsm8k_test](https://huggingface.co/datasets/Skywork/mock_gsm8k_test) uploaded to HuggingFace by Skywork.
In this method, the authors used GPT-4 to synthesize data similar to the original GSM8K style, and then calculated the perplexity on the GSM8K training set (train), GSM8K test set (test), and GSM8K reference set (ref). Since the GSM8K reference set was newly generated, the authors considered it as clean, not belonging to any training set of any model. They posited:
- If the test set's perplexity is significantly lower than the reference set's, the test set might have appeared in the model's training phase;
- If the training set's perplexity is significantly lower than the test set's, the training set might have been overfitted by the model.
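The decision rule above can be summarized in a few lines of plain Python. This is only an illustration of the logic; the margin is a hypothetical choice and is not prescribed by Skywork or OpenCompass:
```python
def contamination_verdict(ppl_train, ppl_test, ppl_ref, margin=0.1):
    """Compare average perplexities of the train/test splits against the clean reference split."""
    if ppl_test < ppl_ref - margin:
        return 'test split likely seen in training (possible contamination)'
    if ppl_train < ppl_test - margin:
        return 'train split likely overfitted by the model'
    return 'no obvious contamination signal'

# hypothetical perplexity values, just to show the rule
print(contamination_verdict(ppl_train=0.9, ppl_test=1.4, ppl_ref=1.38))
```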
The following configuration file can be referenced:
```python
from mmengine.config import read_base
with read_base():
from .datasets.gsm8k_contamination.gsm8k_contamination_ppl_ecdd22 import gsm8k_datasets # includes training, test, and reference sets
from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model # model under review
from .models.yi.hf_yi_6b import models as hf_yi_6b_model
datasets = [*gsm8k_datasets]
models = [*hf_qwen_7b_model, *hf_yi_6b_model]
```
An example output is as follows:
```text
dataset version metric mode internlm-7b-hf qwen-7b-hf yi-6b-hf chatglm3-6b-base-hf qwen-14b-hf baichuan2-13b-base-hf internlm-20b-hf aquila2-34b-hf ...
--------------- --------- ----------- ------- ---------------- ------------ ---------- --------------------- ------------- ----------------------- ----------------- ---------------- ...
gsm8k-train-ppl 0b8e46 average_ppl unknown 1.5 0.78 1.37 1.16 0.5 0.76 1.41 0.78 ...
gsm8k-test-ppl 0b8e46 average_ppl unknown 1.56 1.33 1.42 1.3 1.15 1.13 1.52 1.16 ...
gsm8k-ref-ppl f729ba average_ppl unknown 1.55 1.2 1.43 1.35 1.27 1.19 1.47 1.35 ...
```
Currently, this solution only supports the GSM8K dataset. We welcome the community to contribute more datasets.
Consider citing the following papers if you find this helpful:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Contamination Data Annotation Based on Classic Pre-trained Sets
Thanks to [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) and @liyucheng09 for providing this method.
In this method, the authors search the test datasets (such as C-Eval, ARC, HellaSwag, etc.) using the Common Crawl database and Bing search engine, then mark each test sample as clean / question contaminated / both question and answer contaminated.
During testing, OpenCompass will report the accuracy or perplexity of C-Eval on the subsets corresponding to these three labels. Generally, accuracy increases in the order: clean, question contaminated, both question and answer contaminated. The authors believe:
- If the performance of the three is relatively close, the contamination level of the model on that test set is light; otherwise, it is heavy.
The following configuration file can be referenced [link](https://github.com/open-compass/opencompass/blob/main/configs/eval_contamination.py):
```python
from mmengine.config import read_base
with read_base():
from .datasets.ceval.ceval_clean_ppl import ceval_datasets # ceval dataset with contamination tags
from .models.yi.hf_yi_6b import models as hf_yi_6b_model # model under review
from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model
from .summarizers.contamination import ceval_summarizer as summarizer # output formatting
datasets = [*ceval_datasets]
models = [*hf_yi_6b_model, *hf_qwen_7b_model]
```
An example output is as follows:
```text
dataset version mode yi-6b-hf - - qwen-7b-hf - - ...
---------------------------------------------- --------- ------ ---------------- ----------------------------- --------------------------------------- ---------------- ----------------------------- --------------------------------------- ...
- - - accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated ...
...
ceval-humanities - ppl 74.42 75.00 82.14 67.44 50.00 70.54 ...
ceval-stem - ppl 53.70 57.14 85.61 47.41 52.38 67.63 ...
ceval-social-science - ppl 81.60 84.62 83.09 76.00 61.54 72.79 ...
ceval-other - ppl 72.31 73.91 75.00 58.46 39.13 61.88 ...
ceval-hard - ppl 44.35 37.50 70.00 41.13 25.00 30.00 ...
ceval - ppl 67.32 71.01 81.17 58.97 49.28 67.82 ...
```
Currently, this solution only supports C-Eval, MMLU, HellaSwag, and ARC. [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) also includes CSQA and WinoGrande, but these have not yet been implemented in OpenCompass. We welcome the community to contribute more datasets.
Consider citing the following papers if you find this helpful:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@article{Li2023AnOS,
title={An Open Source Data Contamination Report for Llama Series Models},
author={Yucheng Li},
journal={ArXiv},
year={2023},
volume={abs/2310.17589},
url={https://api.semanticscholar.org/CorpusID:264490711}
}
```
# Custom Dataset Tutorial
This tutorial is intended for temporary and informal use of datasets. If the dataset requires long-term use or has specific needs for custom reading/inference/evaluation, it is strongly recommended to implement it according to the methods described in [new_dataset.md](./new_dataset.md).
In this tutorial, we will introduce how to test a new dataset without implementing a config or modifying the OpenCompass source code. We support two types of tasks: multiple choice (`mcq`) and question & answer (`qa`). For `mcq`, both ppl and gen inferences are supported; for `qa`, gen inference is supported.
## Dataset Format
We support datasets in both `.jsonl` and `.csv` formats.
### Multiple Choice (`mcq`)
For `mcq` datasets, the default fields are as follows:
- `question`: The stem of the multiple-choice question.
- `A`, `B`, `C`, ...: Single uppercase letters representing the options, with no limit on the number. Consecutive letters starting from `A` are parsed as options by default.
- `answer`: The correct answer to the multiple-choice question, which must be one of the options used above, such as `A`, `B`, `C`, etc.
Non-default fields will be read in but are not used by default. To use them, specify them in the `.meta.json` file.
An example of the `.jsonl` format:
```jsonl
{"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B"}
{"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A"}
{"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B"}
{"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C"}
```
An example of the `.csv` format:
```csv
question,A,B,C,answer
127+545+588+620+556+199=,2632,2635,2645,B
735+603+102+335+605=,2376,2380,2410,B
506+346+920+451+910+142+659+850=,4766,4774,4784,C
504+811+870+445=,2615,2630,2750,B
```
### Question & Answer (`qa`)
For `qa` datasets, the default fields are as follows:
- `question`: The stem of the question & answer question.
- `answer`: The correct answer to the question & answer question. It can be missing, indicating the dataset has no correct answer.
Non-default fields will be read in but are not used by default. To use them, specify them in the `.meta.json` file.
An example of the `.jsonl` format:
```jsonl
{"question": "752+361+181+933+235+986=", "answer": "3448"}
{"question": "712+165+223+711=", "answer": "1811"}
{"question": "921+975+888+539=", "answer": "3323"}
{"question": "752+321+388+643+568+982+468+397=", "answer": "4519"}
```
An example of the `.csv` format:
```csv
question,answer
123+147+874+850+915+163+291+604=,3967
149+646+241+898+822+386=,3142
332+424+582+962+735+798+653+214=,4700
649+215+412+495+220+738+989+452=,4170
```
## Command Line List
Custom datasets can be evaluated directly from the command line.
```bash
python run.py \
--models hf_llama2_7b \
--custom-dataset-path xxx/test_mcq.csv \
--custom-dataset-data-type mcq \
--custom-dataset-infer-method ppl
```
```bash
python run.py \
--models hf_llama2_7b \
--custom-dataset-path xxx/test_qa.jsonl \
--custom-dataset-data-type qa \
--custom-dataset-infer-method gen
```
In most cases, `--custom-dataset-data-type` and `--custom-dataset-infer-method` can be omitted. OpenCompass will
set them based on the following logic:
- If options like `A`, `B`, `C`, etc., can be parsed from the dataset file, it is considered an `mcq` dataset; otherwise, it is considered a `qa` dataset.
- The default `infer_method` is `gen`.
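For intuition, the detection rule roughly corresponds to the following sketch (an illustration only, not OpenCompass's actual parsing code):
```python
import string

def guess_data_type(fieldnames):
    """mcq if consecutive single-letter option columns starting from 'A' exist, otherwise qa."""
    options = []
    for letter in string.ascii_uppercase:
        if letter in fieldnames:
            options.append(letter)
        else:
            break
    return ('mcq', options) if options else ('qa', [])

print(guess_data_type(['question', 'A', 'B', 'C', 'answer']))  # ('mcq', ['A', 'B', 'C'])
print(guess_data_type(['question', 'answer']))                 # ('qa', [])
```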
## Configuration File
In the original configuration file, simply add a new item to the `datasets` variable. Custom datasets can be mixed with regular datasets.
```python
datasets = [
{"path": "xxx/test_mcq.csv", "data_type": "mcq", "infer_method": "ppl"},
{"path": "xxx/test_qa.jsonl", "data_type": "qa", "infer_method": "gen"},
]
```
## Supplemental Information for Dataset `.meta.json`
OpenCompass will try to parse the input dataset file by default, so in most cases the `.meta.json` file is **not necessary**. However, if the dataset field names are not the default ones, or custom prompts are required, they should be specified in the `.meta.json` file.
The file is placed in the same directory as the dataset, with the filename followed by `.meta.json`. An example file structure is as follows:
```tree
.
├── test_mcq.csv
├── test_mcq.csv.meta.json
├── test_qa.jsonl
└── test_qa.jsonl.meta.json
```
Possible fields in this file include:
- `abbr` (str): Abbreviation of the dataset, serving as its ID.
- `data_type` (str): Type of dataset, options are `mcq` and `qa`.
- `infer_method` (str): Inference method, options are `ppl` and `gen`.
- `human_prompt` (str): User prompt template for generating prompts. Variables in the template are enclosed in `{}`, like `{question}`, `{opt1}`, etc. If `template` exists, this field will be ignored.
- `bot_prompt` (str): Bot prompt template for generating prompts. Variables in the template are enclosed in `{}`, like `{answer}`, etc. If `template` exists, this field will be ignored.
- `template` (str or dict): Question template for generating prompts. Variables in the template are enclosed in `{}`, like `{question}`, `{opt1}`, etc. The relevant syntax is in [here](../prompt/prompt_template.md) regarding `infer_cfg['prompt_template']['template']`.
- `input_columns` (list): List of input fields for reading data.
- `output_column` (str): Output field for reading data.
- `options` (list): List of options for reading data, valid only when `data_type` is `mcq`.
For example:
```json
{
"human_prompt": "Question: 127 + 545 + 588 + 620 + 556 + 199 =\nA. 2632\nB. 2635\nC. 2645\nAnswer: Let's think step by step, 127 + 545 + 588 + 620 + 556 + 199 = 672 + 588 + 620 + 556 + 199 = 1260 + 620 + 556 + 199 = 1880 + 556 + 199 = 2436 + 199 = 2635. So the answer is B.\nQuestion: {question}\nA. {A}\nB. {B}\nC. {C}\nAnswer: ",
"bot_prompt": "{answer}"
}
```
or
```json
{
"template": "Question: {my_question}\nX. {X}\nY. {Y}\nZ. {Z}\nW. {W}\nAnswer:",
"input_columns": ["my_question", "X", "Y", "Z", "W"],
"output_column": "my_answer",
}
```
# Evaluation with Lightllm
We now support the evaluation of large language models using [Lightllm](https://github.com/ModelTC/lightllm) for inference. Developed by SenseTime, Lightllm is a Python-based LLM inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. Lightllm supports a variety of large language models and allows users to perform inference by deploying the model locally as a service. During evaluation, OpenCompass feeds data to Lightllm through an API and processes the responses. OpenCompass has been adapted for compatibility with Lightllm, and this tutorial will guide you on using OpenCompass to evaluate models with Lightllm as the inference backend.
## Setup
### Install OpenCompass
Please follow the [instructions](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) to install OpenCompass and prepare the evaluation datasets.
### Install Lightllm
Please follow the [Lightllm homepage](https://github.com/ModelTC/lightllm) to install Lightllm. Pay attention to aligning the versions of the relevant dependencies, especially the version of transformers.
## Evaluation
We use the evaluation of HumanEval with the Llama-2-7B model as an example.
### Step-1: Deploy the model locally as a service using Lightllm.
```shell
python -m lightllm.server.api_server --model_dir /path/llama2-7B \
--host 0.0.0.0 \
--port 1030 \
--nccl_port 2066 \
--max_req_input_len 4096 \
--max_req_total_len 6144 \
--tp 1 \
--trust_remote_code \
--max_total_token_num 120000
```
**Note:** `--tp` can be configured to enable tensor-parallel inference on multiple GPUs, which is suitable for very large models.
**Note:** The `--max_total_token_num` in the above command affects the throughput during testing. It can be configured according to the documentation on the [Lightllm homepage](https://github.com/ModelTC/lightllm). As long as it does not run out of memory, it is often better to set it as high as possible.
**Note:** If you want to start multiple Lightllm services on the same machine, you need to use different values for `--port` and `--nccl_port`.
You can use the following Python script to quickly test whether the current service has been successfully started.
```python
import time
import requests
import json
url = 'http://localhost:1030/generate'  # must match the --port used when launching the server
headers = {'Content-Type': 'application/json'}
data = {
'inputs': 'What is AI?',
"parameters": {
'do_sample': False,
'ignore_eos': False,
'max_new_tokens': 1024,
}
}
response = requests.post(url, headers=headers, data=json.dumps(data))
if response.status_code == 200:
print(response.json())
else:
print('Error:', response.status_code, response.text)
```
### Step-2: Evaluate the above model using OpenCompass.
```shell
python run.py configs/eval_lightllm.py
```
You are expected to get the evaluation results after the inference and evaluation.
**Note:** In `eval_lightllm.py`, please align the configured URL with the service address from the previous step.
# Evaluation with LMDeploy
We now support evaluation of models accelerated by [LMDeploy](https://github.com/InternLM/lmdeploy). LMDeploy is a toolkit designed for compressing, deploying, and serving LLMs, with remarkable inference performance. Below, we illustrate how to evaluate a model with LMDeploy support in OpenCompass.
## Setup
### Install OpenCompass
Please follow the [instructions](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) to install OpenCompass and prepare the evaluation datasets.
### Install LMDeploy
Install lmdeploy via pip (python 3.8+)
```shell
pip install lmdeploy
```
The default prebuilt package is compiled on CUDA 12. However, if CUDA 11+ is required, you can install lmdeploy by:
```shell
export LMDEPLOY_VERSION=0.6.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
## Evaluation
When evaluating a model, it is necessary to prepare an evaluation configuration that specifies information such as the evaluation dataset, the model, and inference parameters.
Taking [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) as an example, the evaluation config is as follows:
```python
# configure the dataset
from mmengine.config import read_base
with read_base():
# choose a list of datasets
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_a58960 import \
gsm8k_datasets
# and output the results in a chosen format
from .summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
# configure lmdeploy
from opencompass.models import TurboMindModelwithChatTemplate
# configure the model
models = [
dict(
type=TurboMindModelwithChatTemplate,
        abbr='internlm2-chat-7b-lmdeploy',
# model path, which can be the address of a model repository on the Hugging Face Hub or a local path
path='internlm/internlm2-chat-7b',
# inference backend of LMDeploy. It can be either 'turbomind' or 'pytorch'.
# If the model is not supported by 'turbomind', it will fallback to
# 'pytorch'
backend='turbomind',
# For the detailed engine config and generation config, please refer to
# https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/messages.py
engine_config=dict(tp=1),
gen_config=dict(do_sample=False),
# the max size of the context window
max_seq_len=7168,
# the max number of new tokens
max_out_len=1024,
# the max number of prompts that LMDeploy receives
# in `generate` function
batch_size=5000,
run_cfg=dict(num_gpus=1),
)
]
```
Place the aforementioned configuration in a file, such as "configs/eval_internlm2_lmdeploy.py". Then, in the home folder of OpenCompass, start the evaluation with the following command:
```shell
python run.py configs/eval_internlm2_lmdeploy.py -w outputs
```
You are expected to get the evaluation results after the inference and evaluation.