Unverified Commit d6261e10 authored by Leymore, committed by GitHub

[Doc] Update dataset list (#437)



* add new dataset list

* add new dataset list

* add new dataset list

* update

* update

* update readme

---------
Co-authored-by: gaotongxiao <gaotongxiao@gmail.com>
parent dc1b82c3
@@ -34,9 +34,10 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available. Welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥
- **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM). Welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B. Welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.18\]** We have released the [long context evaluation guidance](docs/en/advanced_guides/longeval.md).
- **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5. Welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.06\]** The [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass.
@@ -51,7 +52,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark. Its main features include:
- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, plus an evaluation scheme covering 70+ datasets with about 400,000 questions, comprehensively evaluating model capabilities in five dimensions.
- **Efficient distributed evaluation**: One-line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
@@ -67,6 +68,60 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🛠️ Installation
Below are the steps for quick installation and dataset preparation.
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# Download the datasets to the data/ folder
wget https://github.com/open-compass/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```
Some third-party features, like HumanEval and Llama, may require additional steps to work properly; for detailed steps, please refer to the [Installation Guide](https://opencompass.readthedocs.io/en/latest/get_started.html).
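For instance, HumanEval evaluation depends on OpenAI's `human-eval` package, which the base install above does not include. The following is a minimal sketch of that extra step, assuming the upstream `openai/human-eval` repository is the dependency; the Installation Guide remains the authoritative reference:
```bash
# Assumption: HumanEval support comes from OpenAI's human-eval package,
# installed from source next to the OpenCompass checkout.
git clone https://github.com/openai/human-eval.git
cd human-eval
pip install -e .
cd ..
```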
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🏗️ Evaluation
After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared, you can evaluate the performance of the LLaMA-7b model on the MMLU and C-Eval datasets using the following command:
```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
You can also evaluate other HuggingFace models via the command line, again taking LLaMA-7b as an example:
```bash
# --hf-path:          HuggingFace model path
# --model-kwargs:     arguments for model construction
# --tokenizer-kwargs: arguments for tokenizer construction
# --max-out-len:      maximum number of tokens generated
# --max-seq-len:      maximum sequence length the model can accept
# --batch-size:       batch size
# --no-batch-padding: disable batch padding and infer in a for loop to avoid accuracy loss
# --num-gpus:         number of required GPUs
python run.py --datasets ceval_ppl mmlu_ppl \
    --hf-path huggyllama/llama-7b \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 8 \
    --no-batch-padding \
    --num-gpus 1
```
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task.
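Beyond ad-hoc CLI flags, an evaluation is usually described by a config file passed straight to `run.py`. A minimal sketch, assuming a demo config `configs/eval_demo.py` ships with the repository and that `-w` chooses where predictions and results are written (verify both in the Quick Start):
```bash
# Run the models/datasets declared in a config file
# (assumptions: configs/eval_demo.py exists; -w sets the output work dir).
python run.py configs/eval_demo.py -w outputs/demo
```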
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Dataset Support
<table align="center">
@@ -82,10 +137,7 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
<b>Reasoning</b>
</td>
<td>
<b>Examination</b>
</td>
</tr>
<tr valign="top">
@@ -126,24 +178,33 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
<summary><b>Translation</b></summary>
- Flores
- IWSLT2017
</details>
</td>
<td>
<details open>
<summary><b>Multi-language Question Answering</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>Multi-language Summary</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
@@ -158,6 +219,7 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
- AX-g
- CB
- RTE
- ANLI
</details>
@@ -165,7 +227,6 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
<summary><b>Commonsense Reasoning</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
@@ -186,14 +247,8 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
<summary><b>Theorem Application</b></summary>
- TheoremQA
- SciBench
</details>
@@ -208,17 +263,44 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>Medical Examinations</b></summary>
- CMB
</details>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Understanding</b>
</td> </td>
<td> <td>
<b>Long Context</b>
</td>
<td>
<b>Safety</b>
</td>
<td>
<b>Code</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>
@@ -227,6 +309,9 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
@@ -236,6 +321,7 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
- CSL
- LCSTS
- XSum
- SummScreen
</details>
@@ -246,6 +332,48 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>Long Context Understanding</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>Safety</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>Robustness</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>Code</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
@@ -280,86 +408,28 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community
- Alpaca
- Baichuan
- WizardLM
- ChatGLM2
- Falcon
- TigerBot
- Qwen
- ...
</td>
<td>
- OpenAI
- Claude
- PaLM (coming soon)
- ...
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🔜 Roadmap
- [ ] Subjective Evaluation
......
@@ -34,9 +34,10 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available. Welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥
- **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM). Welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B. Welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.18\]** We have released the [long context evaluation guidance](docs/zh_cn/advanced_guides/longeval.md).
- **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5. Welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.06\]** The [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass.
@@ -53,7 +54,7 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
- **Open and reproducible**: a fair, open, and reproducible evaluation scheme for large models
- **Comprehensive capability dimensions**: a five-dimension design providing an evaluation scheme of 70+ datasets and about 400,000 questions, comprehensively assessing model capabilities
- **Rich model support**: 20+ HuggingFace and API models supported
@@ -69,6 +70,62 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🛠️ Installation
Below are the steps for quick installation and dataset preparation.
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# Download the datasets to the data/ folder
wget https://github.com/open-compass/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
```
Some third-party features, like HumanEval and Llama, may require additional steps to work properly; for detailed steps, please refer to the [Installation Guide](https://opencompass.readthedocs.io/zh_CN/latest/get_started.html).
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🏗️ Evaluation
After ensuring that OpenCompass is installed correctly according to the steps above and the datasets are prepared, you can evaluate the performance of the LLaMA-7b model on the MMLU and C-Eval datasets with the following command:
```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/zh_cn/tools.md#ListConfigs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
You can also evaluate other HuggingFace models via the command line, again taking LLaMA-7b as an example:
```bash
# --hf-path:          HuggingFace model path
# --model-kwargs:     arguments for model construction
# --tokenizer-kwargs: arguments for tokenizer construction
# --max-out-len:      maximum number of tokens generated
# --max-seq-len:      maximum sequence length the model can accept
# --batch-size:       batch size
# --no-batch-padding: disable batch padding and infer in a for loop to avoid accuracy loss
# --num-gpus:         number of required GPUs
python run.py --datasets ceval_ppl mmlu_ppl \
    --hf-path huggyllama/llama-7b \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 8 \
    --no-batch-padding \
    --num-gpus 1
```
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/zh_CN/latest/get_started.html#id3) to learn how to run an evaluation task.
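As a sketch of those more diversified launch options, the commands below assume a demo config `configs/eval_demo.py` and two flags on the `run.py` entry point: `--debug` for sequential, console-logged troubleshooting runs, and `--slurm` with `-p` for dispatching tasks to a Slurm partition; treat all three as assumptions to confirm in the Quick Start.
```bash
# Run tasks one by one with logs streamed to the console
# (assumption: --debug disables parallel task dispatch).
python run.py configs/eval_demo.py --debug

# Distribute evaluation tasks through a Slurm cluster
# (assumption: --slurm enables Slurm dispatch, -p names the partition).
python run.py configs/eval_demo.py --slurm -p my_partition
```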
For more tutorials, please check our [documentation](https://opencompass.readthedocs.io/zh_CN/latest/index.html).
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Dataset Support
<table align="center">
@@ -84,10 +141,7 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
<b>Reasoning</b>
</td>
<td>
<b>Examination</b>
</td>
</tr>
<tr valign="top">
@@ -128,24 +182,33 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
<summary><b>Translation</b></summary>
- Flores
- IWSLT2017
</details>
</td>
<td>
<details open>
<summary><b>Multi-language Question Answering</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>Multi-language Summary</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
@@ -160,6 +223,7 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
- AX-g
- CB
- RTE
- ANLI
</details>
@@ -167,7 +231,6 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
<summary><b>Commonsense Reasoning</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
@@ -188,14 +251,8 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
<summary><b>Theorem Application</b></summary>
- TheoremQA
- SciBench
</details>
@@ -210,16 +267,43 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>Medical Examinations</b></summary>
- CMB
</details>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Understanding</b>
</td>
<td>
<b>Long Context</b>
</td>
<td>
<b>Safety</b>
</td>
<td>
<b>Code</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>
@@ -229,6 +313,9 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
@@ -238,6 +325,7 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
- CSL
- LCSTS
- XSum
- SummScreen
</details>
@@ -248,6 +336,48 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>Long Context Understanding</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>Safety</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>Robustness</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>Code</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
@@ -276,92 +406,34 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:
<tr valign="top">
<td>
- InternLM
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM2
- Falcon
- TigerBot
- Qwen
- ...
</td>
<td>
- OpenAI
- Claude
- PaLM (coming soon)
- ...
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🔜 Roadmap
- [ ] Subjective Evaluation
......