### What are the differences and connections between `ppl` and `gen`?
`ppl` stands for perplexity, a metric used to evaluate a model's language modeling capability. In the context of OpenCompass, it generally refers to a way of answering multiple-choice questions: given a context, the model must choose the most appropriate of several options. We concatenate each of the n options with the context to form n sequences, then compute the model's perplexity over each sequence. The option whose sequence has the lowest perplexity is taken as the model's answer to the question. The post-processing for this evaluation method is simple, direct, and deterministic.
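To make the mechanism concrete, here is a minimal sketch of `ppl`-style option scoring built on HuggingFace `transformers`. This is an illustration rather than OpenCompass's actual inferencer; the model name, prompt, and options are placeholders:

```python
# Minimal sketch of ppl-style MCQ scoring (illustrative, not OpenCompass code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.eval()

context = 'Question: The capital of France is\nAnswer:'
options = [' Paris', ' London', ' Berlin', ' Madrid']

def sequence_ppl(text: str) -> float:
    """Perplexity of the whole sequence under the model."""
    ids = tokenizer(text, return_tensors='pt').input_ids
    with torch.no_grad():
        # With labels=input_ids, HF returns the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# One forward pass per option; the lowest-perplexity sequence wins.
ppls = [sequence_ppl(context + option) for option in options]
prediction = options[ppls.index(min(ppls))]
```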
`gen` is an abbreviation of "generate". In the context of OpenCompass, it refers to using the model's continuation of a given context as its answer to the question. The generated string generally requires heavier post-processing to extract a reliable answer and complete the evaluation.
In terms of usage: for base models, `ppl` is used for multiple-choice questions and similar single-answer tasks, while `gen` is used for multiple-select and all other question types. For chat models, `gen` is used for all tasks, since many commercial API models do not expose a `ppl` interface. There are exceptions, however: for instance, when we want a base model to output its reasoning process (e.g., "Let's think step by step"), we also use `gen`. The overall usage is summarized in the following table:
|            | `ppl`          | `gen`                |
| ---------- | -------------- | -------------------- |
| Base Model | Only MCQ Tasks | Tasks Other Than MCQ |
| Chat Model | None           | All Tasks            |
Similar to `ppl`, conditional log probability (`clp`) computes the probability of the next token given a context. It is also applicable only to multiple-choice questions, and the probability is computed only over the tokens corresponding to the option labels; the option whose token has the highest probability is taken as the model's answer. Compared to `ppl`, `clp` is more efficient, requiring only one inference pass, whereas `ppl` requires n. Its drawback is sensitivity to the tokenizer: for example, the presence or absence of whitespace around an option label can change the tokenizer's encoding and make test results unreliable. Therefore, `clp` is rarely used in OpenCompass.
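By contrast, a minimal sketch of `clp`-style scoring needs only one forward pass. Again, this is illustrative with placeholder names rather than OpenCompass code; the leading space in each label is exactly the kind of detail that makes `clp` tokenizer-sensitive:

```python
# Minimal sketch of clp-style MCQ scoring (illustrative, not OpenCompass code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.eval()

context = 'Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:'
labels = [' A', ' B', ' C', ' D']  # a leading space changes the encoding!

ids = tokenizer(context, return_tensors='pt').input_ids
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]  # one inference pass

# Each label should encode to a single token for this comparison to be fair.
label_ids = [tokenizer.encode(label)[0] for label in labels]
prediction = labels[int(torch.argmax(next_token_logits[label_ids]))]
```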
### How does OpenCompass control the number of shots in few-shot evaluations?
In the dataset configuration file, there is a `retriever` field indicating how to recall samples from the dataset as in-context examples. The most commonly used is `FixKRetriever`, which uses k fixed samples, hence k-shot. There is also `ZeroRetriever`, which recalls no samples and in most cases implies 0-shot.
On the other hand, in-context examples can also be specified directly in the dataset template. In this case, `ZeroRetriever` is still used, but the evaluation is not 0-shot; the actual shot count depends on the specific template. Refer to [prompt](../prompt/prompt_template.md) for more details.
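For illustration, here is a sketch of how these retrievers typically appear in a dataset's inference config (the field names such as `fix_id_list` follow common OpenCompass configs; verify the exact signatures against your version):

```python
# Sketch of the retriever field in a dataset's infer_cfg.
from opencompass.openicl.icl_retriever import FixKRetriever, ZeroRetriever

# 5-shot: always recall dataset items 0-4 as in-context examples.
infer_cfg_5shot = dict(
    retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
    # ice_template / prompt_template / inferencer omitted for brevity
)

# 0-shot: recall no in-context examples (unless the template hard-codes some).
infer_cfg_0shot = dict(
    retriever=dict(type=ZeroRetriever),
)
```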
### How does OpenCompass allocate GPUs?
OpenCompass processes evaluation requests in units called "tasks". Each task is an independent combination of model(s) and dataset(s). The GPU resources a task needs are determined entirely by the model being evaluated, specifically by its `num_gpus` parameter.
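For illustration, here is a sketch of how a model declares its GPU demand via `run_cfg`, a common pattern in OpenCompass model configs (the model name and path are hypothetical, and exact fields may differ by version):

```python
# Hypothetical model entry; each task evaluating this model requests
# run_cfg.num_gpus GPUs. Exact fields may vary across OpenCompass versions.
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='my-7b-model',           # hypothetical abbreviation
        path='my-org/my-7b-model',    # hypothetical HuggingFace path
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),  # 1 GPU per task for this model
    )
]
```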
...

### My server cannot connect to the Internet, how can I use OpenCompass?

Because of HuggingFace's implementation, OpenCompass requires a network connection (especially to HuggingFace) when loading some datasets and models. To evaluate in an offline environment, you can:

- Use HuggingFace offline mode by setting its environment variables before launching; a sketch follows (these flags come from HuggingFace's offline mode, so verify them against your `transformers`/`datasets` versions):
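```python
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```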
With these variables set, no network connection is needed for the evaluation. However, an error will still be raised if the files of any dataset or model are missing from the cache.
- Use a mirror like [hf-mirror](https://hf-mirror.com/):
```python
HF_ENDPOINT=https://hf-mirror.com python run.py ...
```
- [6. Merge your branch to `main` branch and delete the branch](#6--merge-your-branch-to-main-branch-and-delete-the-branch)
- [Code style](#code-style)
  - [Python](#python)
- [About Contributing Test Datasets](#about-contributing-test-datasets)
Thanks for your interest in contributing to OpenCompass! All kinds of contributions are welcome, including but not limited to the following.
...

We use the following tools for linting and formatting:
- [docformatter](https://github.com/myint/docformatter): A formatter to format docstrings.
Style configurations of yapf and isort can be found in [setup.cfg](https://github.com/open-mmlab/OpenCompass/blob/main/setup.cfg).
## About Contributing Test Datasets
- Submitting Test Datasets
- Please implement automatic download logic for the dataset in the code, or provide a method for obtaining the dataset in the PR; the OpenCompass maintainers will follow up accordingly. If the dataset is not yet public, please indicate so.
- Submitting Data Configuration Files
- Provide a README in the same directory as the data configuration. The README should include, but is not limited to:
- A brief description of the dataset
- The official link to the dataset
- Some test examples from the dataset
- Evaluation results of the dataset on relevant models
- Citation of the dataset
- (Optional) Summarizer of the dataset
- (Optional) If the testing process cannot be achieved simply by concatenating the dataset and model configuration files, a configuration file for conducting the test is also required.
- (Optional) If necessary, please add a description of the dataset to the relevant documentation sections; this greatly helps users understand the testing scheme. You can refer to the following types of documents in OpenCompass: