### What are the differences and connections between `ppl` and `gen`?
`ppl` stands for perplexity, a metric used to evaluate a model's language modeling capability. In the context of OpenCompass, it generally refers to a way of answering multiple-choice questions: given a context, the model must choose the most appropriate of several options. We concatenate each of the n options with the context to form n sequences, then compute the model's perplexity over each sequence. The option whose sequence has the lowest perplexity is taken as the model's answer to the question. The post-processing for this evaluation method is simple, direct, and deterministic.
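To make the mechanism concrete, here is a minimal sketch of `ppl`-style option scoring built on HuggingFace `transformers`. This is an illustration rather than OpenCompass's actual inferencer; the model name, prompt, and options are placeholders:

```python
# Minimal sketch of ppl-style MCQ scoring (illustrative, not OpenCompass code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.eval()

context = 'Question: The capital of France is\nAnswer:'
options = [' Paris', ' London', ' Berlin', ' Madrid']

def sequence_ppl(text: str) -> float:
    """Perplexity of the whole sequence under the model."""
    ids = tokenizer(text, return_tensors='pt').input_ids
    with torch.no_grad():
        # With labels=input_ids, HF returns the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# One forward pass per option; the lowest-perplexity sequence wins.
ppls = [sequence_ppl(context + option) for option in options]
prediction = options[ppls.index(min(ppls))]
```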
`gen` is an abbreviation of "generate". In the context of OpenCompass, it refers to using the model's continuation of a given context as its answer to the question. The generated string generally requires heavier post-processing to extract a reliable answer and complete the evaluation.
In terms of usage: for base models, `ppl` is used for multiple-choice questions and similar single-answer tasks, while `gen` is used for multiple-select and all other question types. For chat models, `gen` is used for all tasks, since many commercial API models do not expose a `ppl` interface. There are exceptions, however: for instance, when we want a base model to output its reasoning process (e.g., "Let's think step by step"), we also use `gen`. The overall usage is summarized in the following table:
|            | `ppl`          | `gen`                |
| ---------- | -------------- | -------------------- |
| Base Model | Only MCQ Tasks | Tasks Other Than MCQ |
| Chat Model | None           | All Tasks            |
Similar to `ppl`, conditional log probability (`clp`) computes the probability of the next token given a context. It is also applicable only to multiple-choice questions, and the probability is computed only over the tokens corresponding to the option labels; the option whose token has the highest probability is taken as the model's answer. Compared to `ppl`, `clp` is more efficient, requiring only one inference pass, whereas `ppl` requires n. Its drawback is sensitivity to the tokenizer: for example, the presence or absence of whitespace around an option label can change the tokenizer's encoding and make test results unreliable. Therefore, `clp` is rarely used in OpenCompass.
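By contrast, a minimal sketch of `clp`-style scoring needs only one forward pass. Again, this is illustrative with placeholder names rather than OpenCompass code; the leading space in each label is exactly the kind of detail that makes `clp` tokenizer-sensitive:

```python
# Minimal sketch of clp-style MCQ scoring (illustrative, not OpenCompass code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.eval()

context = 'Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:'
labels = [' A', ' B', ' C', ' D']  # a leading space changes the encoding!

ids = tokenizer(context, return_tensors='pt').input_ids
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]  # one inference pass

# Each label should encode to a single token for this comparison to be fair.
label_ids = [tokenizer.encode(label)[0] for label in labels]
prediction = labels[int(torch.argmax(next_token_logits[label_ids]))]
```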
### How does OpenCompass control the number of shots in few-shot evaluations?
In the dataset configuration file, there is a `retriever` field indicating how to recall samples from the dataset as in-context examples. The most commonly used is `FixKRetriever`, which uses k fixed samples, hence k-shot. There is also `ZeroRetriever`, which recalls no samples and in most cases implies 0-shot.
On the other hand, in-context examples can also be specified directly in the dataset template. In this case, `ZeroRetriever` is still used, but the evaluation is not 0-shot; the actual shot count depends on the specific template. Refer to [prompt](../prompt/prompt_template.md) for more details.
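For illustration, here is a sketch of how these retrievers typically appear in a dataset's inference config (the field names such as `fix_id_list` follow common OpenCompass configs; verify the exact signatures against your version):

```python
# Sketch of the retriever field in a dataset's infer_cfg.
from opencompass.openicl.icl_retriever import FixKRetriever, ZeroRetriever

# 5-shot: always recall dataset items 0-4 as in-context examples.
infer_cfg_5shot = dict(
    retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
    # ice_template / prompt_template / inferencer omitted for brevity
)

# 0-shot: recall no in-context examples (unless the template hard-codes some).
infer_cfg_0shot = dict(
    retriever=dict(type=ZeroRetriever),
)
```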
### How does OpenCompass allocate GPUs?
OpenCompass processes evaluation requests in units called "tasks". Each task is an independent combination of model(s) and dataset(s). The GPU resources a task needs are determined entirely by the model being evaluated, specifically by its `num_gpus` parameter.
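For illustration, here is a sketch of how a model declares its GPU demand via `run_cfg`, a common pattern in OpenCompass model configs (the model name and path are hypothetical, and exact fields may differ by version):

```python
# Hypothetical model entry; each task evaluating this model requests
# run_cfg.num_gpus GPUs. Exact fields may vary across OpenCompass versions.
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='my-7b-model',           # hypothetical abbreviation
        path='my-org/my-7b-model',    # hypothetical HuggingFace path
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),  # 1 GPU per task for this model
    )
]
```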
...

### My server cannot connect to the Internet, how can I use OpenCompass?

Because of HuggingFace's implementation, OpenCompass requires a network connection (especially to HuggingFace) when loading some datasets and models. To evaluate in an offline environment, you can:

- Use HuggingFace offline mode by setting its environment variables before launching; a sketch follows (these flags come from HuggingFace's offline mode, so verify them against your `transformers`/`datasets` versions):
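```python
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_EVALUATE_OFFLINE=1 python run.py ...
```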
With these variables set, no network connection is needed for the evaluation. However, an error will still be raised if the files of any dataset or model are missing from the cache.
- Use a mirror like [hf-mirror](https://hf-mirror.com/):
```python
HF_ENDPOINT=https://hf-mirror.com python run.py ...
```
- [6. Merge your branch to `main` branch and delete the branch](#6--merge-your-branch-to-main-branch-and-delete-the-branch)
- [Code style](#code-style)
  - [Python](#python)
- [About Contributing Test Datasets](#about-contributing-test-datasets)
Thanks for your interest in contributing to OpenCompass! All kinds of contributions are welcome, including but not limited to the following.
...

We use the following tools for linting and formatting:
- [docformatter](https://github.com/myint/docformatter): A formatter to format docstrings.
Style configurations of yapf and isort can be found in [setup.cfg](https://github.com/open-mmlab/OpenCompass/blob/main/setup.cfg).
## About Contributing Test Datasets
- Submitting Test Datasets
- Please implement automatic download logic for the dataset in the code, or provide a method for obtaining the dataset in the PR; the OpenCompass maintainers will follow up accordingly. If the dataset is not yet public, please indicate so.
- Submitting Data Configuration Files
- Provide a README in the same directory as the data configuration. The README should include, but is not limited to:
- A brief description of the dataset
- The official link to the dataset
- Some test examples from the dataset
- Evaluation results of the dataset on relevant models
- Citation of the dataset
- (Optional) Summarizer of the dataset
- (Optional) If the testing process cannot be achieved simply by concatenating the dataset and model configuration files, a configuration file for conducting the test is also required.
- (Optional) If necessary, please add a description of the dataset to the relevant documentation sections; this greatly helps users understand the testing scheme. You can refer to the following types of documents in OpenCompass: