The whole evaluation pipeline consists of three methods:
1. `GPT Evaluation`: evaluates model predictions using GPT models.
    * Compare the performance of two different models (battle).
    * Rate the model according to pre-defined metrics using prompting design.
2. `Automatic Evaluation`: evaluates model predictions using automatic metrics.
3. `UniEval`: evaluates model predictions using UniEval models (English only).
### Evaluation Category
...
GPT models evaluate the quality of model predictions based on the given prompt words and give a score between 1 and 5.
> **NOTE 1:** Even for the same metric, the details of its prompt words and CoT (Chain-of-Thought) can differ depending on which category you want to evaluate. For example, the prompt words for the metric `correctness` shown here are "The answer should be in line with common sense, life experience, etc." (this is for the category `brainstorming`), while for the category `extraction` the prompt words can be "Answers should extract the required information accurately and should not contain any incorrect or misleading information." You can find all the prompt words and CoT (Chain-of-Thought) in `prompt/evaluation_prompt`.

> **NOTE 2:** To add customized metrics, you can refer to [FAQ](#faq).

#### Automatic Evaluation
...
* For instructions coming from human-designed problems (e.g. roleplay, chat), the reference answers are generated by GPT-3.5.
* For instructions related to classic NLP problems (e.g. classification, extraction, summarization), the reference answers are collected from open-source datasets with target answers.
There are 6 types of automatic evaluation metrics listed in the table below:
...
| Distinct | Measure the diversity of the generated text by counting the unique n-grams. |
| BERTScore | Measure the semantic similarity between tokens of predictions and references with BERT. |
| Precision<br/> Recall<br/> F1 Score | Measure the number of overlaps between prediction and reference (designed for classification and extraction categories). |
| CHRF | Measure the similarity of character n-grams between prediction and reference. |
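
As an illustration of the diversity metric, `Distinct-n` can be computed as the ratio of unique n-grams to all n-grams in the generated text. The snippet below is a minimal sketch of that idea, not the pipeline's actual implementation:

```python
from typing import List

def distinct_n(tokens: List[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams in a token sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
# Higher values indicate more diverse generations.
print(distinct_n(tokens, 1), distinct_n(tokens, 2))
```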
#### UniEval Evaluation
UniEval converts all evaluation tasks of different dimensions (metrics) into Boolean QA problems and utilizes the model to answer with "Yes" or "No". Compared with similarity-based metrics such as ROUGE and BLEU, UniEval can achieve a more comprehensive evaluation. In addition, UniEval also demonstrates its ability to transfer to unseen dimensions and tasks.
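
As a rough illustration of this conversion, each dimension is rephrased as a yes/no question concatenated with the text to be judged. The question wording below is illustrative only; the actual templates are defined in `unieval/utils.py`:

```python
# Illustrative sketch of UniEval's Boolean QA formulation (not the exact templates
# used in unieval/utils.py).
def to_boolean_qa(dimension: str, prediction: str, reference: str = "") -> str:
    if dimension == "fluency":
        return f"question: Is this a fluent paragraph </s> paragraph: {prediction}"
    if dimension == "relevance":
        return (f"question: Is this summary relevant to the reference </s> "
                f"summary: {prediction} </s> reference: {reference}")
    raise ValueError(f"unsupported dimension: {dimension}")

# The UniEval model answers "Yes" or "No"; the probability assigned to "Yes" is
# used as the score for that dimension.
print(to_boolean_qa("fluency", "The quick brown fox jumps over the lazy dog."))
```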
In our evaluation pipeline, two pre-trained UniEval evaluators are used. One is [unieval-sum](https://huggingface.co/MingZhong/unieval-sum) and the other is [unieval-dialog](https://huggingface.co/MingZhong/unieval-dialog). The two models can be used for three tasks: `summarization`, `dialogue` and `data2text`. Each task has different evaluation dimensions.
| UniEval Model | Task | Dimension (Metric) |
| :------------: | :----------------- | :--- |
| unieval-sum | summarization | coherence: whether the summary is coherent<br/>consistency: whether the claim is consistent with the given document<br/>fluency: whether the paragraph is fluent<br/>relevance: whether the summary is relevant to the reference |
| unieval-sum | data2text | naturalness: whether the utterance is fluent<br/>informativeness: whether the utterance is informative according to the reference |
| unieval-dialog | dialogue | naturalness: whether the response is natural in the dialogue<br/>coherence: whether the response is coherent in the dialogue history<br/>understandability: whether the response is understandable in the dialogue |
> **NOTE 1:** The task "data2text" uses the same model as the task "summarization".

> **NOTE 2:** In the UniEval paper, the `unieval-sum` model demonstrates the best transfer ability, so you can evaluate your customized metrics with this model. Details of adding customized metrics can be found in [FAQ](#faq).

> **NOTE 3:** We do not include all the metrics provided in UniEval in our pipeline, because the data structure and content of the instructions we want to evaluate are not suitable for the direct use of some UniEval metrics.

## Evaluation Process
...
#### Configuration
The following is an example of a Chinese config file. The configuration file controls how the pipeline evaluates the model. You need to specify GPT evaluation metrics, automatic metrics and UniEval metrics in the keys `GPT`, `Metrics` and `UniEval` (English only). You can find an example English config file in `config`.
...
`"language"`: the language used to evaluate the model capability. We only support Chinese `"cn"` for now.
`"language"`: the language used to evaluate the model capability. We only support Chinese `"cn"` for now.
`"path_for_UniEval"`: path to the UniEval model.
`"category"`: the category/categories needed to evaluate the model capability.
`"category"`: the category/categories needed to evaluate the model capability.
`"GPT"`: the metrics you want to use for GPT evaluation.
`"GPT"`: the metrics you want to use for GPT evaluation.
`"Metrics"`: the metrics you want to use for automatic metrics evaluation.
`"Metrics"`: the metrics you want to use for automatic metrics evaluation.
`"UniEval"`: the metrics you want to use for UniEval metrics evaluation. The metric has to be in the `"{task}-{metric}"` format because different tasks have same metrics such as naturalness and coherence.
You can remove the key such as `"Metrics"` to skip evaluating answers using its corresponding evaluation metrics.
You can create your config file based on the available settings listed in the following table.
> **NOTE:** For categories which don't have standard answers, such as `brainstorming`, you should avoid using automatic metrics such as `BLEU` and `ROUGE`, which are based on similarity measures; use `Distinct` instead in your config file.
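
For illustration, the metric-related part of a config file might look like the sketch below. The nesting of the `GPT`, `Metrics` and `UniEval` keys under each category is an assumption here, and the metric names are just examples taken from this document; see the example file in `config` for the exact layout.

```json
{
    "category": {
        "brainstorming": {
            "GPT": ["correctness"],
            "Metrics": ["Distinct"],
            "UniEval": ["summarization-fluency", "data2text-naturalness"]
        }
    }
}
```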
...
"id":1,
"id":1,
"category":"brainstorming",
"category":"brainstorming",
"metrics":{
"metrics":{
"persuasiveness":"说服力(1-5):XXX"
"persuasiveness":"persuasiveness(1-5):a short description for persuasiveness"
},
},
"CoT":{
"CoT":{
"persuasiveness":"XXX\n\n说服力:"
"persuasiveness":"CoT for persuasiveness\n\npersuasiveness:"
"prompt":"You are a good assistant. Please rate the given answer to the \"brainstorming\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
}
}
}
}
```
</details>
<details><summary><b>How can I add a new UniEval evaluation metric?</b></summary>
For example, if you want to add a new metric `persuasiveness` to the task `data2text`, you should add a Boolean QA question about the metric in the function `add_question` in `unieval/utils.py`. Please note that how effectively the model can evaluate this metric is unknown, and you may need some experiments to test whether the model is capable of evaluating it.
```python
if task == 'data2text':
    if dimension == 'persuasiveness':
        cur_input = 'question: Is this a persuasive utterance </s> utterance: ' + output[i]
```
</details>
## To Do
- [x] Add evaluation for English capability
- [x] Support UniEval
- [x] Support GPT-4 evaluation
- [ ] Support GPT evaluation with reference in the prompt
## Citations
...
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
@misc{zhong2022unified,
    title={Towards a Unified Multi-Dimensional Evaluator for Text Generation},
    author={Ming Zhong and Yang Liu and Da Yin and Yuning Mao and Yizhu Jiao and Pengfei Liu and Chenguang Zhu and Heng Ji and Jiawei Han},
    year={2022},
    eprint={2210.07197},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
This module implements the abstraction of the device topology. It represents the device topology and manages the distributed information related to the network.
## 📝 Design
This module is inspired by the DeviceMesh in the [Alpa project](https://github.com/alpa-projects/alpa), and the device array can be represented as a 1D or 2D mesh. We will extend the device mesh to support 3D meshes in the future.
## 🔨 Usage
- Create a device mesh
```python
# this is the list of global ranks involved in the device mesh
# assume we have 4 GPUs and the global ranks for these GPUs are 0, 1, 2, 3
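# A minimal sketch of the remaining steps (assumed API; check the DeviceMesh class
# in this module for the exact import path and constructor signature):
import torch
from colossalai.device.device_mesh import DeviceMesh

physical_mesh_id = torch.arange(0, 4)  # global ranks 0, 1, 2, 3
mesh_shape = (2, 2)                    # arrange the 4 GPUs as a 2 x 2 logical mesh
device_mesh = DeviceMesh(physical_mesh_id, mesh_shape)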