[evaluate] support gpt evaluation with reference (#3972)

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>

[evaluate] support gpt evaluation with reference (#3972)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2925f473 · Yuanchen · GitHub · 9d02590c · 2925f473 · 2925f473
Unverified Commit 2925f473 authored Jun 13, 2023 by Yuanchen Committed by GitHub Jun 13, 2023
8 changed files
--- a/applications/Chat/evaluate/README.md
+++ b/applications/Chat/evaluate/README.md
@@ -17,6 +17,7 @@ The whole evaluation pipeline consists of three methods:
 1. `GPT Evaluation`: evaluates model predictions using GPT models.
   * Compare the performance of two different models (battle).
   * Rate the model according to pre-defined metrics using prompting design.
+   * Rate the model according to pre-defined metrics with additional reference answer using prompting design.
 2. `Automatic Evaluation`: evaluates model predictions using automatic metrics.
 3. `UniEval`: evaluates model predictions using UniEval models(English only).

@@ -66,7 +67,7 @@ GPT evaluation uses GPT models to evaluate the prediction of different models an
 |       切题<br/>(Relevance)       | 切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。</br></br>Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic. | 1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。<br/> 2. 阅读答案，确认答案是否直接回答了题目所问的问题。<br/> 3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。<br/> 4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。</br></br>1. Read the question to determine what the question asks and what aspects of the question need to be answered.<br>2. Read the answers to make sure that they directly answer the question asked.<br>3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.<br>4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all. |
 |      创意性<br/>(Creativity)       | 创意性(1-5)：某些头脑风暴问题可能需要答案具有创意，提出新的思路。</br></br>Creativity (1-5): Some brainstorming questions may require answers that are creative and suggest new ideas. | 1. 仔细阅读所提供的头脑风暴问题，确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验，判断所提供的答案是否可行。如果答案不可行，则创意性评分可能会受到影响。<br/> 3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠，但仍然可以被认为是有创意的，只要它提供了新的角度或方法来解决问题。<br/> 4. 根据答案的创意性，给出一个1到5的评分。如果答案缺乏创意，则应给出一个较低的评分。如果答案具有创意并提供了新的思路，应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the creativity score may be affected.<br>3. Consider whether the answer contains novel ideas or unique thoughts. An answer may overlap with a known solution and still be considered creative, as long as it offers a new perspective or approach to the problem.<br>4. Give a score of 1 to 5 depending on the creativity of the answer. If the answer lacks creativity, a lower score should be given. If the answer is creative and provides a new idea, a higher score should be given. |
 |     实用性<br/>(Practicality)      | 实用性(1-5)：某些头脑风暴问题可能需要答案提出实用的建议或解决方法。</br></br>Practicality (1-5): Some brainstorming questions may require answers to suggest practical suggestions or solutions. | 1. 仔细阅读所提供的头脑风暴问题，确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验，判断所提供的答案是否可行。如果答案不可行，则实用性评分可能会受到影响。<br/> 3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好，但如果无法实现或应用，则实用性评分可能会受到影响。<br/> 4. 根据答案的实用性，给出一个1到5的评分。如果答案缺乏实用性，则应给出一个较低的评分。如果答案提出了实用的建议或解决方法，并且可以很好地解决问题，则应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the practicality score may be affected.<br>3. Consider whether the suggestions or solutions presented in the answer are practical and workable. The answer may look good, but if it cannot be implemented or applied, the practicality score may be affected.<br>4. Give a score of 1 to 5 depending on the practicality of the answer. If the answer lacks practicality, a lower score should be given. If the answer makes a practical suggestion or solution and solves the problem well, a higher score should be given. |
-|      正确性<br/>(Correctness)      | 正确性(1-5)：答案应该符合常识、生活实际等等。 </br></br> Correctness (1-5): The answer should be in line with common sense, life experience, etc. | 1. 仔细阅读所提供的头脑风暴问题，确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验，判断所提供的答案是否可行。如果答案不可行，则正确性评分可能会受到影响。<br/> 3. 考虑答案中所提供的信息是否正确、符合常识、生活实际等等。如果答案中存在明显的错误或不合理之处，则正确性评分可能会受到影响。<br/> 4. 根据答案的正确性，给出一个1到5的评分。如果答案存在明显的错误或不合理之处，则应给出一个较低的评分。如果答案正确、符合常识、生活实际等等，则应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the correctness score may be affected.<br>3. Consider whether the information provided in the answer is correct, consistent with common sense, real life, etc. If there are obvious errors or implausibilities in the answer, the correctness score may be affected.<br>4. Give a score of 1 to 5 depending on the correctness of the answer. If the answer contains obvious errors or unreasonable points, a lower score should be given. A higher score should be given if the answer is correct, consistent with common sense, real life, etc. |
+|      正确性<br/>(Correctness)      | 正确性(1-5)：正确性(1-5)：答案是否正确。</br></br> Correctness (1-5): whether the answer is correct or not. | 1. 仔细阅读题目，尝试自己回答该问题。<br/>2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的，则可以将正确性得分为5分。如果答案是部分正确的，则可以给予适当的得分，例如2分、3分或4分。如果答案完全不正确，则只得1分。<br/><br/>1. Read the question carefully and try to answer the question yourself. <br/>2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded. |
 |      自然<br/>(Naturalness)      | 自然(1-5)：答案是否自然，并且符合问题给定的身份。</br></br>Naturalness (1-5): whether the answer is natural and fits the identity given by the question. | 1. 阅读题目，确定题目提供的身份信息。<br/> 2. 检查答案内容是否符合题目给定的身份。<br/> 3. 根据以上因素，对该回答的自然性进行打分，分数从1到5，其中1表示不自然，5表示非常自然，并符合问题给定的身份。</br></br>1. Read the question and determine the identity information provided in the question.<br>2. Check whether the content of the answer matches the identity given in the question.<br>3. Based on the above factors, score the naturalness of the response on a scale from 1 to 5, where 1 means unnatural and 5 means very natural and in accordance with the identity given in the question. |
 |     参与感<br/>(Engagingness)      | 参与感(1-5)：答案是否对前面的对话内容做出了恰当的反应，是否理解对话的语境和背景。</br></br>Engagingness (1-5): whether the answer responds appropriately to the content of the preceding conversation and whether it understands the context and background of the conversation. | 1. 阅读题目，确定对话的语境和背景。<br/> 2. 检查答案是否充分理解对话的语境和背景，能否自然地融入到对话中而不显得突兀。<br/> 3. 根据以上因素，对该回答的参与感进行打分，分数从1到5，其中1表示没有参与感，5表示非常有参与感，并且恰当地理解了对话的语境和背景。</br></br>1. Read the questions to determine the context and background of the dialogue.<br>2. Check that the answer fully understands the context and background of the conversation and that it fits naturally into the conversation without seeming abrupt.<br>3. Based on the above factors, rate the response's engagement on a scale from 1 to 5, where 1 means not engaged and 5 means very engaged and appropriately understands the context and background of the conversation. |
 |    合理性<br/>(Reasonableness)     | 合理性(1-5)：答案是否能够与前面的对话内容形成逻辑上的衔接，是否符合常理，能否在这个上下文中合理存在。</br></br>Reasonableness (1-5): Whether the answer can form a logical connection with the content of the previous dialogue, whether it is consistent with common sense, and whether it can reasonably exist in this context. | 1. 阅读题目，确定对话的主题以及问题期望的回答方向。<br/> 2. 判断答案是否能够与前面的对话内容形成逻辑上的衔接，是否符合常理，能否在这个上下文中合理存在。<br/> 3. 根据以上因素，对该回答的合理性进行打分，分数从1到5，其中1表示不合理，5表示非常合理，并且能够与前面的对话内容形成逻辑上的衔接，并符合常理。</br></br>1. Read the question and determine the topic of the conversation and the direction the question expects the answer to go.<br>2. Determine whether the answer can be logically connected to the preceding conversation, whether it makes common sense, and whether it can reasonably exist in this context.<br>3. Based on the above factors, rate the reasonableness of the answer on a scale from 1 to 5, where 1 means unreasonable and 5 means very reasonable and able to form a logical connection with the preceding dialogue content and consistent with common sense. |
@@ -76,7 +77,7 @@ GPT evaluation uses GPT models to evaluate the prediction of different models an

 GPT models evaluate the quality of model predictions based on the given prompt words and gives a score between 1-5.

-> **NOTE 1:**  Even for the same metric, the details of its prompt words and CoT(Chain-of-Thought) can differ based on which category you want to evaluate. For example, prompt words for metric `correctness` showed here is "The answer should be in line with common sense, life experience, etc."(this is for category `brainstorming`), but for category `extraction`, prompt words can be "Answers should extract the required information accurately and should not contain any incorrect or misleading information." You can find all the prompt words and CoT(Chain-of-Thought) in `prompt/evaluation_prompt`.
+> **NOTE 1:**  Even for the same metric, the details of its prompt words and CoT(Chain-of-Thought) can differ based on which category you want to evaluate. For example, prompt words for metric `correctness` showed here is "Whether the answer is correct or not."(this is for category `classification`), but for category `extraction`, prompt words can be "Answers should extract the required information accurately and should not contain any incorrect or misleading information." You can find all the prompt words and CoT(Chain-of-Thought) in `prompt/evaluation_prompt`.

 > **NOTE 2:** To add customized metrics, you can refer to [FAQ](#faq).

@@ -249,7 +250,7 @@ The following is an example of a Chinese config file. The configuration file can
    },
    "category": {
        "brainstorming": {
-            "GPT": ["relevance", "creativity", "practicality", "correctness"],
+            "GPT": ["relevance", "creativity", "practicality", "reasonableness"],
            "Metrics": ["Distinct"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
@@ -313,6 +314,8 @@ python eval.py \
    --openai_key "your openai key" \
 ```

+If you want GPT evaluation with reference, you can add an argument `--gpt_with_reference`.
+
 ## FAQ

 <details><summary><b>How can I add a new GPT evaluation metric?</b></summary>
@@ -354,7 +357,7 @@ if task == 'data2text':
 - [x] Add evaluation for English capability
 - [x] Support UniEval
 - [x] Support GPT-4 evaluation
- [ ] Support GPT evaluation with reference in the prompt
+- [x] Support GPT evaluation with reference

 ## Citations


--- a/applications/Chat/evaluate/config/config_cn.json
+++ b/applications/Chat/evaluate/config/config_cn.json
@@ -7,7 +7,7 @@
        "relevance",
        "creativity",
        "practicality",
-        "correctness"
+        "reasonableness"
      ],
      "Metrics": [
        "Distinct"

--- a/applications/Chat/evaluate/config/config_en.json
+++ b/applications/Chat/evaluate/config/config_en.json
@@ -12,7 +12,7 @@
        "relevance",
        "creativity",
        "practicality",
-        "correctness"
+        "reasonableness"
      ],
      "Metrics": [
        "Distinct"

--- a/applications/Chat/evaluate/eval.py
+++ b/applications/Chat/evaluate/eval.py
@@ -38,9 +38,14 @@ def main(args):
            raise Exception(
                "No prompt file for gpt evaluation provided. Please specify the prompt file for gpt evaluation!")

+        if args.gpt_model == "text-davinci-003" and args.gpt_with_reference:
+            raise Exception(
+                "GPT evaluation with reference is not supported for text-davinci-003. You should specify chat models such as gpt-3.5-turbo or gpt-4."
+            )
+
        # initialize evaluator
        evaluator = Evaluator(metrics_per_category, battle_prompt, gpt_evaluation_prompt, args.gpt_model,
-                              config["language"], config.get("path_for_UniEval", None))
+                              config["language"], config.get("path_for_UniEval", None), args.gpt_with_reference)
        if len(args.model_name_list) == 2:
            answers1 = jload(args.answer_file_list[0])
            answers2 = jload(args.answer_file_list[1])
@@ -92,6 +97,10 @@ if __name__ == '__main__':
                        default="gpt-3.5-turbo",
                        choices=["text-davinci-003", "gpt-3.5-turbo", "gpt-4"],
                        help='which GPT model to use for evaluation')
+    parser.add_argument('--gpt_with_reference',
+                        default=False,
+                        action="store_true",
+                        help='whether to include reference answer in gpt evaluation')
    parser.add_argument('--save_path', type=str, default="results", help='path to save evaluation results')
    parser.add_argument('--openai_key', type=str, default=None, required=True, help='Your openai key')
    args = parser.parse_args()

--- a/applications/Chat/evaluate/evaluator.py
+++ b/applications/Chat/evaluate/evaluator.py
@@ -16,13 +16,14 @@ class Evaluator(object):
    """

    def __init__(self, params: Dict[str, Any], battle_prompt: Dict[str, Any], gpt_evaluation_prompt: Dict[str, Any],
-                 gpt_model: str, language: str, path_for_UniEval: Dict[str, str]) -> None:
+                 gpt_model: str, language: str, path_for_UniEval: Dict[str, str], gpt_with_reference: bool) -> None:
        self.params = params
        self.battle_prompt = battle_prompt
        self.gpt_evaluation_prompt = gpt_evaluation_prompt
        self.gpt_model = gpt_model
        self.language = language
        self.path_for_UniEval = path_for_UniEval
+        self.gpt_with_reference = gpt_with_reference
        self.automatic_metric_stats = dict()
        self.unieval_metric_stats = dict()
        self.gpt_evaluation_results = dict()
@@ -157,8 +158,14 @@ class Evaluator(object):
                print(f"No prompt for category {category}! Use prompt for category general now.")
                prompt = self.gpt_evaluation_prompt["general"]

-            self.gpt_evaluation_results[category] = gpt_evaluate.evaluate(answers_per_category[category], prompt,
-                                                                          category_metrics, category, self.gpt_model)
+            self.gpt_evaluation_results[category] = gpt_evaluate.evaluate(
+                answers_per_category[category],
+                prompt,
+                category_metrics,
+                category,
+                self.gpt_model,
+                self.language,
+                references=targets_per_category[category] if self.gpt_with_reference else None)

    def save(self, path: str, model_name_list: List[str]) -> None:
        """

--- a/applications/Chat/evaluate/gpt_evaluate.py
+++ b/applications/Chat/evaluate/gpt_evaluate.py
@@ -13,6 +13,23 @@ import seaborn as sns
 import tqdm
 from utils import jdump, jload

+ref_step_template = {
+    "en":
+        "Now please compare the answer with the {adjective} answer, determine whether the answer is able to achieve the same level of {metric}.\n\n",
+    "cn":
+        "请比较答案与上面的{adjective}答案，确定答案是否可以达到与该{adjective}答案同样水平的{metric}。\n\n"
+}
+
+ref_answer_template_general = {
+    "en": "\nAn example answer with good quality is as follows:\n\n{answer}\n\n",
+    "cn": "\n一个优质的示例答案如下：\n\n{answer}\n\n"
+}
+
+ref_answer_template_correctness = {
+    "en": "\nA correct answer is as follows:\n\n{answer}\n\n",
+    "cn": "\n标准答案如下：\n\n{answer}\n\n"
+}
+

 def get_battle_result(sys_prompt: str, user_prompt: str, id: int, max_tokens: int = 2048) -> Dict[str, Any]:
    """
@@ -233,18 +250,125 @@ def save_battle_results(evaluations: List[Dict], name1: str, name2: str, save_pa
    print(f"Model {name2} average score: {ans2_score/(len(evaluations)-invalid_count):.2f}")


+def reference_template(metric: str, language: str, reference: Dict[str, Any]) -> str:
+    """
+    Get prompt template for GPT evaluation with reference.
+
+    Different languages have different prompt templates.
+
+    Args:
+        metric: metric used in GPT evaluation with reference.
+        language: language for the template.
+        reference: the instruction that contains target answer.
+
+    Returns:
+        Prompt template for GPT evaluation with reference.
+    """
+
+    step_to_add = ref_step_template[language]
+
+    for_the_given_answer = "{metric} (1-5) (directly give the score for the given answer):" if language == "en" else "{metric} (1-5) (直接对给定答案打分)"
+
+    # adjective is used to describe the word "answer" in the prompt.
+    adjective = "example" if language == "en" else "示例"
+    answer_to_add = ref_answer_template_general[language]
+
+    # Only for correctness, we will provide a correct answer and so the adjective for "answer" will be "correct". The prompt words will be "a correct answer".
+    # In other cases, the prompt words will be "an example answer with good quality" by default.
+    if metric.lower() == "correctness":
+        adjective = "correct" if language == "en" else "标准"
+        answer_to_add = ref_answer_template_correctness[language]
+
+    answer_to_add = answer_to_add.format(answer=reference["target"] if reference["target"] else reference["output"])
+    step_to_add = step_to_add.format(metric=metric.lower(),
+                                     adjective=adjective) + for_the_given_answer.format(metric=metric)
+
+    return answer_to_add + step_to_add
+
+
+def fill_in_message(role: str, content: str) -> Dict[str, str]:
+    """
+    Generate one formatted message to send through chat completion.
+
+    Args:
+        role: the role of the author of this message.
+        content: the contents of the message.
+
+    Returns:
+        One message to send through chat completion.
+    """
+
+    return {"role": role, "content": content}
+
+
+def multiturn_chat_completion(user_messages: List[str], model: str, max_tokens: int = 1, turns=2) -> Dict[str, Any]:
+    """
+    Do multi-turn chat completion.
+
+    When turns == 1, it is a one-turn conversation for normal GPT evaluation.
+    When turns == 2, it is a two-turn conversation which is used for GPT evaluation with reference answers.
+
+    Args:
+        user_messages: messages user wants to send.
+        model: the model used to evaluate answers.
+        max_tokens: the maximum number of tokens to generate in the chat completion.
+        turns: the number of turns for conversation.
+
+    Returns:
+        Last turn's response.
+    """
+
+    if len(user_messages) != turns:
+        raise Exception("The length of user messages should be equal to the turn number!")
+
+    assistant_responses = []
+
+    for i in range(turns):
+        messages_to_send = []
+
+        for j in range(i):
+            messages_to_send.append(fill_in_message("user", user_messages[j]))
+            messages_to_send.append(
+                fill_in_message("assistant", assistant_responses[j]["choices"][0]["message"]["content"]))
+
+        # Length of user messages == Length of assistant messages + 1
+        # Because we always expect the api to response
+        messages_to_send.append(fill_in_message("user", user_messages[i]))
+
+        response = openai.ChatCompletion.create(
+            model=model,
+            messages=messages_to_send,
+            temperature=0,
+            max_tokens=max_tokens,
+        )
+
+        # Avoid exceeding rate limits.
+        # You can comment this line if your request doesn't contain many tokens.
+        time.sleep(1)
+
+        assistant_responses.append(response)
+
+    return assistant_responses[-1]
+
+
 def get_gpt_evaluation_without_logprobs(prompt: Dict[str, Any],
                                        inst: Dict[str, Any],
                                        metrics: List[str],
+                                        language: str,
+                                        reference: Dict[str, Any] = None,
                                        model: str = "gpt-3.5-turbo",
                                        max_tokens: int = 2048) -> Dict[str, Any]:
    """
    Use chat models(gpt-3.5-turbo or gpt-4) to evaluate one model answer.

+    Temprature is set to 0 to make the model more deterministic.
+
    Args:
        prompt: a dictionary including prompt template, CoT and metrics.
        inst: the instruction that is needed to be evaluated.
        metrics: the metrics for evaluation.
+        language: language used to change the CoT(add one more step about comparing the given answer and reference) if reference is not None.
+        reference: the reference answer.
        model: the model used to evaluate answers.
        max_tokens: the maximum number of tokens to generate in the chat completion.

@@ -254,7 +378,7 @@ def get_gpt_evaluation_without_logprobs(prompt: Dict[str, Any],

    MAX_API_RETRY = 3

-    question = (inst["instruction"] if inst["input"] == "" else inst["instruction"] + " " + inst["input"])
+    question = (inst["instruction"] if inst["input"] == "" else inst["instruction"] + "\n" + inst["input"])
    answer = inst["output"]
    inst["evaluation"] = {}

@@ -265,28 +389,34 @@ def get_gpt_evaluation_without_logprobs(prompt: Dict[str, Any],
            )
        for i in range(MAX_API_RETRY):
            try:
-                response = openai.ChatCompletion.create(
-                    model=model,
-                    messages=[
-                        {
-                            "role":
-                                "user",
-                            "content":
-                                prompt["prompt"].format(
-                                    question=question,
-                                    answer=answer,
-                                    metric=prompt["metrics"][metric],
-                                    steps=prompt["CoT"][metric],
-                                ),
-                        },
-                    ],
-                    temperature=0,
-                    max_tokens=max_tokens,
+                prompt_reference = "" if reference is None else reference_template(metric, language, reference)
+
+                prompt_1st_round = prompt["prompt"].format(
+                    question=question,
+                    answer=answer,
+                    metric=prompt["metrics"][metric],
+                    steps=prompt["CoT"][metric],
                )
+
+                if prompt_reference:
+                    # Do a 2-round conversation
+                    response = multiturn_chat_completion([prompt_1st_round, prompt_reference],
+                                                         model,
+                                                         max_tokens=max_tokens,
+                                                         turns=2)
+                else:
+                    response = multiturn_chat_completion([prompt_1st_round], model, max_tokens=max_tokens, turns=1)
+
                inst["evaluation"][metric] = {
                    "response": response["choices"][0]["message"]["content"],
                    "logprobs": None,
                }
+
+                # Prevent exceeding rate limits because we have multiple workers.
+                # But this will slow down the evaluation process.
+                # You can comment this line if your request doesn't contain many tokens.
+                time.sleep(len(metrics) * 0.5)
+
                break
            except Exception as e:
                print(e)
@@ -305,6 +435,8 @@ def get_gpt_evaluation_with_logprobs(prompt: Dict[str, Any],
    Use completion model(text-davinci-003) to evaluate one model answer.
    Only completion models can return log probabilities.

+    Temprature is set to 0 to make the model more deterministic.
+
    Args:
        prompt: a dictionary including prompt template, CoT and metrics.
        inst: the instruction that is needed to be evaluated.
@@ -317,7 +449,7 @@ def get_gpt_evaluation_with_logprobs(prompt: Dict[str, Any],

    MAX_API_RETRY = 3

-    question = (inst["instruction"] if inst["input"] == "" else inst["instruction"] + " " + inst["input"])
+    question = (inst["instruction"] if inst["input"] == "" else inst["instruction"] + "\n" + inst["input"])
    answer = inst["output"]
    inst["evaluation"] = {}

@@ -344,6 +476,12 @@ def get_gpt_evaluation_with_logprobs(prompt: Dict[str, Any],
                    "response": response["choices"][0]["text"],
                    "logprobs": response["choices"][0]["logprobs"]["top_logprobs"],
                }
+
+                # Prevent exceeding rate limits because we have multiple workers.
+                # But this will slow down the evaluation process.
+                # You can comment this line if your request doesn't contain many tokens.
+                time.sleep(len(metrics) * 0.5)
+
                break
            except Exception as e:
                print(e)
@@ -354,7 +492,13 @@ def get_gpt_evaluation_with_logprobs(prompt: Dict[str, Any],
    return inst


-def evaluate(answers: List[Dict], prompt: Dict[str, Any], metrics: List[str], category: str, model: str) -> List[Dict]:
+def evaluate(answers: List[Dict],
+             prompt: Dict[str, Any],
+             metrics: List[str],
+             category: str,
+             model: str,
+             language: str,
+             references: List[Dict] = None) -> List[Dict]:
    """
    Use GPT models to evaluate model answers and save evaluation results.

@@ -364,6 +508,8 @@ def evaluate(answers: List[Dict], prompt: Dict[str, Any], metrics: List[str], ca
        metrics: metrics for GPT evaluation.
        category: the category of the model answers for evaluation.
        model: the specific GPT model used to evaluate answers.
+        language: language used in GPT evaluation
+        references: references for GPT evaluation

    Returns:
        Evaluations of the given answers.
@@ -378,12 +524,19 @@ def evaluate(answers: List[Dict], prompt: Dict[str, Any], metrics: List[str], ca

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = []
-        for inst in answers:
+        for idx, inst in enumerate(answers):
            # Completion models can return log probabilities.
            if model == "text-davinci-003":
                future = executor.submit(get_gpt_evaluation_with_logprobs, prompt, inst, metrics, 1)
            else:
-                future = executor.submit(get_gpt_evaluation_without_logprobs, prompt, inst, metrics, model, 1)
+                future = executor.submit(get_gpt_evaluation_without_logprobs,
+                                         prompt,
+                                         inst,
+                                         metrics,
+                                         language,
+                                         reference=None if references is None else references[idx],
+                                         model=model,
+                                         max_tokens=1)

            futures.append(future)


--- a/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_cn.json
+++ b/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_cn.json
@@ -7,14 +7,14 @@
      "relevance": "切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。",
      "creativity": "创意性(1-5)：某些头脑风暴问题可能需要答案具有创意，提出新的思路。",
      "practicality": "实用性(1-5)：某些头脑风暴问题可能需要答案提出实用的建议或解决方法。",
-      "correctness": "正确性(1-5)：答案应该符合常识、生活实际等等。"
+      "reasonableness": "合理性(1-5)：答案应该符合常识、生活实际等等。"
    },
    "CoT": {
      "language organization": "1. 阅读答案，并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性，能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关，并且能够传达清晰的信息。\n4. 检查答案是否连贯，是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式，使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织，并给出一个1到5的分数，其中5表示语言组织非常好，而1表示语言组织非常差。\n\n语言组织：",
      "relevance": "1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。\n2. 阅读答案，确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。\n\n切题：",
      "creativity": "1. 仔细阅读所提供的头脑风暴问题，确保你理解问题的要点和背景。\n2. 根据你的知识和经验，判断所提供的答案是否可行。如果答案不可行，则创意性评分可能会受到影响。\n3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠，但仍然可以被认为是有创意的，只要它提供了新的角度或方法来解决问题。\n4. 根据答案的创意性，给出一个1到5的评分。如果答案缺乏创意，则应给出一个较低的评分。如果答案具有创意并提供了新的思路，应给出一个较高的评分。\n\n创意性：",
      "practicality": "1. 仔细阅读所提供的头脑风暴问题，确保你理解问题的要点和背景。\n2. 根据你的知识和经验，判断所提供的答案是否可行。如果答案不可行，则实用性评分可能会受到影响。\n3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好，但如果无法实现或应用，则实用性评分可能会受到影响。\n4. 根据答案的实用性，给出一个1到5的评分。如果答案缺乏实用性，则应给出一个较低的评分。如果答案提出了实用的建议或解决方法，并且可以很好地解决问题，则应给出一个较高的评分。\n\n实用性：",
-      "correctness": "1. 仔细阅读所提供的头脑风暴问题，确保你理解问题的要点和背景。\n2. 根据你的知识和经验，判断所提供的答案是否可行。如果答案不可行，则正确性评分可能会受到影响。\n3. 考虑答案中所提供的信息是否正确、符合常识、生活实际等等。如果答案中存在明显的错误或不合理之处，则正确性评分可能会受到影响。\n4. 根据答案的正确性，给出一个1到5的评分。如果答案存在明显的错误或不合理之处，则应给出一个较低的评分。如果答案正确、符合常识、生活实际等等，则应给出一个较高的评分。\n\n正确性："
+      "reasonableness": "1. 仔细阅读所提供的头脑风暴问题，确保你理解问题的要点和背景。\n2. 根据你的知识和经验，判断所提供的答案是否可行。如果答案不可行，则合理性评分可能会受到影响。\n3. 考虑答案中所提供的信息是否合理、符合常识、生活实际等等。如果答案中存在明显的不合理之处，则合理性评分可能会受到影响。\n4. 根据答案的合理性，给出一个1到5的评分。如果答案存在明显的不合理之处，则应给出一个较低的评分。如果答案合理、符合常识、生活实际等等，则应给出一个较高的评分。\n\n合理性："
    },
    "prompt": "你是一个好助手。请你为下面“头脑风暴”问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
  },

--- a/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_en.json
+++ b/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_en.json