[feature] ColossalEval: Evaluation Pipeline for LLMs (#4786)

* Add ColossalEval
* Delete evaluate in Chat

---------

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>

Unverified commit ce777853, authored Sep 24, 2023 by Yuanchen, committed by GitHub Sep 24, 2023
18 changed files
--- a/applications/Chat/evaluate/prompt/battle_prompt/battle_prompt_en.json
+++ b/applications/Chat/evaluate/prompt/battle_prompt/battle_prompt_en.json
--- a/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_cn.json
+++ b/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_cn.json
@@ -39,53 +39,8 @@
    },
    "prompt": "你是一个好助手。请你为下面的“补全对话”问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
  },
-  "classification": {
-    "id": 3,
-    "category": "classification",
-    "metrics": {
-      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
-      "relevance": "切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。",
-      "correctness": "正确性(1-5)：答案是否正确。"
-    },
-    "CoT": {
-      "language organization": "1. 阅读答案，并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性，能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关，并且能够传达清晰的信息。\n4. 检查答案是否连贯，是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式，使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织，并给出一个1到5的分数，其中5表示语言组织非常好，而1表示语言组织非常差。\n\n语言组织：",
-      "relevance": "1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。\n2. 阅读答案，确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。\n\n切题：",
-      "correctness": "1. 仔细阅读题目，尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的，则可以将正确性得分为5分。如果答案是部分正确的，则可以给予适当的得分，例如2分、3分或4分。如果答案完全不正确，则只得1分。\n\n正确性："
-    },
-    "prompt": "你是一个好助手。请你为下面的“分类“问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
-  },
-  "closed_qa": {
-    "id": 4,
-    "category": "closed_qa",
-    "metrics": {
-      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
-      "relevance": "切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。",
-      "correctness": "正确性(1-5)：答案是否正确。"
-    },
-    "CoT": {
-      "language organization": "1. 阅读答案，并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性，能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关，并且能够传达清晰的信息。\n4. 检查答案是否连贯，是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式，使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织，并给出一个1到5的分数，其中5表示语言组织非常好，而1表示语言组织非常差。\n\n语言组织：",
-      "relevance": "1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。\n2. 阅读答案，确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。\n\n切题：",
-      "correctness": "1. 仔细阅读题目，尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的，则可以将正确性得分为5分。如果答案是部分正确的，则可以给予适当的得分，例如2分、3分或4分。如果答案完全不正确，则只得1分。\n\n正确性："
-    },
-    "prompt": "你是一个好助手。请你为下面问题的答案打分。\n\n问题如下：\n\n{question}\n\n需要你评分的答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
-  },
-  "extraction": {
-    "id": 5,
-    "category": "extraction",
-    "metrics": {
-      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
-      "relevance": "切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。",
-      "correctness": "准确性(1-5)：回答应该准确无误地提取出所需信息，不应该包含任何错误或误导性信息。"
-    },
-    "CoT": {
-      "language organization": "1. 阅读答案，并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性，能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关，并且能够传达清晰的信息。\n4. 检查答案是否连贯，是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式，使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织，并给出一个1到5的分数，其中5表示语言组织非常好，而1表示语言组织非常差。\n\n语言组织：",
-      "relevance": "1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。\n2. 阅读答案，确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。\n\n切题：",
-      "correctness": "1. 仔细阅读问题并确定需要从材料中提取的信息。\n2. 仔细阅读回答并确保它涵盖了所有需要提取的信息。\n3. 使用所提供的材料来验证回答的准确性。如果回答不准确或包含错误或误导性信息，则无法给出高分。\n4. 检查回答是否包含所有要求提取的信息，不要漏掉任何重要细节。\n5. 根据回答的准确性和完整性，给出一个介于1和5之间的分数，5分表示回答非常准确且完整，1分表示回答几乎没有提取出所需信息。\n\n准确性："
-    },
-    "prompt": "你是一个好助手。请你为下面的“提取”问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
-  },
  "generation": {
-    "id": 6,
+    "id": 3,
    "category": "generation",
    "metrics": {
      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
@@ -100,7 +55,7 @@
    "prompt": "你是一个好助手。请你为下面的“生成”问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
  },
  "open_qa": {
-    "id": 7,
+    "id": 4,
    "category": "open_qa",
    "metrics": {
      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
@@ -114,23 +69,8 @@
    },
    "prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
  },
-  "rewriting": {
-    "id": 8,
-    "category": "rewriting",
-    "metrics": {
-      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
-      "relevance": "切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。",
-      "correctness": "正确性(1-5)：答案是否正确。"
-    },
-    "CoT": {
-      "language organization": "1. 阅读答案，并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性，能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关，并且能够传达清晰的信息。\n4. 检查答案是否连贯，是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式，使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织，并给出一个1到5的分数，其中5表示语言组织非常好，而1表示语言组织非常差。\n\n语言组织：",
-      "relevance": "1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。\n2. 阅读答案，确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。\n\n切题：",
-      "correctness": "1. 仔细阅读题目，尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的，则可以将正确性得分为5分。如果答案是部分正确的，则可以给予适当的得分，例如2分、3分或4分。如果答案完全不正确，则只得1分。\n\n正确性："
-    },
-    "prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
-  },
  "roleplay": {
-    "id": 9,
+    "id": 5,
    "category": "roleplay",
    "metrics": {
      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
@@ -146,33 +86,14 @@
    },
    "prompt": "你是一个好助手。请你为下面的“角色扮演”问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
  },
-  "summarization": {
+  "Other": {
-    "id": 10,
+    "id": 6,
-    "category": "summarization",
+    "category": "Other",
-    "metrics": {
-      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
-      "relevance": "切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。",
-      "correctness": "准确性(1-5)：回答应该准确无误地总结出材料的重点。",
-      "conciseness": "简明扼要(1-5)：答案是否简明扼要，没有冗余内容。"
-    },
-    "CoT": {
-      "language organization": "1. 阅读答案，并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性，能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关，并且能够传达清晰的信息。\n4. 检查答案是否连贯，是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式，使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织，并给出一个1到5的分数，其中5表示语言组织非常好，而1表示语言组织非常差。\n\n语言组织：",
-      "relevance": "1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。\n2. 阅读答案，确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。\n\n切题：",
-      "correctness": "1. 仔细阅读问题给的材料，理解其内容和要点。\n2. 评估回答是否准确地总结出原始材料的重点。\n3. 评估回答是否包含原始材料中的所有关键信息。\n4. 根据以上步骤，给出一个1-5的分数，其中1表示回答不能准确地总结出材料的重点，5表示回答完全准确地总结出材料的重点。\n\n准确性：",
-      "conciseness": "1. 阅读题目，提取出材料的重点。\n2. 阅读该总结，并注意其中的主要观点和信息。\n3. 评估总结的长度。一个简明扼要的总结通常应该在几句话或几段文字内传达关键信息，而不是冗长的段落或文章。\n4. 检查总结是否包含与主要观点无关的信息或冗余信息。\n5.确定总结涵盖了材料中的关键信息，并且没有忽略任何重要细节。\n6.给总结打出1-5的分数，其中5表示总结简明扼要，没有冗余内容，而1表示总结冗长或包含不必要的信息，难以理解或记忆。根据您的判断，打出适当的得分。\n\n简明扼要："
-    },
-    "prompt": "你是一个好助手。请你为下面的“总结”问题的答案打分。\n\n问题如下：\n\n{question}\n\n答案如下：\n\n{answer}\n\n评分的指标如下：\n\n{metric}\n\n请你遵照以下的评分步骤：\n\n{steps}"
-  },
-  "general": {
-    "id": 11,
-    "category": "general",
    "metrics": {
-      "language organization": "语言组织(1-5)：答案语言是否流畅、连贯，使用正确的语法，具有一定逻辑性，使用恰当的连接词、过渡词等等。",
      "relevance": "切题(1-5)：答案内容是否切题，不答非所问，并且严格遵照题目要求。",
      "correctness": "正确性(1-5)：答案是否正确。"
    },
    "CoT": {
-      "language organization": "1. 阅读答案，并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性，能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关，并且能够传达清晰的信息。\n4. 检查答案是否连贯，是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式，使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织，并给出一个1到5的分数，其中5表示语言组织非常好，而1表示语言组织非常差。\n\n语言组织：",
      "relevance": "1. 阅读题目，确定题目所问的问题是什么，以及需要回答哪些方面的问题。\n2. 阅读答案，确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求，包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度，并给出一个1到5的分数，其中5表示答案非常切题，而1表示答案完全没有切题。\n\n切题：",
      "correctness": "1. 仔细阅读题目，尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的，则可以将正确性得分为5分。如果答案是部分正确的，则可以给予适当的得分，例如2分、3分或4分。如果答案完全不正确，则只得1分。\n\n正确性："
    },
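
Review note: the hunks above drop several categories and renumber the `id` values of the ones that remain (`generation` 6 to 3, `open_qa` 7 to 4, `roleplay` 9 to 5, and the new `Other` entry takes 6). The diff does not say whether anything depends on the ids being consecutive, so the following is only a sketch of a sanity check under that assumption. `ids_are_consecutive` is a hypothetical helper, not part of the repository.

```python
def ids_are_consecutive(entries):
    """Return True when the entries' ids form an unbroken ascending run."""
    ids = sorted(entry["id"] for entry in entries.values())
    if not ids:
        return True
    return ids == list(range(ids[0], ids[0] + len(ids)))


# ids on the "+" side of the hunks above
after = {
    "generation": {"id": 3},
    "open_qa": {"id": 4},
    "roleplay": {"id": 5},
    "Other": {"id": 6},
}

# ids the surviving categories carried on the "-" side (what a delete
# without renumbering would have left behind)
before = {
    "generation": {"id": 6},
    "open_qa": {"id": 7},
    "roleplay": {"id": 9},
}

print(ids_are_consecutive(after))
print(ids_are_consecutive(before))
```

The first call reports a gap-free run, while the second shows the gap that renumbering avoids.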

--- a/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_en.json
+++ b/applications/Chat/evaluate/prompt/evaluation_prompt/evaluation_prompt_en.json
@@ -39,53 +39,8 @@
    },
    "prompt": "You are a good assistant. Please rate the given answer to the \"chat\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
  },
-  "classification": {
-    "id": 3,
-    "category": "classification",
-    "metrics": {
-      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
-      "relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
-      "correctness": "Correctness (1-5): whether the answer is correct or not."
-    },
-    "CoT": {
-      "language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
-      "relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
-      "correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
-    },
-    "prompt": "You are a good assistant. Please rate the given answer to the \"classification\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
-  },
-  "closed_qa": {
-    "id": 4,
-    "category": "closed_qa",
-    "metrics": {
-      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
-      "relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
-      "correctness": "Correctness (1-5): whether the answer is correct or not."
-    },
-    "CoT": {
-      "language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
-      "relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
-      "correctness": "1. Read the question carefully and try to answer the question by yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
-    },
-    "prompt": "You are a good assistant. Please rate the given answer to the \"closed qa\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
-  },
-  "extraction": {
-    "id": 5,
-    "category": "extraction",
-    "metrics": {
-      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
-      "relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
-      "correctness": "correctness (1-5): Answers should extract the required information accurately and should not contain any incorrect or misleading information."
-    },
-    "CoT": {
-      "language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
-      "relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
-      "correctness": "1. Read the questions carefully and identify the information that needs to be extracted from the material.\n2. Read the answer carefully and make sure it covers all the information that needs to be extracted.\n3. Use the material provided to verify the correctness of the response. If the response is inaccurate or contains incorrect or misleading information, a high score cannot be given.\n4. Check that the answer contains all the information required to be extracted and do not leave out any important details.\n5. Give a score between 1 and 5 based on the correctness and completeness of the response, with a score of 5 indicating a very accurate and complete response and a score of 1 indicating that the response barely extracts the required information.\n\nCorrectness:"
-    },
-    "prompt": "You are a good assistant. Please rate the given answer to the \"extraction\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
-  },
  "generation": {
-    "id": 6,
+    "id": 3,
    "category": "generation",
    "metrics": {
      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
@@ -100,7 +55,7 @@
    "prompt": "You are a good assistant. Please rate the given answer to the \"generation\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
  },
  "open_qa": {
-    "id": 7,
+    "id": 4,
    "category": "open_qa",
    "metrics": {
      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
@@ -114,23 +69,8 @@
    },
    "prompt": "You are a good assistant. Please rate the answers to the \"open qa\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
  },
-  "rewriting": {
-    "id": 8,
-    "category": "rewriting",
-    "metrics": {
-      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
-      "relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
-      "correctness": "Correctness (1-5): whether the answer is correct or not."
-    },
-    "CoT": {
-      "language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
-      "relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
-      "correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
-    },
-    "prompt": "You are a good assistant. Please rate the answers to the \"rewriting\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
-  },
  "roleplay": {
-    "id": 9,
+    "id": 5,
    "category": "roleplay",
    "metrics": {
      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
@@ -146,35 +86,17 @@
    },
    "prompt": "You are a good assistant. Please rate the given answer to the \"role-play\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
  },
-  "summarization": {
+  "Other": {
-    "id": 10,
+    "id": 6,
-    "category": "summarization",
+    "category": "Other",
-    "metrics": {
-      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
-      "relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
-      "correctness": "Correctness (1-5): answers should summarize the main points of the material accurately and unambiguously.",
-      "conciseness": "Conciseness (1-5): answers should be concise and without redundant content."
-    },
-    "CoT": {
-      "language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
-      "relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
-      "correctness": "1. Read the material given in the question carefully to understand its content and main points.\n2. Assess whether the answer accurately summarizes the key points of the source material.\n3. assess whether the response contains all the key information in the source material.\n4. Based on the above steps, give a score of 1-5, where 1 means that the response does not accurately summarize the main points of the material and 5 means that the response completely accurately summarizes the main points of the material.\n\nCorrectness:",
-      "conciseness": "1. Read the title and extract the main points of the material.\n2. Read the summary and note the main ideas and messages in it.\n3. Assess the length of the summary. A concise summary should usually convey key information within a few sentences or paragraphs, rather than lengthy paragraphs or essays.\n4. Check that the summary does not contain information that is not relevant to the main ideas or that is redundant.\n5. Make sure that the summary covers the key information in the material and that no important details have been omitted.\n6. Rate the summary on a scale of 1-5, where 5 means the summary is concise and free of redundancy, and 1 means the summary is lengthy or contains unnecessary information that is difficult to understand or remember. Based on your judgment, assign the appropriate score.\n\nConciseness:"
-    },
-    "prompt": "You are a good assistant. Please rate the given answer to the \"summarization\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
-  },
-  "general": {
-    "id": 11,
-    "category": "general",
    "metrics": {
-      "language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
      "relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
      "correctness": "Correctness (1-5): whether the answer is correct or not."
    },
    "CoT": {
      "language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
      "relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
-      "correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
+      "correctness": "1. Read the question carefully and try to answer the question by yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
    },
    "prompt": "You are a good assistant. Please rate the given answer to the question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
  }
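
Review note: every `prompt` string in these files carries the same four placeholders, `{question}`, `{answer}`, `{metric}`, and `{steps}`, with `metrics` and `CoT` keyed by metric name. The code that fills them is not part of this diff, so the sketch below assumes plain `str.format` substitution. `build_prompt` and the trimmed sample entry are hypothetical and only mirror the shape of the `Other` category.

```python
# Hypothetical, trimmed-down entry with the same keys as the "Other" category.
category_entry = {
    "id": 6,
    "category": "Other",
    "metrics": {
        "correctness": "Correctness (1-5): whether the answer is correct or not.",
    },
    "CoT": {
        # Steps text shortened for this sketch; the real string is longer.
        "correctness": "1. Read the question carefully.\n2. Check the answer.\n\nCorrectness:",
    },
    "prompt": (
        "You are a good assistant. Please rate the given answer to the question below.\n\n"
        "The question is as follows:\n\n{question}\n\n"
        "The answer is as follows:\n\n{answer}\n\n"
        "The metric for evaluation is as follows:\n\n{metric}\n\n"
        "You should follow the following evaluation steps:\n\n{steps}"
    ),
}


def build_prompt(entry, metric_name, question, answer):
    """Fill one category's template for a single metric (assumed str.format semantics)."""
    return entry["prompt"].format(
        question=question,
        answer=answer,
        metric=entry["metrics"][metric_name],
        steps=entry["CoT"][metric_name],
    )


filled = build_prompt(category_entry, "correctness", "What is 2 + 2?", "4")
print(filled)
```

Because the CoT text ends with the metric label (here `Correctness:`), the filled prompt ends on that label, which is presumably where the judging model is expected to write its score.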

--- a/applications/ColossalEval/examples/dataset_evaluation/config/evaluation/config.json
+++ b/applications/ColossalEval/examples/dataset_evaluation/config/evaluation/config.json
+{
+  "model": [
+    {
+      "name": "model1"
+    },
+    {
+      "name": "model2"
+    }
+  ],
+  "dataset": [
+    {
+      "name": "mmlu",
+      "metrics": [
+        "first_token_accuracy",
+        "single_choice_accuracy",
+        "perplexity",
+        "ppl_score",
+        "ppl_score_over_choices"
+      ]
+    },
+    {
+      "name": "cmmlu",
+      "metrics": [
+        "first_token_accuracy",
+        "single_choice_accuracy",
+        "perplexity",
+        "ppl_score",
+        "ppl_score_over_choices"
+      ]
+    },
+    {
+      "name": "agieval",
+      "metrics": [
+        "first_token_accuracy",
+        "single_choice_accuracy",
+        "multi_choice_accuracy",
+        "math_equivalence",
+        "perplexity",
+        "ppl_score_over_choices",
+        "ppl_score"
+      ]
+    },
+    {
+      "name": "gaokaobench",
+      "metrics": [
+        "first_token_accuracy",
+        "single_choice_accuracy",
+        "multi_choice_accuracy",
+        "math_equivalence",
+        "rouge_score",
+        "rouge_zh_score",
+        "perplexity",
+        "ppl_score_over_choices",
+        "ppl_score"
+      ]
+    }
+  ]
+}
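
Review note: the evaluation config above lists, per dataset, the metrics to compute. As a reading aid, here is a small sketch that loads such a config with the standard `json` module and inverts it to show which datasets request each metric. `CONFIG_TEXT` is an abbreviated copy of the file (two datasets, two metrics each), and `datasets_by_metric` is a hypothetical helper, not a ColossalEval function.

```python
import json
from collections import defaultdict

# Abbreviated stand-in for config/evaluation/config.json.
CONFIG_TEXT = """
{
  "model": [{"name": "model1"}, {"name": "model2"}],
  "dataset": [
    {"name": "mmlu", "metrics": ["first_token_accuracy", "perplexity"]},
    {"name": "gaokaobench", "metrics": ["first_token_accuracy", "rouge_score"]}
  ]
}
"""


def datasets_by_metric(config):
    """Map each metric name to the datasets that list it, in file order."""
    index = defaultdict(list)
    for dataset in config["dataset"]:
        for metric in dataset["metrics"]:
            index[metric].append(dataset["name"])
    return dict(index)


config = json.loads(CONFIG_TEXT)
metric_index = datasets_by_metric(config)
for metric, names in sorted(metric_index.items()):
    print(f"{metric}: {', '.join(names)}")
```

A view like this makes it easy to see that, for instance, `rouge_score` is requested only for `gaokaobench` in the full file.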
--- a/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json
+++ b/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json
+{
+  "model": [
+    {
+      "name": "model name",
+      "model_class": "HuggingFaceCausalLM",
+      "parameters": {
+        "path": "path to model",
+        "model_max_length": 4096,
+        "tokenizer_path": "",
+        "tokenizer_kwargs": {
+          "trust_remote_code": true
+        },
+        "peft_path": null,
+        "model_kwargs": {
+          "torch_dtype": "torch.float32",
+          "trust_remote_code": true
+        },
+        "prompt_template": "plain",
+        "batch_size": 4
+      }
+    },
+    {
+      "name": "model2 name",
+      "model_class": "HuggingFaceCausalLM",
+      "parameters": {
+        "path": "path to model2",
+        "model_max_length": 4096,
+        "tokenizer_path": "",
+        "tokenizer_kwargs": {
+          "trust_remote_code": true
+        },
+        "peft_path": null,
+        "model_kwargs": {
+          "torch_dtype": "torch.float32",
+          "trust_remote_code": true
+        },
+        "prompt_template": "plain",
+        "batch_size": 4
+      }
+    }
+  ],
+  "dataset": [
+    {
+      "name": "agieval",
+      "dataset_class": "AGIEvalDataset",
+      "debug": false,
+      "few_shot": false,
+      "path": "path to original dataset (folder)",
+      "save_path": "path to save converted dataset (e.g. inference_data/agieval.json)"
+    },
+    {
+      "name": "ceval",
+      "dataset_class": "CEvalDataset",
+      "debug": false,
+      "few_shot": true,
+      "path": "path to original dataset (folder)",
+      "save_path": "path to save converted dataset (e.g. inference_data/ceval.json)"
+    },
+    {
+      "name": "cmmlu",
+      "dataset_class": "CMMLUDataset",
+      "debug": false,
+      "few_shot": true,
+      "path": "path to original dataset (folder)",
+      "save_path": "path to save converted dataset (e.g. inference_data/cmmlu.json)"
+    },
+    {
+      "name": "gaokaobench",
+      "dataset_class": "GaoKaoBenchDataset",
+      "debug": false,
+      "few_shot": false,
+      "path": "path to original dataset (folder)",
+      "save_path": "path to save converted dataset (e.g. inference_data/gaokaobench.json)"
+    },
+    {
+      "name": "mmlu",
+      "dataset_class": "MMLUDataset",
+      "debug": false,
+      "few_shot": true,
+      "path": "path to original dataset (folder)",
+      "save_path": "path to save converted dataset (e.g. inference_data/mmlu.json)"
+    }
+  ]
+}
--- a/applications/ColossalEval/examples/dataset_evaluation/eval_dataset.py
+++ b/applications/ColossalEval/examples/dataset_evaluation/eval_dataset.py
+import argparse
+import os
+import tabulate
+from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator
+from colossal_eval.utils import jdump, jload
+def main(args):
+    config = jload(args.config)
+    evaluation_results = {dataset["name"]: {} for dataset in config["dataset"]}
+    evaluation_results_table = {dataset["name"]: {} for dataset in config["dataset"]}
+    evaluator = DatasetEvaluator()
+    for dataset_parameter in config["dataset"]:
+        dataset_name = dataset_parameter["name"]
+        metrics = dataset_parameter["metrics"]
+        results_metric_model = {metric: {model["name"]: None for model in config["model"]} for metric in metrics}
+        for model in config["model"]:
+            model_name = model["name"]
+            data = jload(
+                os.path.join(args.inference_results_path, model_name, f"{dataset_name}_inference_results.json")
+            )
+            results = evaluator.get_evaluation_results(data, dataset_name, model_name, metrics)
+            for metric, score in results.items():
+                results_metric_model[metric][model_name] = score["ALL"]
+            evaluation_results[dataset_name][model_name] = results
+        evaluation_results_table[dataset_name] = results_metric_model
+    table = []
+    header = ["dataset", "metric"] + [model["name"] for model in config["model"]]
+    table.append(header)
+    for dataset_parameter in config["dataset"]:
+        dataset_name = dataset_parameter["name"]
+        metrics = dataset_parameter["metrics"]
+        for metric, model_results in evaluation_results_table[dataset_name].items():
+            # One row per (dataset, metric): dataset name, metric name, then one score per model.
+            row = [dataset_name, metric] + ["{:.02f}".format(score) for score in model_results.values()]
+            table.append(row)
+    table = tabulate.tabulate(table, headers="firstrow")
+    print(table)
+    os.makedirs(args.evaluation_results_save_path, exist_ok=True)
+    with open(os.path.join(args.evaluation_results_save_path, "evaluation_results_table.txt"), "w") as file:
+        file.write(table)
+    jdump(evaluation_results, os.path.join(args.evaluation_results_save_path, "evaluation_results.json"))
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="ColossalEval evaluation process.")
+    parser.add_argument("--config", type=str, default=None, required=True, help="path to config file")
+    parser.add_argument(
+        "--inference_results_path", type=str, default=None, required=True, help="path to inference results"
+    )
+    parser.add_argument(
+        "--evaluation_results_save_path",
+        type=str,
+        default=None,
+        required=True,
+        help="path to save evaluation results",
+    )
+    args = parser.parse_args()
+    main(args)
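
The table-building step of `eval_dataset.py` can be sketched in isolation as below. This is an illustrative version under stated assumptions, not the script's actual code: `build_rows` is a hypothetical name, and rendering a missing score as `"N/A"` is an added assumption (the script above formats every score directly).

```python
from typing import Dict, List, Optional


def build_rows(
    table_results: Dict[str, Dict[str, Dict[str, Optional[float]]]], model_names: List[str]
) -> List[List[str]]:
    """Turn {dataset: {metric: {model: score}}} into a header row plus one row per (dataset, metric)."""
    rows = [["dataset", "metric"] + model_names]
    for dataset_name, metric_results in table_results.items():
        for metric, model_scores in metric_results.items():
            cells = []
            for name in model_names:
                score = model_scores.get(name)
                # Assumption: a missing score is shown as "N/A" instead of raising.
                cells.append("N/A" if score is None else f"{score:.2f}")
            rows.append([dataset_name, metric] + cells)
    return rows
```

The returned list has the same shape that the script passes to `tabulate.tabulate(table, headers="firstrow")`.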
--- a/applications/ColossalEval/examples/dataset_evaluation/eval_dataset.sh
+++ b/applications/ColossalEval/examples/dataset_evaluation/eval_dataset.sh
+python eval_dataset.py \
+    --config "path to config file" \
+    --inference_results_path "path to inference results" \
+    --evaluation_results_save_path "path to save evaluation results"
--- a/applications/ColossalEval/examples/dataset_evaluation/inference.py
+++ b/applications/ColossalEval/examples/dataset_evaluation/inference.py
+import argparse
+import copy
+import os
+from typing import Dict, List
+import torch
+import torch.distributed as dist
+from colossal_eval import dataset, models, utils
+import colossalai
+from colossalai.logging import get_dist_logger
+logger = get_dist_logger()
+def rm_and_merge(world_size: int, save_path: str, model_names: List[str], dataset_names: Dict[str, List]) -> None:
+    """
+    Remove inference result per rank and merge them into one file.
+    Args:
+        world_size: Number of processes for inference.
+        save_path: The folder for storing inference results.
+        model_names: Names of models for inference.
+        dataset_names: Names of dataset for inference.
+    """
+    for model_name in model_names:
+        for dataset_name, categories in dataset_names.items():
+            all_answers = {}
+            for category in categories:
+                all_answers[category] = {"data": []}
+                answers = {"data": []}
+                for r in range(world_size):
+                    directory = os.path.join(
+                        save_path, model_name, f"{dataset_name}_{category}_inference_results_rank{r}.json"
+                    )
+                    if not os.path.exists(directory):
+                        raise Exception(
+                            f"File {directory} not found. There may have been an error during inference."
+                        )
+                    else:
+                        rank_answers = utils.jload(directory)
+                        answers["data"].extend(rank_answers["data"])
+                        answers["inference_kwargs"] = rank_answers["inference_kwargs"]
+                for r in range(world_size):
+                    try:
+                        directory = os.path.join(
+                            save_path, model_name, f"{dataset_name}_{category}_inference_results_rank{r}.json"
+                        )
+                        os.remove(directory)
+                    except Exception as e:
+                        print(e)
+                all_answers[category] = answers
+            logger.info(f"Save inference results of model {model_name} on dataset {dataset_name}.")
+            utils.jdump(all_answers, os.path.join(save_path, model_name, f"{dataset_name}_inference_results.json"))
+        logger.info(f"Save inference results of model {model_name} for all datasets.")
+    logger.info("Save inference results of all models for all datasets.")
+def main(args):
+    colossalai.launch_from_torch(config={}, seed=42)
+    world_size = dist.get_world_size()
+    rank = dist.get_rank()
+    inference_data = {}
+    debug_args = {}
+    few_shot_args = {}
+    config = utils.jload(args.config)
+    model_parameters = config["model"]
+    dataset_parameters = config["dataset"]
+    for dataset_parameter in dataset_parameters:
+        path = dataset_parameter["path"]
+        save_path = dataset_parameter["save_path"]
+        dataset_name = dataset_parameter["name"]
+        debug_args[dataset_name] = dataset_parameter["debug"]
+        few_shot_args[dataset_name] = dataset_parameter["few_shot"]
+        if not args.load_dataset:
+            if os.path.exists(save_path):
+                dataset_ = utils.jload(save_path)
+                inference_data[dataset_name] = dataset_["test"]
+            else:
+                raise Exception(
+                    "Can't find the converted dataset. Pass --load_dataset to convert and store the dataset first."
+                )
+            continue
+        dataset_class = getattr(dataset, dataset_parameter["dataset_class"])
+        if not issubclass(dataset_class, dataset.BaseDataset):
+            raise ValueError(f"Dataset class {dataset_parameter['dataset_class']} is not a subclass of BaseDataset.")
+        dataset_ = dataset_class(path, logger, dataset_parameter["few_shot"])
+        dataset_.save(save_path)
+        inference_data[dataset_name] = dataset_.dataset["test"]
+    for model_parameter in model_parameters:
+        model_name = model_parameter["name"]
+        model_class = getattr(models, model_parameter["model_class"])
+        # Validate the class before instantiating it, so a wrong class fails before the model is loaded.
+        if not issubclass(model_class, models.BaseModel):
+            raise ValueError(f"Model class {model_parameter['model_class']} is not a subclass of BaseModel.")
+        parameters = model_parameter["parameters"]
+        parameters.update({"logger": logger})
+        parameters.update({"prompt_template": utils.prompt_templates[parameters["prompt_template"]]})
+        model_ = model_class(**parameters)
+        for dataset_name, split_data in inference_data.items():
+            start = 0
+            for category, category_data in split_data.items():
+                if few_shot_args[dataset_name] and category_data["inference_kwargs"].get("few_shot_data", None) is None:
+                    raise Exception(f"Dataset {dataset_name} doesn't have few-shot data for category {category}!")
+                answers_to_dump = copy.deepcopy(category_data)
+                partition_size = len(category_data["data"]) // world_size
+                redundant = len(category_data["data"]) % world_size
+                # Ensure that the amount of data for inference is as consistent as possible across different processes.
+                lengths = [partition_size for _ in range(world_size)]
+                for j in range(redundant):
+                    lengths[(j + start) % world_size] += 1
+                start = (start + redundant) % world_size
+                questions = category_data["data"][sum(lengths[0:rank]) : sum(lengths[0:rank]) + lengths[rank]]
+                answers_per_rank = model_.inference(
+                    questions, inference_kwargs=category_data["inference_kwargs"], debug=debug_args[dataset_name]
+                )
+                answers_to_dump["data"] = answers_per_rank
+                utils.jdump(
+                    answers_to_dump,
+                    os.path.join(
+                        args.inference_save_path,
+                        model_name,
+                        f"{dataset_name}_{category}_inference_results_rank{rank}.json",
+                    ),
+                )
+        logger.info(f"Rank {rank} peak CUDA mem: {torch.cuda.max_memory_allocated()/1024**3:.3f} GB")
+        del model_
+        torch.cuda.empty_cache()
+    dist.barrier()
+    if rank == 0:
+        model_names = [model_parameter["name"] for model_parameter in model_parameters]
+        dataset_names = {key: list(inference_data[key].keys()) for key in inference_data}
+        rm_and_merge(world_size, args.inference_save_path, model_names, dataset_names)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="ColossalEval inference process.")
+    parser.add_argument("--config", type=str, default=None, required=True, help="path to config file")
+    parser.add_argument("--load_dataset", default=False, action="store_true")
+    parser.add_argument(
+        "--inference_save_path", type=str, default=None, required=True, help="path to save inference results"
+    )
+    args = parser.parse_args()
+    main(args)
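
The per-rank data split in `main` can be read on its own as follows. This is a minimal sketch of the same arithmetic; `partition_lengths` and `rank_slice` are hypothetical names introduced here for illustration and do not exist in the commit.

```python
from typing import List, Tuple


def partition_lengths(num_items: int, world_size: int, start: int = 0) -> Tuple[List[int], int]:
    """Split num_items across world_size ranks; return per-rank lengths and the next start offset."""
    base, extra = divmod(num_items, world_size)
    lengths = [base] * world_size
    # Hand the leftover items to consecutive ranks, beginning at `start`.
    for j in range(extra):
        lengths[(j + start) % world_size] += 1
    return lengths, (start + extra) % world_size


def rank_slice(lengths: List[int], rank: int) -> slice:
    """Slice of a category's data list that belongs to `rank`."""
    begin = sum(lengths[:rank])
    return slice(begin, begin + lengths[rank])
```

Carrying `start` from one category to the next rotates which ranks receive the leftover items, so no rank is systematically given the extra work. For example, with four ranks, a category of 10 items yields lengths `[3, 3, 2, 2]`, and a following category of 5 items yields `[1, 1, 2, 1]`.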
--- a/applications/ColossalEval/examples/dataset_evaluation/inference.sh
+++ b/applications/ColossalEval/examples/dataset_evaluation/inference.sh
+torchrun --nproc_per_node=1 inference.py \
+    --config "path to config file" \
+    --load_dataset \
+    --inference_save_path "path to save inference results"
--- a/applications/ColossalEval/examples/gpt_evaluation/config/evaluation/config.json
+++ b/applications/ColossalEval/examples/gpt_evaluation/config/evaluation/config.json
+{
+  "language": "en",
+  "category": {
+    "brainstorming": {
+      "GPT": [
+        "language organization",
+        "relevance",
+        "creativity",
+        "practicality",
+        "reasonableness"
+      ]
+    },
+    "chat": {
+      "GPT": [
+        "language organization",
+        "naturalness",
+        "engagingness",
+        "fidelity"
+      ]
+    },
+    "generation": {
+      "GPT": [
+        "language organization",
+        "relevance",
+        "diversity"
+      ]
+    },
+    "open_qa": {
+      "GPT": [
+        "language organization",
+        "relevance",
+        "correctness"
+      ]
+    },
+    "roleplay": {
+      "GPT": [
+        "language organization",
+        "relevance",
+        "fidelity",
+        "creativity"
+      ]
+    }
+  }
+}
--- a/applications/ColossalEval/examples/gpt_evaluation/config/inference/config.json
+++ b/applications/ColossalEval/examples/gpt_evaluation/config/inference/config.json
+{
+  "model": [
+    {
+      "name": "model name",
+      "model_class": "HuggingFaceCausalLM",
+      "parameters": {
+        "path": "path to model",
+        "model_max_length": 4096,
+        "tokenizer_path": "",
+        "tokenizer_kwargs": {
+          "trust_remote_code": true
+        },
+        "peft_path": null,
+        "model_kwargs": {
+          "torch_dtype": "torch.float32",
+          "trust_remote_code": true
+        },
+        "prompt_template": "plain",
+        "batch_size": 4
+      }
+    }
+  ],
+  "dataset": [
+    {
+      "name": "colossal",
+      "dataset_class": "ColossalDataset",
+      "debug": false,
+      "few_shot": false,
+      "path": "../../configs/gpt_evaluation/data/eval_en_examples.json",
+      "save_path": "path to save converted dataset (inference_data/colossal.json)"
+    }
+  ]
+}
--- a/applications/Chat/evaluate/eval.py
+++ b/applications/Chat/evaluate/eval.py
@@ -2,8 +2,8 @@ import argparse
 import os
 import openai
-from evaluator import Evaluator
+from colossal_eval.evaluate.evaluator import Evaluator
-from utils import jload
+from colossal_eval.utils import jload
 def main(args):
@@ -51,12 +51,19 @@ def main(args):
            gpt_evaluation_prompt,
            args.gpt_model,
            config["language"],
-            config.get("path_for_UniEval", None),
            args.gpt_with_reference,
        )
        if len(args.model_name_list) == 2:
-            answers1 = jload(args.answer_file_list[0])
+            answers_1 = jload(args.answer_file_list[0])
-            answers2 = jload(args.answer_file_list[1])
+            answers_2 = jload(args.answer_file_list[1])
+            answers1 = []
+            for category, value in answers_1.items():
+                answers1.extend(value["data"])
+            answers2 = []
+            for category, value in answers_2.items():
+                answers2.extend(value["data"])
            assert len(answers1) == len(answers2), "The number of answers for two models should be equal!"
@@ -66,9 +73,21 @@ def main(args):
            targets = jload(args.target_file)
            answers = jload(args.answer_file_list[0])
-            assert len(targets) == len(answers), "The number of target answers and model answers should be equal!"
+            references = []
+            for category, value in targets["test"].items():
+                references.extend(value["data"])
+            predictions = []
+            for category, value in answers.items():
+                predictions.extend(value["data"])
-            evaluator.evaluate(answers=answers, targets=targets)
+            assert len(references) == len(
+                predictions
+            ), "The number of target answers and model answers should be equal!"
+            evaluator.evaluate(
+                answers=predictions, targets=references, save_path=args.save_path, model_name=args.model_name_list[0]
+            )
            evaluator.save(args.save_path, args.model_name_list)
        else:
            raise ValueError("Unsupported number of answer files and model names!")
@@ -99,8 +118,8 @@ if __name__ == "__main__":
    )
    parser.add_argument(
        "--gpt_model",
-        default="gpt-3.5-turbo",
+        default="gpt-3.5-turbo-16k",
-        choices=["text-davinci-003", "gpt-3.5-turbo", "gpt-4"],
+        choices=["text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4"],
        help="which GPT model to use for evaluation",
    )
    parser.add_argument(
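
The hunks above repeat one pattern three times: results are now stored per category as `{category: {"data": [...]}}`, and each is flattened into a single list before comparison. A minimal sketch of that step is shown below; `flatten_categories` is a hypothetical helper, not something this commit defines.

```python
from typing import Any, Dict, List


def flatten_categories(results: Dict[str, Dict[str, Any]]) -> List[Any]:
    """Concatenate the "data" list of every category, preserving category order."""
    flat: List[Any] = []
    for value in results.values():
        flat.extend(value["data"])
    return flat
```

Under this sketch, `answers1 = flatten_categories(answers_1)` and `references = flatten_categories(targets["test"])` would stand in for the explicit loops.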

--- a/applications/Chat/evaluate/eval.sh
+++ b/applications/Chat/evaluate/eval.sh
--- a/applications/ColossalEval/examples/gpt_evaluation/inference.py
+++ b/applications/ColossalEval/examples/gpt_evaluation/inference.py
+import argparse
+import copy
+import os
+from typing import Dict, List
+import torch
+import torch.distributed as dist
+from colossal_eval import dataset, models, utils
+import colossalai
+from colossalai.logging import get_dist_logger
+logger = get_dist_logger()
+def rm_and_merge(world_size: int, save_path: str, model_names: List[str], dataset_names: Dict[str, List]) -> None:
+    """
+    Remove inference result per rank and merge them into one file.
+    Args:
+        world_size: Number of processes for inference.
+        save_path: The folder for storing inference results.
+        model_names: Names of models for inference.
+        dataset_names: Names of dataset for inference.
+    """
+    for model_name in model_names:
+        for dataset_name, categories in dataset_names.items():
+            all_answers = {}
+            for category in categories:
+                all_answers[category] = {"data": []}
+                answers = {"data": []}
+                for r in range(world_size):
+                    directory = os.path.join(
+                        save_path, model_name, f"{dataset_name}_{category}_inference_results_rank{r}.json"
+                    )
+                    if not os.path.exists(directory):
+                        raise Exception(
+                            f"File {directory} not found. There may have been an error during inference."
+                        )
+                    else:
+                        rank_answers = utils.jload(directory)
+                        answers["data"].extend(rank_answers["data"])
+                        answers["inference_kwargs"] = rank_answers["inference_kwargs"]
+                for r in range(world_size):
+                    try:
+                        directory = os.path.join(
+                            save_path, model_name, f"{dataset_name}_{category}_inference_results_rank{r}.json"
+                        )
+                        os.remove(directory)
+                    except Exception as e:
+                        print(e)
+                all_answers[category] = answers
+            logger.info(f"Save inference results of model {model_name} on dataset {dataset_name}.")
+            utils.jdump(all_answers, os.path.join(save_path, model_name, f"{dataset_name}_inference_results.json"))
+        logger.info(f"Save inference results of model {model_name} for all datasets.")
+    logger.info("Save inference results of all models for all datasets.")
+def main(args):
+    colossalai.launch_from_torch(config={}, seed=42)
+    world_size = dist.get_world_size()
+    rank = dist.get_rank()
+    inference_data = {}
+    debug_args = {}
+    few_shot_args = {}
+    config = utils.jload(args.config)
+    model_parameters = config["model"]
+    dataset_parameters = config["dataset"]
+    for dataset_parameter in dataset_parameters:
+        path = dataset_parameter["path"]
+        save_path = dataset_parameter["save_path"]
+        dataset_name = dataset_parameter["name"]
+        debug_args[dataset_name] = dataset_parameter["debug"]
+        few_shot_args[dataset_name] = dataset_parameter["few_shot"]
+        if not args.load_dataset:
+            if os.path.exists(save_path):
+                dataset_ = utils.jload(save_path)
+                inference_data[dataset_name] = dataset_["test"]
+            else:
+                raise Exception(
+                    "Can't find the converted dataset. Pass --load_dataset to convert and store the dataset first."
+                )
+            continue
+        dataset_class = getattr(dataset, dataset_parameter["dataset_class"])
+        if not issubclass(dataset_class, dataset.BaseDataset):
+            raise ValueError(f"Dataset class {dataset_parameter['dataset_class']} is not a subclass of BaseDataset.")
+        dataset_ = dataset_class(path, logger, dataset_parameter["few_shot"])
+        dataset_.save(save_path)
+        inference_data[dataset_name] = dataset_.dataset["test"]
+    for model_parameter in model_parameters:
+        model_name = model_parameter["name"]
+        model_class = getattr(models, model_parameter["model_class"])
+        # Validate the class before instantiating it, so a wrong class fails before the model is loaded.
+        if not issubclass(model_class, models.BaseModel):
+            raise ValueError(f"Model class {model_parameter['model_class']} is not a subclass of BaseModel.")
+        parameters = model_parameter["parameters"]
+        parameters.update({"logger": logger})
+        parameters.update({"prompt_template": utils.prompt_templates[parameters["prompt_template"]]})
+        model_ = model_class(**parameters)
+        for dataset_name, split_data in inference_data.items():
+            start = 0
+            for category, category_data in split_data.items():
+                if few_shot_args[dataset_name] and category_data["inference_kwargs"].get("few_shot_data", None) is None:
+                    raise Exception(f"Dataset {dataset_name} doesn't have few-shot data for category {category}!")
+                answers_to_dump = copy.deepcopy(category_data)
+                partition_size = len(category_data["data"]) // world_size
+                redundant = len(category_data["data"]) % world_size
+                # Ensure that the amount of data for inference is as consistent as possible across different processes.
+                lengths = [partition_size for _ in range(world_size)]
+                for j in range(redundant):
+                    lengths[(j + start) % world_size] += 1
+                start = (start + redundant) % world_size
+                questions = category_data["data"][sum(lengths[0:rank]) : sum(lengths[0:rank]) + lengths[rank]]
+                answers_per_rank = model_.inference(
+                    questions, inference_kwargs=category_data["inference_kwargs"], debug=debug_args[dataset_name]
+                )
+                answers_to_dump["data"] = answers_per_rank
+                utils.jdump(
+                    answers_to_dump,
+                    os.path.join(
+                        args.inference_save_path,
+                        model_name,
+                        f"{dataset_name}_{category}_inference_results_rank{rank}.json",
+                    ),
+                )
+        logger.info(f"Rank {rank} peak CUDA mem: {torch.cuda.max_memory_allocated()/1024**3:.3f} GB")
+        del model_
+        torch.cuda.empty_cache()
+    dist.barrier()
+    if rank == 0:
+        model_names = [model_parameter["name"] for model_parameter in model_parameters]
+        dataset_names = {key: list(inference_data[key].keys()) for key in inference_data}
+        rm_and_merge(world_size, args.inference_save_path, model_names, dataset_names)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="ColossalEval inference process.")
+    parser.add_argument("--config", type=str, default=None, required=True, help="path to config file")
+    parser.add_argument("--load_dataset", default=False, action="store_true")
+    parser.add_argument(
+        "--inference_save_path", type=str, default=None, required=True, help="path to save inference results"
+    )
+    args = parser.parse_args()
+    main(args)
--- a/applications/ColossalEval/examples/gpt_evaluation/inference.sh
+++ b/applications/ColossalEval/examples/gpt_evaluation/inference.sh
+torchrun --nproc_per_node=1 inference.py \
+    --config "path to config file" \
+    --load_dataset \
+    --inference_save_path "path to save inference results"
--- a/applications/Chat/evaluate/requirements.txt
+++ b/applications/Chat/evaluate/requirements.txt
+transformers>=4.32.0
+colossalai>=0.3.1
+peft
+tabulate
 jieba
-bert-score
+fuzzywuzzy
-rouge_chinese
+rouge
-scikit-metrics
-nltk
 openai
-seaborn
-pandas
 matplotlib
-numpy
+pandas
-zhon
+seaborn
-rouge_score
+scikit-learn
--- a/applications/ColossalEval/setup.py
+++ b/applications/ColossalEval/setup.py
+from setuptools import find_packages, setup
+def fetch_requirements(path):
+    with open(path, "r") as fd:
+        return [r.strip() for r in fd.readlines()]
+def fetch_readme():
+    with open("README.md", encoding="utf-8") as f:
+        return f.read()
+setup(
+    name="colossal_eval",
+    version="0.0.1",
+    packages=find_packages(exclude=["examples", "*.egg-info"]),
+    description="Colossal-AI LLM-Evaluation Framework",
+    long_description=fetch_readme(),
+    long_description_content_type="text/markdown",
+    license="Apache Software License 2.0",
+    url="https://github.com/hpcaitech/LLM-Evaluation",
+    install_requires=fetch_requirements("requirements.txt"),
+    python_requires=">=3.6",
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: Apache Software License",
+        "Environment :: GPU :: NVIDIA CUDA",
+        "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    ],
+)
--- a/applications/README.md
+++ b/applications/README.md
@@ -5,6 +5,7 @@ This directory contains the applications that are powered by Colossal-AI.
 The list of applications include:
 - [X] [Colossal-LLaMA-2](./Colossal-LLaMA-2/): Continual Pre-training of LLaMA-2.
+- [X] [ColossalEval](./ColossalEval): Evaluation Pipeline for LLMs.
 - [X] [Chatbot](./Chat/README.md): Replication of ChatGPT with RLHF.
 - [X] [FastFold](https://github.com/hpcaitech/FastFold): Optimizing AlphaFold (Biomedicine) Training and Inference on GPU Clusters.