Unverified Commit 7f698a5a authored by Timur Aysin, committed by GitHub

Fix LongBench Evaluation (#3273)



* fix: set 'do_sample=False' and use double quotes in 'doc_to_text'

* feat: update versions and README for longbench

* pacify pre-commit

---------
Co-authored-by: Baber <baber@hey.com>
parent 0c134ee9
@@ -5,17 +5,17 @@ task: longbench_2wikimqa
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: 2wikimqa
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_2wikimqa_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: 2wikimqa_e
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -101,4 +101,7 @@ If other tasks on this dataset are already supported:
 ### Changelog
 v2.: fix doc_to_target; add vcsum
 v3: properly use all answers for metric calculation; trim whitespace from resps; fix stop sequences not parsing correctly.
+v4: fixed special characters in prompts; use greedy decoding by default.
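The quoting change behind the v4 "special characters" fix comes down to standard YAML semantics: inside single-quoted scalars, `\n` is two literal characters (a backslash and an `n`), while inside double-quoted scalars it is parsed as a real newline. A minimal illustration of that difference (generic YAML, not a fragment from this repo):

```yaml
# Single quotes: the model would see a literal backslash-n in the prompt.
single_quoted: 'Question: {{input}}\nAnswer:'
# Double quotes: \n is decoded to an actual line break, as the prompts intend.
double_quoted: "Question: {{input}}\nAnswer:"
```

This is why every `doc_to_text` in this commit switches from single to double quotes.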
@@ -149,7 +149,7 @@ task: {{ task }}
 dataset_path: {{ dataset_path }}
 test_split: {{ test_split }}
 dataset_name: {{ dataset_name }}
-doc_to_text: '{{ doc_to_text }}'
+doc_to_text: "{{ doc_to_text }}"
 doc_to_target: '{{ doc_to_target }}'
 process_results: {{ process_results }}
 generation_kwargs:
@@ -180,13 +180,14 @@ if __name__ == "__main__":
     generation_kwargs = {
         "max_gen_toks": dataset2maxlen[df],
         "temperature": 1,
-        "do_sample": True,
+        "do_sample": False,
         # We'll handle the until value directly in the template
     }
     raw_doc_to_text = (
         dataset2prompt[df]
         .replace("\n", "\\n")
+        .replace('"', '\\"')
         .replace("{", "{{")
         .replace("}", "}}")
     )
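A minimal sketch of the escaping round trip the generator performs above; the sample prompt here is hypothetical, not one of the LongBench prompts:

```python
# Newlines and double quotes are backslash-escaped so the prompt can sit on
# one line inside a double-quoted YAML scalar; single braces are doubled so
# dataset placeholders like {input} come out as Jinja-style {{input}}.
raw = 'Answer "briefly".\nQuestion: {input}\nAnswer:'
escaped = (
    raw.replace("\n", "\\n")
    .replace('"', '\\"')
    .replace("{", "{{")
    .replace("}", "}}")
)
print(escaped)  # Answer \"briefly\".\nQuestion: {{input}}\nAnswer:
```

The added `.replace('"', '\\"')` step is what keeps quotes inside a prompt from terminating the double-quoted YAML string early.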
@@ -210,7 +211,7 @@ if __name__ == "__main__":
         "generation_kwargs": generation_kwargs,
         "has_newline": has_newline,  # Add the flag to the template context
         "metric_list": metric_list,
-        "metadata": {"version": "3.0"},
+        "metadata": {"version": "4.0"},
     }
     # Render template
@@ -5,17 +5,17 @@ task: longbench_dureader
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: dureader
-doc_to_text: '请基于给定的文章回答下述问题。\n\n文章:{{context}}\n\n请基于上述文章回答下面的问题。\n\n问题:{{input}}\n回答:'
+doc_to_text: "请基于给定的文章回答下述问题。\n\n文章:{{context}}\n\n请基于上述文章回答下面的问题。\n\n问题:{{input}}\n回答:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_zh_score
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_zh_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_gov_report
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: gov_report
-doc_to_text: 'You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:'
+doc_to_text: "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_gov_report_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: gov_report_e
-doc_to_text: 'You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:'
+doc_to_text: "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{{context}}\n\nNow, write a one-page summary of the report.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_hotpotqa
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: hotpotqa
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_hotpotqa_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: hotpotqa_e
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_lcc
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: lcc
-doc_to_text: 'Please complete the code given below. \n{{context}}Next line of code:\n'
+doc_to_text: "Please complete the code given below. \n{{context}}Next line of code:\n"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_code_sim_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "code_sim_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_lcc_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: lcc_e
-doc_to_text: 'Please complete the code given below. \n{{context}}Next line of code:\n'
+doc_to_text: "Please complete the code given below. \n{{context}}Next line of code:\n"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_code_sim_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "code_sim_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_lsht
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: lsht
-doc_to_text: '请判断给定新闻的类别,下面是一些例子。\n\n{{context}}\n{{input}}'
+doc_to_text: "请判断给定新闻的类别,下面是一些例子。\n\n{{context}}\n{{input}}"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_classification_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: ["\n"]
 metric_list:
   - metric: "classification_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multi_news
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multi_news
-doc_to_text: 'You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:'
+doc_to_text: "You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multi_news_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multi_news_e
-doc_to_text: 'You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:'
+doc_to_text: "You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_rouge_score
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "rouge_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multifieldqa_en
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_en
-doc_to_text: 'Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multifieldqa_en_e
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_en_e
-doc_to_text: 'Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_multifieldqa_zh
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_zh
-doc_to_text: '阅读以下文字并用中文简短回答:\n\n{{context}}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{{input}}\n回答:'
+doc_to_text: "阅读以下文字并用中文简短回答:\n\n{{context}}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{{input}}\n回答:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_zh_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_zh_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_musique
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: musique
-doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
+doc_to_text: "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_narrativeqa
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: narrativeqa
-doc_to_text: 'You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {{context}}\n\nNow, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
+doc_to_text: "You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {{context}}\n\nNow, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:"
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_qa_f1_score
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "qa_f1_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0
@@ -5,17 +5,17 @@ task: longbench_passage_count
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: passage_count
-doc_to_text: 'There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{{context}}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: '
+doc_to_text: "There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{{context}}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: "
 doc_to_target: '{{answers}}'
 process_results: !function metrics.get_count_score
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
-  do_sample: True
+  do_sample: False
   until: []
 metric_list:
   - metric: "count_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 3.0
+  version: 4.0