Commit 88486e57 authored by lintangsutawika

Merge branch 'group-agg-rework' of https://github.com/EleutherAI/lm-evaluation-harness into multiprompt
parents 5971f2ca ba73d131
@@ -54,6 +54,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on prehistory
for help. Pacal built the temples as the funerary monument to legitimize his
kingship. The answer is (D).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_prehistory
@@ -58,6 +58,6 @@ fewshot_config:
for help. Among the four transactions, only Proceeds from long-term debt belongs
to the financing activities section of cashflow, hence the amount reported should
be $100000. The answer is (D).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_accounting
@@ -117,6 +117,6 @@ fewshot_config:
a due process clause. Hence the strongest argument should be the statute is
overbroad and consequently invalid under the First and Fourteenth Amendments.
The answer is (D).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_law
@@ -77,6 +77,6 @@ fewshot_config:
for help. The symptoms and the adrenal mass suggested pheochromocytoma, and
the blood pressure indicates hypertension. Phenoxybenzamine is used to treat
hypertension caused by pheochromocytoma. The answer is (D).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_medicine
@@ -57,6 +57,6 @@ fewshot_config:
for help. Based on the circumstances, you should tell your client about the
pros and cons of each program, but it would be inappropriate to receive the
bonus, so you should not claim the $50 bonus. The answer is (D).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_psychology
@@ -50,6 +50,6 @@ fewshot_config:
for help. If a public relations media practitioner does not know the answer
to a reporter''s question, they should say ''I don''t know'' and offer to provide
the information later. The answer is (C).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_public_relations
@@ -99,6 +99,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on security
studies for help. Coercive diplomacy uses the threat of force to induce the
opponent to comply with demands. The answer is (B).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_security_studies
@@ -53,6 +53,6 @@ fewshot_config:
for help. The post-war welfare state of 1948 aimed to provide free healthcare
and education, full employment, and universal welfare. But it did not aim to
provide a minimum wage. The answer is (B).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_sociology
@@ -51,6 +51,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on us foreign
policy for help. The 2008 financial crisis damanged the international reputation
of the American model of political economy and capitalism. The answer is (A).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_us_foreign_policy
@@ -40,6 +40,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on virology
for help. Paroviruses are highly impactful because they do not have nucleic
acid. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_virology
@@ -37,6 +37,6 @@ fewshot_config:
target: 'Let''s think step by step. We refer to Wikipedia articles on world religions
for help. In Judaism, the most distinctive sign of the covenant is circumcision
(brit milah). The answer is (B).'
-group: mmlu_flan_cot_fewshot_humanities
+tag: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_world_religions
group: mmlu_flan_cot_zeroshot
group_alias: mmlu (flan style, zeroshot cot)
task:
- mmlu_flan_cot_zeroshot_stem
- mmlu_flan_cot_zeroshot_other
- mmlu_flan_cot_zeroshot_social_sciences
- mmlu_flan_cot_zeroshot_humanities
group_config:
aggregate_metric: True
weight_by_size: True
- group: stem
task:
- mmlu_flan_cot_zeroshot_stem
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: other
task:
- mmlu_flan_cot_zeroshot_other
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: social sciences
task:
- mmlu_flan_cot_zeroshot_social_sciences
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: humanities
task:
- mmlu_flan_cot_zeroshot_humanities
aggregate_metric_list:
- metric: acc
weight_by_size: True
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 1
@@ -34,3 +34,5 @@ metric_list:
ignore_punctuation: true
metadata:
version: 2.0
+dataset_kwargs:
+trust_remote_code: true
"dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n"
-"group": "mmlu_flan_cot_zeroshot_stem"
+"tag": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_abstract_algebra"
"dataset_name": "anatomy"
"description": "The following are multiple choice questions (with answers) about anatomy.\n\
\n"
-"group": "mmlu_flan_cot_zeroshot_stem"
+"tag": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_anatomy"
"dataset_name": "astronomy"
"description": "The following are multiple choice questions (with answers) about astronomy.\n\
\n"
-"group": "mmlu_flan_cot_zeroshot_stem"
+"tag": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_astronomy"
"dataset_name": "business_ethics"
"description": "The following are multiple choice questions (with answers) about business\
\ ethics.\n\n"
-"group": "mmlu_flan_cot_zeroshot_other"
+"tag": "mmlu_flan_cot_zeroshot_other"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_business_ethics"
"dataset_name": "clinical_knowledge"
"description": "The following are multiple choice questions (with answers) about clinical\
\ knowledge.\n\n"
-"group": "mmlu_flan_cot_zeroshot_other"
+"tag": "mmlu_flan_cot_zeroshot_other"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_clinical_knowledge"
"dataset_name": "college_biology"
"description": "The following are multiple choice questions (with answers) about college\
\ biology.\n\n"
-"group": "mmlu_flan_cot_zeroshot_stem"
+"tag": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_biology"
"dataset_name": "college_chemistry"
"description": "The following are multiple choice questions (with answers) about college\
\ chemistry.\n\n"
-"group": "mmlu_flan_cot_zeroshot_stem"
+"tag": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_chemistry"