gaoqiong / lm-evaluation-harness
Commit da211969 (Unverified)
Authored Jun 28, 2024 by Jess; committed by GitHub on Jun 28, 2024
Merge branch 'EleutherAI:main' into main
Parents: 1b97e487, 801322e0
Changes: 654
Showing 20 changed files with 1061 additions and 787 deletions
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_miscellaneous.yaml  +43 -27
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_disputes.yaml  +64 -49
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_scenarios.yaml  +62 -44
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_nutrition.yaml  +63 -48
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_philosophy.yaml  +44 -30
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_prehistory.yaml  +59 -42
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_accounting.yaml  +63 -48
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_law.yaml  +122 -106
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_medicine.yaml  +82 -70
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_psychology.yaml  +62 -48
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_public_relations.yaml  +55 -39
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_security_studies.yaml  +104 -86
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_sociology.yaml  +58 -43
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml  +56 -40
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml  +45 -31
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_world_religions.yaml  +42 -27
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml  +4 -2
lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml  +7 -2
lm_eval/tasks/mmlu/flan_n_shot/loglikelihood/_mmlu_flan_loglikelihood_template_yaml  +6 -5
lm_eval/tasks/mmlu/generative/_default_template_yaml  +20 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_miscellaneous.yaml
-"dataset_name": "miscellaneous"
-"description": "The following are multiple choice questions (with answers) about miscellaneous.\n\
-  \nQ: Which of these songs was a Top 10 hit for the rock band The Police?\n(A) 'Radio\
-  \ Ga-Ga' (B) 'Ob-la-di Ob-la-da' (C) 'De Do Do Do De Da Da Da' (D) 'In-a-Gadda-Da-Vida'\n\
-  A: Let's think step by step. We refer to Wikipedia for help. Radio Ga-Ga is by Queen.\
-  \ Ob-la-di Ob-la-da is by The Beatles. And In-a-Gadda-Da-Vida is by Iron Butterfly.\
-  \ Leaving 'De Do Do Do De Da Da Da' as the only song by The Police, and also a Top\
-  \ 10 hit. The answer is (C).\n\nQ: What place is named in the title of the 1979\
-  \ live album by rock legends Cheap Trick?\n(A) Budapest (B) Budokan (C) Bhutan (D)\
-  \ Britain\nA: Let's think step by step. We refer to Wikipedia for help. Nippon Budokan\
-  \ is an indoor arena in Tokyo, Japan renowned for hosting rock music concerts including\
-  \ Cheap Trick in 1978. 'Cheap Trick at Budokan' became the name of their album.\
-  \ The answer is (B).\n\nQ: What is produced during photosynthesis?\n(A) hydrogen\
-  \ (B) nylon (C) oxygen (D) light\nA: Let's think step by step. We refer to Wikipedia\
-  \ for help. Photosynthesis is the process in which green plants use the green pigment\
-  \ chlorophyll to synthesize foods with water and carbon dioxide. Oxygen is the byproduct\
-  \ of this process. The answer is (C).\n\nQ: Who is the shortest man to ever win\
-  \ an NBA slam dunk competition?\n(A) Anthony 'Spud' Webb (B) Michael 'Air' Jordan\
-  \ (C) Tyrone 'Muggsy' Bogues (D) Julius 'Dr J' Erving\nA: Let's think step by step.\
-  \ We refer to Wikipedia for help. In 1986, Spud Webb, standing only 5'7\" became\
-  \ the shortest NBA player in history to win an official slam dunk contest. The answer\
-  \ is (A).\n\nQ: How many axles does a standard automobile have?\n(A) one (B) two\
-  \ (C) four (D) eight\nA: Let's think step by step. We refer to Wikipedia for help.\
-  \ Most cars have two axles to rotate the wheels.. The answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_other"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_miscellaneous"
+dataset_name: miscellaneous
+description: The following are multiple choice questions (with answers) about miscellaneous.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Which of these songs was a Top 10 hit for the rock band The Police?
+
+      (A) ''Radio Ga-Ga'' (B) ''Ob-la-di Ob-la-da'' (C) ''De Do Do Do De Da Da Da''
+      (D) ''In-a-Gadda-Da-Vida'''
+    target: Let's think step by step. We refer to Wikipedia for help. Radio Ga-Ga
+      is by Queen. Ob-la-di Ob-la-da is by The Beatles. And In-a-Gadda-Da-Vida is
+      by Iron Butterfly. Leaving 'De Do Do Do De Da Da Da' as the only song by The
+      Police, and also a Top 10 hit. The answer is (C).
+  - question: 'What place is named in the title of the 1979 live album by rock legends
+      Cheap Trick?
+
+      (A) Budapest (B) Budokan (C) Bhutan (D) Britain'
+    target: Let's think step by step. We refer to Wikipedia for help. Nippon Budokan
+      is an indoor arena in Tokyo, Japan renowned for hosting rock music concerts
+      including Cheap Trick in 1978. 'Cheap Trick at Budokan' became the name of their
+      album. The answer is (B).
+  - question: 'What is produced during photosynthesis?
+
+      (A) hydrogen (B) nylon (C) oxygen (D) light'
+    target: Let's think step by step. We refer to Wikipedia for help. Photosynthesis
+      is the process in which green plants use the green pigment chlorophyll to synthesize
+      foods with water and carbon dioxide. Oxygen is the byproduct of this process.
+      The answer is (C).
+  - question: 'Who is the shortest man to ever win an NBA slam dunk competition?
+
+      (A) Anthony ''Spud'' Webb (B) Michael ''Air'' Jordan (C) Tyrone ''Muggsy'' Bogues
+      (D) Julius ''Dr J'' Erving'
+    target: Let's think step by step. We refer to Wikipedia for help. In 1986, Spud
+      Webb, standing only 5'7" became the shortest NBA player in history to win an
+      official slam dunk contest. The answer is (A).
+  - question: 'How many axles does a standard automobile have?
+
+      (A) one (B) two (C) four (D) eight'
+    target: 'Let''s think step by step. We refer to Wikipedia for help. Most cars
+      have two axles to rotate the wheels.. The answer is (B).'
+group: mmlu_flan_cot_fewshot_other
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_miscellaneous
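The change in this file moves the few-shot examples out of one long `description` string and into a structured `fewshot_config` with `sampler: first_n` and a list of `samples`. As a rough illustration only, the Python sketch below shows how such a list could be turned back into the old inline "Q: ... A: ..." layout. The function `build_fewshot_prompt` and its parameters are hypothetical and are not part of lm-evaluation-harness; the actual rendering is decided by the included template, which this diff does not show. The sketch also assumes `first_n` means "take the leading n samples in file order".

```python
from typing import Dict, List


def build_fewshot_prompt(
    description: str,
    samples: List[Dict[str, str]],
    n: int,
    new_question: str,
) -> str:
    """Hypothetical renderer: join the first n samples into Q/A blocks."""
    # Assumption: "first_n" takes the leading n samples in file order.
    shots = samples[:n]
    blocks = [f"Q: {s['question']}\nA: {s['target']}" for s in shots]
    # The question to be answered goes last, with an open "A:" slot.
    blocks.append(f"Q: {new_question}\nA:")
    # Blank lines separate the description and each block, as in the old string.
    return "\n\n".join([description.strip()] + blocks)


# Two samples copied from the new YAML; the blank line inside a single-quoted
# YAML scalar loads as a single newline between question and options.
samples = [
    {
        "question": "What is produced during photosynthesis?\n"
        "(A) hydrogen (B) nylon (C) oxygen (D) light",
        "target": "Let's think step by step. We refer to Wikipedia for help. "
        "Photosynthesis is the process in which green plants use the green "
        "pigment chlorophyll to synthesize foods with water and carbon dioxide. "
        "Oxygen is the byproduct of this process. The answer is (C).",
    },
    {
        "question": "How many axles does a standard automobile have?\n"
        "(A) one (B) two (C) four (D) eight",
        "target": "Let's think step by step. We refer to Wikipedia for help. "
        "Most cars have two axles to rotate the wheels.. The answer is (B).",
    },
]

if __name__ == "__main__":
    description = (
        "The following are multiple choice questions (with answers) "
        "about miscellaneous."
    )
    print(build_fewshot_prompt(description, samples, 1, "Example question?"))
```

Keeping the examples as a list means the number of shots can be chosen when the prompt is built, whereas the old `description` string always carried all five.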
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_disputes.yaml
-"dataset_name": "moral_disputes"
-"description": "The following are multiple choice questions (with answers) about moral\
-  \ disputes.\n\nQ: Baron admits that the versions of the ticking bomb hypothetical\
-  \ she discusses are \"stunningly stupid,\" but she claims this is actually evidence\
-  \ of\n(A) the stupidity of most traditional philosophical examples. (B) a general\
-  \ lack of intelligence among people with advanced degrees. (C) the wrongness of\
-  \ torture. (D) the readiness on the part of many intelligent people to see torture\
-  \ as the best solution to deal with terrorism.\nA: Let's think step by step. We\
-  \ refer to Wikipedia articles on moral disputes for help. The ticking bomb hypothetical\
-  \ poses a problem where many people will die to an exploding bomb, if the hypothetical\
-  \ terrorist does not disclose how to defuse it. Baron sees this hypothetical as\
-  \ silly, but its prevalence does suggest intelligent people, particularly utilitarians,\
-  \ see torture as justifiable to save the lives in this scenario. The answer is (D).\n\
-  \nQ: A fertilized ovum is also known as\n(A) a zygote. (B) an embryo. (C) a viability.\
-  \ (D) a blastocyst.\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on moral disputes for help. Once a single sperm penetrates the layers of an egg\
-  \ to form a new cell, that cell is called a zygote. The answer is (A).\n\nQ: Pence\
-  \ compares six different cases of reproduction, from natural twinning to SCNT. What\
-  \ conclusion does he draw from this comparison?\n(A) SCNT is not a different kind\
-  \ of reproduction because there are no morally relevant differences between it and\
-  \ other permissible means of reproduction. (B) Because there is a low risk of harm\
-  \ for natural twinning, there will be a low risk of harm for SCNT. (C) both A and\
-  \ B (D) neither A nor B\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on moral disputes for help. SCNT, known as 'somatic cell nuclear transfer', is\
-  \ a technique in which the nucleus of a somatic (body) cell is transferred to the\
-  \ cytoplasm of an enucleated egg. This is used in farm animal cloning. Pence may\
-  \ compare this to other processes of reproduction, and see little moral separation.\
-  \ The answer is (A).\n\nQ: Which of the following is an intrinsic good, according\
-  \ to Singer?\n(A) being in the upper class (B) the ability to learn and solve problems\
-  \ (C) being more beautiful than most of the population (D) being shorter than average\n\
-  A: Let's think step by step. We refer to Wikipedia articles on moral disputes for\
-  \ help. Bioethicist Peter Singer sees intrinsic value as innate values conferred\
-  \ by oneself, for oneself. Innanimate objects can be beautiful, short, or have some\
-  \ valuable criteria, but capabilities are intrinsically good. The answer is (B).\n\
-  \nQ: According to Metz, what is wrong with consequentialist arguments against capital\
-  \ punishment based on African values?\n(A) It is unclear as of yet whether or not\
-  \ capital punishment deters harm to the community. (B) It is unclear as of yet whether\
-  \ or not capital punishment deters harm to any individuals. (C) Consequentialism\
-  \ is not supported by African values. (D) Even though consequentialism is supported\
-  \ by African values, no consequentialist arguments framed in terms of African values\
-  \ have been offered.\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on moral disputes for help. Thaddeus Metz is a humanities research professor focusing\
-  \ on ethical philosophy in South Africa. Metz has written the death penalty is unjustified\
-  \ as it treats individuals as incapable of communal relations. It is unclear that\
-  \ capital punishment is to the benefit of, or a deterrent of harm to the community.\
-  \ The answer is (A).\n\n"
-"group": "mmlu_flan_cot_fewshot_humanities"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_moral_disputes"
+dataset_name: moral_disputes
+description: The following are multiple choice questions (with answers) about moral
+  disputes.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Baron admits that the versions of the ticking bomb hypothetical she discusses
+      are "stunningly stupid," but she claims this is actually evidence of
+
+      (A) the stupidity of most traditional philosophical examples. (B) a general
+      lack of intelligence among people with advanced degrees. (C) the wrongness of
+      torture. (D) the readiness on the part of many intelligent people to see torture
+      as the best solution to deal with terrorism.'
+    target: Let's think step by step. We refer to Wikipedia articles on moral disputes
+      for help. The ticking bomb hypothetical poses a problem where many people will
+      die to an exploding bomb, if the hypothetical terrorist does not disclose how
+      to defuse it. Baron sees this hypothetical as silly, but its prevalence does
+      suggest intelligent people, particularly utilitarians, see torture as justifiable
+      to save the lives in this scenario. The answer is (D).
+  - question: 'A fertilized ovum is also known as
+
+      (A) a zygote. (B) an embryo. (C) a viability. (D) a blastocyst.'
+    target: Let's think step by step. We refer to Wikipedia articles on moral disputes
+      for help. Once a single sperm penetrates the layers of an egg to form a new
+      cell, that cell is called a zygote. The answer is (A).
+  - question: 'Pence compares six different cases of reproduction, from natural twinning
+      to SCNT. What conclusion does he draw from this comparison?
+
+      (A) SCNT is not a different kind of reproduction because there are no morally
+      relevant differences between it and other permissible means of reproduction.
+      (B) Because there is a low risk of harm for natural twinning, there will be
+      a low risk of harm for SCNT. (C) both A and B (D) neither A nor B'
+    target: Let's think step by step. We refer to Wikipedia articles on moral disputes
+      for help. SCNT, known as 'somatic cell nuclear transfer', is a technique in
+      which the nucleus of a somatic (body) cell is transferred to the cytoplasm of
+      an enucleated egg. This is used in farm animal cloning. Pence may compare this
+      to other processes of reproduction, and see little moral separation. The answer
+      is (A).
+  - question: 'Which of the following is an intrinsic good, according to Singer?
+
+      (A) being in the upper class (B) the ability to learn and solve problems (C)
+      being more beautiful than most of the population (D) being shorter than average'
+    target: Let's think step by step. We refer to Wikipedia articles on moral disputes
+      for help. Bioethicist Peter Singer sees intrinsic value as innate values conferred
+      by oneself, for oneself. Innanimate objects can be beautiful, short, or have
+      some valuable criteria, but capabilities are intrinsically good. The answer
+      is (B).
+  - question: 'According to Metz, what is wrong with consequentialist arguments against
+      capital punishment based on African values?
+
+      (A) It is unclear as of yet whether or not capital punishment deters harm to
+      the community. (B) It is unclear as of yet whether or not capital punishment
+      deters harm to any individuals. (C) Consequentialism is not supported by African
+      values. (D) Even though consequentialism is supported by African values, no
+      consequentialist arguments framed in terms of African values have been offered.'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on moral disputes
+      for help. Thaddeus Metz is a humanities research professor focusing on ethical
+      philosophy in South Africa. Metz has written the death penalty is unjustified
+      as it treats individuals as incapable of communal relations. It is unclear that
+      capital punishment is to the benefit of, or a deterrent of harm to the community.
+      The answer is (A).'
+group: mmlu_flan_cot_fewshot_humanities
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_moral_disputes
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_scenarios.yaml
-"dataset_name": "moral_scenarios"
-"description": "The following are multiple choice questions (with answers) about moral\
-  \ scenarios.\n\nQ: For which of these two scenarios does the main character (who\
-  \ uses I/me/my) do something clearly morally wrong, according to ordinary moral\
-  \ standards in the US as of 2020? Scenario 1 | I took his wallet after he harassed\
-  \ me at the bar. Scenario 2 | I made love to my newborn baby boy.\n(A) Wrong, Wrong\
-  \ (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong\nA: Let's think\
-  \ step by step. We refer to Wikipedia articles on moral scenarios for help. Stealing\
-  \ from someone is not considered an ethical response to harrassment. And having\
-  \ sex with a minor is egregiously unethical. The answer is (A).\n\nQ: For which\
-  \ of these two scenarios does the main character (who uses I/me/my) do something\
-  \ clearly morally wrong, according to ordinary moral standards in the US as of 2020?\
-  \ Scenario 1 | I asked for a cigarette from the guy outside the store. Scenario\
-  \ 2 | When the security guard left his post, I pocketed a rare coin.\n(A) Wrong,\
-  \ Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on moral scenarios for help.\
-  \ Asking for something is not morally wrong. Stealing is morally wrong. The answer\
-  \ is (C).\n\nQ: For which of these two scenarios does the main character (who uses\
-  \ I/me/my) do something clearly morally wrong, according to ordinary moral standards\
-  \ in the US as of 2020? Scenario 1 | During the house's demolition, I pushed him\
-  \ into a wall. Scenario 2 | When I choose a babysitter for my twin nieces, I don't\
-  \ care about her racial background.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not\
-  \ wrong, Wrong (D) Not wrong, Not wrong\nA: Let's think step by step. We refer to\
-  \ Wikipedia articles on moral scenarios for help. Physical violence is morally wrong.\
-  \ Being open to hiring people from different racial backgrounds is an expected moral\
-  \ stance. The answer is (B).\n\nQ: For which of these two scenarios does the main\
-  \ character (who uses I/me/my) do something clearly morally wrong, according to\
-  \ ordinary moral standards in the US as of 2020? Scenario 1 | I bought an old pick-up\
-  \ truck for work. Scenario 2 | I looked up at the gorgeous waterfall and threw a\
-  \ rock at it.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not\
-  \ wrong, Not wrong\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on moral scenarios for help. Neither of these are wrong as no one is wronged by\
-  \ buing a pick-up truck or throwing a rock into a stream. The answer is (D).\n\n\
-  Q: For which of these two scenarios does the main character (who uses I/me/my) do\
-  \ something clearly morally wrong, according to ordinary moral standards in the\
-  \ US as of 2020? Scenario 1 | I really loved my newborn baby boy. Scenario 2 | The\
-  \ new girl is embarrassed to live in a trailer park, so I invite the class to her\
-  \ house.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong,\
-  \ Not wrong\nA: Let's think step by step. We refer to Wikipedia articles on moral\
-  \ scenarios for help. Loving someone is not wrong. However, exposing something that\
-  \ someone is embarrassed about could be considered quite mean. The answer is (C).\n\n"
-"group": "mmlu_flan_cot_fewshot_humanities"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_moral_scenarios"
+dataset_name: moral_scenarios
+description: The following are multiple choice questions (with answers) about moral
+  scenarios.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'For which of these two scenarios does the main character (who uses I/me/my)
+      do something clearly morally wrong, according to ordinary moral standards in
+      the US as of 2020? Scenario 1 | I took his wallet after he harassed me at the
+      bar. Scenario 2 | I made love to my newborn baby boy.
+
+      (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not
+      wrong'
+    target: Let's think step by step. We refer to Wikipedia articles on moral scenarios
+      for help. Stealing from someone is not considered an ethical response to harrassment.
+      And having sex with a minor is egregiously unethical. The answer is (A).
+  - question: 'For which of these two scenarios does the main character (who uses I/me/my)
+      do something clearly morally wrong, according to ordinary moral standards in
+      the US as of 2020? Scenario 1 | I asked for a cigarette from the guy outside
+      the store. Scenario 2 | When the security guard left his post, I pocketed a
+      rare coin.
+
+      (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not
+      wrong'
+    target: Let's think step by step. We refer to Wikipedia articles on moral scenarios
+      for help. Asking for something is not morally wrong. Stealing is morally wrong.
+      The answer is (C).
+  - question: 'For which of these two scenarios does the main character (who uses I/me/my)
+      do something clearly morally wrong, according to ordinary moral standards in
+      the US as of 2020? Scenario 1 | During the house''s demolition, I pushed him
+      into a wall. Scenario 2 | When I choose a babysitter for my twin nieces, I don''t
+      care about her racial background.
+
+      (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not
+      wrong'
+    target: Let's think step by step. We refer to Wikipedia articles on moral scenarios
+      for help. Physical violence is morally wrong. Being open to hiring people from
+      different racial backgrounds is an expected moral stance. The answer is (B).
+  - question: 'For which of these two scenarios does the main character (who uses I/me/my)
+      do something clearly morally wrong, according to ordinary moral standards in
+      the US as of 2020? Scenario 1 | I bought an old pick-up truck for work. Scenario
+      2 | I looked up at the gorgeous waterfall and threw a rock at it.
+
+      (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not
+      wrong'
+    target: Let's think step by step. We refer to Wikipedia articles on moral scenarios
+      for help. Neither of these are wrong as no one is wronged by buing a pick-up
+      truck or throwing a rock into a stream. The answer is (D).
+  - question: 'For which of these two scenarios does the main character (who uses I/me/my)
+      do something clearly morally wrong, according to ordinary moral standards in
+      the US as of 2020? Scenario 1 | I really loved my newborn baby boy. Scenario
+      2 | The new girl is embarrassed to live in a trailer park, so I invite the class
+      to her house.
+
+      (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not
+      wrong'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on moral scenarios
+      for help. Loving someone is not wrong. However, exposing something that someone
+      is embarrassed about could be considered quite mean. The answer is (C).'
+group: mmlu_flan_cot_fewshot_humanities
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_moral_scenarios
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_nutrition.yaml
-"dataset_name": "nutrition"
-"description": "The following are multiple choice questions (with answers) about nutrition.\n\
-  \nQ: What is the first-line drug for patients with type 2 diabetes and obesity,\
-  \ as of 2020?\n(A) Acarbose (B) Metformin (C) Sulphonylureas (D) Insulin\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on nutrition for help. Metformin\
-  \ (Fortamet, Glumetza, or others) is usually the first medication prescribed for\
-  \ type 2 diabetes, as well as obesity. It works by lowering glucose production in\
-  \ the liver and improving the body's sensitivity to insulin. The answer is (B).\n\
-  \nQ: Which of the following statements is correct (according to knowledge in 2020)?\n\
-  (A) Consumers with phenylketonuria must avoid the consumption of the sweetener aspartame\
-  \ (B) Consumers with phenylketonuria must avoid the consumption of the sweetener\
-  \ saccharin (C) Consumers with phenylketonuria must avoid the consumption of the\
-  \ sweetener sucralose (D) Consumers with phenylketonuria must avoid the consumption\
-  \ of the sweetener acesulfame K\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on nutrition for help. People with phenylketonuria (PKU) cannot break\
-  \ down the amino acid phenylalanine. As it builds up in the blood and brain it can\
-  \ lead to brain damage. People with PKU should avoid foods that are converted to\
-  \ phenylalanine in the body, such as aspartame. The answer is (A).\n\nQ: Which of\
-  \ the following statements about iodine is correct, as of 2020?\n(A) 50% of adults\
-  \ consume iodine at levels below the RNI (B) Dairy products are a poor source of\
-  \ iodine (C) The iodine content of organic milk is generally lower that the level\
-  \ in non-organic milk (D) UK dietary reference values recommend an increase in iodine\
-  \ intake in pregnancy\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on nutrition for help. Organic milk usually has less iodine content than non-organic\
-  \ milk. The answer is (C).\n\nQ: Which of the following is the most plausible explanation\
-  \ for the protective effect of dietary fibre against cancer of the colon, as of\
-  \ 2020?\n(A) Propionic acid, formed during colonic fibre fermentation inhibits liver\
-  \ fatty acid synthesis (B) Butyric acid, formed during colonic fibre fermentation\
-  \ stimulates \"silencing\" of the SLC5A8 tumour suppressor gene (C) None of these\
-  \ options are correct (D) Butyric acid, formed during colonic fibre fermentation\
-  \ stimulates anti-oxidant defences in the colon\nA: Let's think step by step. We\
-  \ refer to Wikipedia articles on nutrition for help. Dietary fibre is inversely\
-  \ proportional to the risk of colorectal cancer. This is presumed because butyric\
-  \ acid (BA) stimulates antioxidants which help protect the colon from cancerous\
-  \ tumors. The answer is (D).\n\nQ: In a cohort study, the risk ratio of developing\
-  \ diabetes was 0.86 when comparing consumers of tea (the exposed) to those who did\
-  \ not drink tea (the unexposed). Which one statement is correct (according to knowledge\
-  \ in 2020)?\n(A) The tea drinkers have lower risk of developing diabetes. (B) The\
-  \ tea drinkers have higher risk of developing diabetes. (C) Based on the information\
-  \ given we cannot tell if the observed difference in disease risk is the result\
-  \ of chance. (D) The risk ratio is close to the value one, so there is no difference\
-  \ in disease risk between the two groups.\nA: Let's think step by step. We refer\
-  \ to Wikipedia articles on nutrition for help. The risk ratio is not sufficiently\
-  \ reduced that it could not be explained by random chance given the studies sample\
-  \ size. The answer is (C).\n\n"
-"group": "mmlu_flan_cot_fewshot_other"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_nutrition"
+dataset_name: nutrition
+description: The following are multiple choice questions (with answers) about nutrition.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'What is the first-line drug for patients with type 2 diabetes and obesity,
+      as of 2020?
+
+      (A) Acarbose (B) Metformin (C) Sulphonylureas (D) Insulin'
+    target: Let's think step by step. We refer to Wikipedia articles on nutrition
+      for help. Metformin (Fortamet, Glumetza, or others) is usually the first medication
+      prescribed for type 2 diabetes, as well as obesity. It works by lowering glucose
+      production in the liver and improving the body's sensitivity to insulin. The
+      answer is (B).
+  - question: 'Which of the following statements is correct (according to knowledge in
+      2020)?
+
+      (A) Consumers with phenylketonuria must avoid the consumption of the sweetener
+      aspartame (B) Consumers with phenylketonuria must avoid the consumption of the
+      sweetener saccharin (C) Consumers with phenylketonuria must avoid the consumption
+      of the sweetener sucralose (D) Consumers with phenylketonuria must avoid the
+      consumption of the sweetener acesulfame K'
+    target: Let's think step by step. We refer to Wikipedia articles on nutrition
+      for help. People with phenylketonuria (PKU) cannot break down the amino acid
+      phenylalanine. As it builds up in the blood and brain it can lead to brain damage.
+      People with PKU should avoid foods that are converted to phenylalanine in the
+      body, such as aspartame. The answer is (A).
+  - question: 'Which of the following statements about iodine is correct, as of 2020?
+
+      (A) 50% of adults consume iodine at levels below the RNI (B) Dairy products
+      are a poor source of iodine (C) The iodine content of organic milk is generally
+      lower that the level in non-organic milk (D) UK dietary reference values recommend
+      an increase in iodine intake in pregnancy'
+    target: Let's think step by step. We refer to Wikipedia articles on nutrition
+      for help. Organic milk usually has less iodine content than non-organic milk.
+      The answer is (C).
+  - question: 'Which of the following is the most plausible explanation for the protective
+      effect of dietary fibre against cancer of the colon, as of 2020?
+
+      (A) Propionic acid, formed during colonic fibre fermentation inhibits liver
+      fatty acid synthesis (B) Butyric acid, formed during colonic fibre fermentation
+      stimulates "silencing" of the SLC5A8 tumour suppressor gene (C) None of these
+      options are correct (D) Butyric acid, formed during colonic fibre fermentation
+      stimulates anti-oxidant defences in the colon'
+    target: Let's think step by step. We refer to Wikipedia articles on nutrition
+      for help. Dietary fibre is inversely proportional to the risk of colorectal
+      cancer. This is presumed because butyric acid (BA) stimulates antioxidants which
+      help protect the colon from cancerous tumors. The answer is (D).
+  - question: 'In a cohort study, the risk ratio of developing diabetes was 0.86
+      when comparing consumers of tea (the exposed) to those who did not drink tea
+      (the unexposed). Which one statement is correct (according to knowledge in 2020)?
+
+      (A) The tea drinkers have lower risk of developing diabetes. (B) The tea drinkers
+      have higher risk of developing diabetes. (C) Based on the information given
+      we cannot tell if the observed difference in disease risk is the result of
+      chance. (D) The risk ratio is close to the value one, so there is no difference
+      in disease risk between the two groups.'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on nutrition
+      for help. The risk ratio is not sufficiently reduced that it could not be explained
+      by random chance given the studies sample size. The answer is (C).'
+group: mmlu_flan_cot_fewshot_other
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_nutrition
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_philosophy.yaml
-"dataset_name": "philosophy"
-"description": "The following are multiple choice questions (with answers) about philosophy.\n\
-  \nQ: The study of reality in the broadest sense, an inquiry into the elemental nature\
-  \ of the universe and the things in it, is known as _____.\n(A) metaphysics (B)\
-  \ epistemology (C) quantum physics (D) axiology\nA: Let's think step by step. We\
-  \ refer to Wikipedia articles on philosophy for help. Among the options, only metaphysics\
-  \ studies the nature of reality and existence. The answer is (A).\n\nQ: According\
-  \ to Moore’s “ideal utilitarianism,” the right action is the one that brings about\
-  \ the greatest amount of:\n(A) pleasure. (B) happiness. (C) good. (D) virtue.\n\
-  A: Let's think step by step. We refer to Wikipedia articles on philosophy for help.\
-  \ Moore's \"ideal utilitarianism\" states that one's actions should maximize intrinsic\
-  \ goods. The answer is (C).\n\nQ: Before Tolstoy's Christian conversion, what was\
-  \ his perspective on the meaning of life?\n(A) optimist (B) satisfied (C) nominally\
-  \ religious (D) pessimist\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on philosophy for help. Before his conversion, Tolstoy feels that life was uncertain,\
-  \ which is a pessimist's point of view. The answer is (D).\n\nQ: According to d'Holbach,\
-  \ people always act according to _____.\n(A) free choices (B) dictates of the soul\
-  \ (C) necessary natural laws (D) undetermined will\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on philosophy for help. d'Holbach believes that\
-  \ people act according to necessary laws, and it proves nothing about people's free\
-  \ will. The answer is (C).\n\nQ: Psychological egoism is:\n(A) an ethical theory\
-  \ about how we ought to behave. (B) a generalization concerning the way people tend\
-  \ to behave. (C) a claim about human nature and the ways people are capable of behaving.\
-  \ (D) none of the above.\nA: Let's think step by step. We refer to Wikipedia articles
+dataset_name: philosophy
+description: The following are multiple choice questions (with answers) about philosophy.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'The study of reality in the broadest sense, an inquiry into the elemental
+      nature of the universe and the things in it, is known as _____.
+
+      (A) metaphysics (B) epistemology (C) quantum physics (D) axiology'
+    target: Let's think step by step. We refer to Wikipedia articles on philosophy
+      for help. Among the options, only metaphysics studies the nature of reality
+      and existence. The answer is (A).
+  - question: "According to Moore\u2019s \u201Cideal utilitarianism,\u201D the right\
+      \ action is the one that brings about the greatest amount of:\n(A) pleasure.\
+      \ (B) happiness. (C) good. (D) virtue."
+    target: Let's think step by step. We refer to Wikipedia articles on philosophy
+      for help. Moore's "ideal utilitarianism" states that one's actions should maximize
+      intrinsic goods. The answer is (C).
+  - question: 'Before Tolstoy''s Christian conversion, what was his perspective on the
+      meaning of life?
+
+      (A) optimist (B) satisfied (C) nominally religious (D) pessimist'
+    target: Let's think step by step. We refer to Wikipedia articles on philosophy
\
for help. Before his conversion, Tolstoy feels that life was uncertain, which
\
on
philosophy
for
help.
Psychological
egoism
suggests
that
one
behaves
based
on
\
is a pessimist's point of view. The answer is (D).
\
what
makes
one
feels
good,
hence
it
is
a
claim
about
human
nature
and
how
humans
\
-
question
:
'
According
to
d'
'
Holbach,
people
always
act
according
to
_____.
\
are
capable
of
behaving.
The
answer
is
(C).
\n\n
"
"
group"
:
"
mmlu_flan_cot_fewshot_humanities"
(A)
free
choices
(B)
dictates
of
the
soul
(C)
necessary
natural
laws
(D)
undetermined
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
will'
"
task"
:
"
mmlu_flan_cot_fewshot_philosophy"
    target: Let's think step by step. We refer to Wikipedia articles on philosophy
      for help. d'Holbach believes that people act according to necessary laws, and
      it proves nothing about people's free will. The answer is (C).
  - question: 'Psychological egoism is:

      (A) an ethical theory about how we ought to behave. (B) a generalization concerning
      the way people tend to behave. (C) a claim about human nature and the ways people
      are capable of behaving. (D) none of the above.'
    target: 'Let''s think step by step. We refer to Wikipedia articles on philosophy
      for help. Psychological egoism suggests that one behaves based on what makes
      one feels good, hence it is a claim about human nature and how humans are capable
      of behaving. The answer is (C).'
group: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_philosophy
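The new layout moves the few-shot examples out of the `description` string into `fewshot_config.samples`, with `sampler: first_n`. A minimal sketch of how such a config could be turned back into a prompt is given below. It is not the harness's implementation: `first_n` and `build_prompt` are hypothetical helpers, and the `Q:` / `A:` layout is an assumption copied from the old inline description string, since the shared template `_mmlu_flan_cot_fewshot_template_yaml` is not shown in this diff.

```python
from typing import Dict, List

# Text copied from the philosophy config above. The Q:/A: layout follows the
# old inline description string and is an assumption about the shared template.
DESCRIPTION = (
    "The following are multiple choice questions (with answers) about philosophy."
)

SAMPLES: List[Dict[str, str]] = [
    {
        "question": (
            "The study of reality in the broadest sense, an inquiry into the "
            "elemental nature of the universe and the things in it, is known "
            "as _____.\n"
            "(A) metaphysics (B) epistemology (C) quantum physics (D) axiology"
        ),
        "target": (
            "Let's think step by step. We refer to Wikipedia articles on "
            "philosophy for help. Among the options, only metaphysics studies "
            "the nature of reality and existence. The answer is (A)."
        ),
    },
    {
        "question": (
            "Psychological egoism is:\n"
            "(A) an ethical theory about how we ought to behave. "
            "(B) a generalization concerning the way people tend to behave. "
            "(C) a claim about human nature and the ways people are capable "
            "of behaving. (D) none of the above."
        ),
        "target": (
            "Let's think step by step. We refer to Wikipedia articles on "
            "philosophy for help. Psychological egoism suggests that one "
            "behaves based on what makes one feels good, hence it is a claim "
            "about human nature and how humans are capable of behaving. "
            "The answer is (C)."
        ),
    },
]


def first_n(samples: List[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the `first_n` sampler: keep the first n samples."""
    return samples[:n]


def build_prompt(
    description: str, samples: List[Dict[str, str]], n: int, new_question: str
) -> str:
    """Join the description, n worked examples, and the unanswered question."""
    parts = [description]
    for sample in first_n(samples, n):
        parts.append(f"Q: {sample['question']}\nA: {sample['target']}")
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(DESCRIPTION, SAMPLES, 1, "Example question?\n(A) a (B) b (C) c (D) d"))
```

Keeping the samples as structured data lets the number of shots be chosen at run time, which the old single-string `description` could not do.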
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_prehistory.yaml
"
dataset_name"
:
"
prehistory"
dataset_name
:
prehistory
"
description"
:
"
The
following
are
multiple
choice
questions
(with
answers)
about
prehistory.
\n\
description
:
The following are multiple choice questions (with answers) about prehistory.
\n
Q:
What
is
the
approximate
mean
cranial
capacity
of
Homo
erectus?
\n
(A)
under
650
\
fewshot_config
:
\
cc
(B)
about
800
cc
(C)
just
under
1000
cc
(D)
1200
cc
\n
A:
Let's
think
step
by
\
sampler
:
first_n
\
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
The
average
cranium
\
samples
:
\
capacity
of
Homo
erectus
is
less
than
1000
cubic
cm.
The
answer
is
(C).
\n\n
Q:
\
-
question
:
'
What
is
the
approximate
mean
cranial
capacity
of
Homo
erectus?
\
According
to
Timothy
Pauketat,
the
evidence
for
social
stratification
and
political
\
\
power
at
Cahokia
suggests:
\n
(A)
a
center
of
Mississippian
civilization
with
conditions
\
(A)
under
650
cc
(B)
about
800
cc
(C)
just
under
1000
cc
(D)
1200
cc'
\
similar
to
the
rise
of
early
states.
(B)
the
limitations
of
authority
in
a
Native
\
target
:
Let's think step by step. We refer to Wikipedia articles on prehistory
\
American
society
of
egalitarian
foragers.
(C)
a
simple
chiefdom
or
perhaps
a
complex
\
for help. The average cranium capacity of Homo erectus is less than 1000 cubic
\
chiefdom
had
evolved
by
A.D.
1500.
(D)
a
center
of
Mississippian
civilization
\
cm. The answer is (C).
\
with
conditions
similar
to
societies
on
the
Northwest
Coast
of
North
America.
\n\
-
question
:
'
According
to
Timothy
Pauketat,
the
evidence
for
social
stratification
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
\
and
political
power
at
Cahokia
suggests:
\
Timothy
Pauketat
is
known
for
his
research
on
Cahokia,
the
center
of
the
Mississippian
\
\
culture,
where
he
found
similar
conditions
to
the
rise
of
early
states.
The
answer
\
(A)
a
center
of
Mississippian
civilization
with
conditions
similar
to
the
rise
\
is
(A).
\n\n
Q:
Recent
research
on
hominid
species
dating
from
the
Middle
Pliocene
\
of
early
states.
(B)
the
limitations
of
authority
in
a
Native
American
society
\
indicates
there
was
(as
of
2020):
\n
(A)
a
great
amount
of
species
diversity,
or
\
of
egalitarian
foragers.
(C)
a
simple
chiefdom
or
perhaps
a
complex
chiefdom
\
a
single
species
that
exhibited
a
lot
of
diversity.
(B)
very
little
species
diversity
\
had
evolved
by
A.D.
1500.
(D)
a
center
of
Mississippian
civilization
with
conditions
\
during
this
period
and
very
few
hominids.
(C)
decreased
species
diversity
due
\
similar
to
societies
on
the
Northwest
Coast
of
North
America.'
\
to
a
prolonged
ice
age
followed
by
a
severe
drought.
(D)
decreased
species
diversity
\
target
:
Let's think step by step. We refer to Wikipedia articles on prehistory
\
but
increased
numbers
of
hammerstones
and
flakes,
indicating
stone
tool
manufacture.
\n\
for help. Timothy Pauketat is known for his research on Cahokia, the center
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
\
of the Mississippian culture, where he found similar conditions to the rise
\
Recent
research
has
recognized
multiple
hominid
species
from
the
Middle
Pliocene,
\
of early states. The answer is (A).
\
meaning
that
there
is
a
great
amount
of
species
diversity
or
diversity
in
a
single
\
-
question
:
'
Recent
research
on
hominid
species
dating
from
the
Middle
Pliocene
indicates
\
species.
The
answer
is
(A).
\n\n
Q:
Researchers
now
believe
that
the
decline
of
\
there
was
(as
of
2020):
\
the
Maya
was
caused
chiefly
by:
\n
(A)
a
cataclysm
of
some
kind,
such
as
an
earthquake,
\
\
volcano,
or
tsunami.
(B)
ecological
degradation
resulting
from
slash-and-burn
\
(A)
a
great
amount
of
species
diversity,
or
a
single
species
that
exhibited
\
farming
techniques.
(C)
endless
wars
between
neighboring
Mayan
city-states.
(D)
\
a
lot
of
diversity.
(B)
very
little
species
diversity
during
this
period
and
\
practices
of
interbreeding
that
led
to
a
steep
rise
in
congenital
disorders.
\n\
very
few
hominids.
(C)
decreased
species
diversity
due
to
a
prolonged
ice
age
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
\
followed
by
a
severe
drought.
(D)
decreased
species
diversity
but
increased
\
Researchers
believe
that
the
Maya
collapse
was
mainly
caused
by
over-exploitation
\
numbers
of
hammerstones
and
flakes,
indicating
stone
tool
manufacture.'
\
of
natural
resources
like
the
slash-and-burn
farming
techniques.
The
answer
is
\
target
:
Let's think step by step. We refer to Wikipedia articles on prehistory
\
(B).
\n\n
Q:
The
great
Mayan
king
Pacal
built
temples
in
the
city
of
Palenque
in
\
for help. Recent research has recognized multiple hominid species from the Middle
\
order
to:
\n
(A)
satisfy
the
powerful
Mayan
astronomer
priests.
(B)
display
his
\
Pliocene, meaning that there is a great amount of species diversity or diversity
\
generosity
to
the
common
people,
since
they
were
allowed
to
live
in
the
temples.
\
in a single species. The answer is (A).
\
(C)
frighten
away
enemies,
in
particular
the
Spaniards.
(D)
legitimize
his
kingship,
\
-
question
:
'
Researchers
now
believe
that
the
decline
of
the
Maya
was
caused
chiefly
\
since
his
father
was
not
royal.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
by:
\
articles
on
prehistory
for
help.
Pacal
built
the
temples
as
the
funerary
monument
\
\
to
legitimize
his
kingship.
The
answer
is
(D).
\n\n
"
(A)
a
cataclysm
of
some
kind,
such
as
an
earthquake,
volcano,
or
tsunami.
(B)
"
group"
:
"
mmlu_flan_cot_fewshot_humanities"
ecological
degradation
resulting
from
slash-and-burn
farming
techniques.
(C)
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
endless
wars
between
neighboring
Mayan
city-states.
(D)
practices
of
interbreeding
"
task"
:
"
mmlu_flan_cot_fewshot_prehistory"
that
led
to
a
steep
rise
in
congenital
disorders.'
    target: Let's think step by step. We refer to Wikipedia articles on prehistory
      for help. Researchers believe that the Maya collapse was mainly caused by over-exploitation
      of natural resources like the slash-and-burn farming techniques. The answer
      is (B).
  - question: 'The great Mayan king Pacal built temples in the city of Palenque in
      order to:

      (A) satisfy the powerful Mayan astronomer priests. (B) display his generosity
      to the common people, since they were allowed to live in the temples. (C) frighten
      away enemies, in particular the Spaniards. (D) legitimize his kingship, since
      his father was not royal.'
    target: 'Let''s think step by step. We refer to Wikipedia articles on prehistory
      for help. Pacal built the temples as the funerary monument to legitimize his
      kingship. The answer is (D).'
group: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_prehistory
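Every `target` in these files ends with the fixed phrase "The answer is (X).", so a chain-of-thought completion can be scored by pulling out that letter. The sketch below illustrates the idea with a plain regular expression. The pattern and the `extract_answer` helper are assumptions for illustration; the filter actually used by the task is defined in the shared template, which this diff does not show.

```python
import re
from typing import Optional

# Assumed pattern: the closing phrase used by every target in these configs.
ANSWER_RE = re.compile(r"The answer is \(([A-D])\)")


def extract_answer(text: str) -> Optional[str]:
    """Return the last answer letter found in the text, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


if __name__ == "__main__":
    target = (
        "Let's think step by step. We refer to Wikipedia articles on prehistory "
        "for help. Pacal built the temples as the funerary monument to "
        "legitimize his kingship. The answer is (D)."
    )
    print(extract_answer(target))
```

Taking the last match rather than the first guards against a completion that restates an option letter mid-reasoning before settling on its final answer.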
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_accounting.yaml
"
dataset_name"
:
"
professional_accounting"
dataset_name
:
professional_accounting
"
description"
:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
description
:
The following are multiple choice questions (with answers) about professional
\
accounting.
\n\n
Q:
An
auditor
traces
the
serial
numbers
on
equipment
to
a
nonissuer’s
\
accounting.
\
subledger.
Which
of
the
following
management
assertions
is
supported
by
this
test?
\n\
fewshot_config
:
(A)
Valuation
and
allocation
(B)
Completeness
(C)
Rights
and
obligations
(D)
Presentation
\
sampler
:
first_n
\
and
disclosure
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
\
samples
:
\
accounting
for
help.
The
completeness
assertion
is
tested
by
tracing
supporting
\
-
question
:
"
An
auditor
traces
the
serial
numbers
on
equipment
to
a
nonissuer
\u2019\
\
documents
to
the
record
entries.
The
answer
is
(B).
\n\n
Q:
One
hundred
years
ago,
\
s
subledger.
Which
of
the
following
management
assertions
is
supported
by
this
\
\
your
great-great-grandmother
invested
$100
at
5%
yearly
interest.
What
is
the
\
\
test?
\n
(A)
Valuation
and
allocation
(B)
Completeness
(C)
Rights
and
obligations
\
\
investment
worth
today?
\n
(A)
$13,000
(B)
$600
(C)
$15,000
(D)
$28,000
\n
A:
Let's
\
\
(D)
Presentation
and
disclosure"
\
think
step
by
step.
We
refer
to
Wikipedia
articles
on
accounting
for
help.
A
$100
\
target
:
Let's think step by step. We refer to Wikipedia articles on accounting
\
investment
at
5%
yearly
interest
is
worth
100*(1.05)^100=13150
after
100
years,
\
for help. The completeness assertion is tested by tracing supporting documents
\
which
is
around
$13,000.
The
answer
is
(A).
\n\n
Q:
On
January
1,
year
1,
Alpha
\
to the record entries. The answer is (B).
\
Co.
signed
an
annual
maintenance
agreement
with
a
software
provider
for
$15,000
\
-
question
:
'
One
hundred
years
ago,
your
great-great-grandmother
invested
$100
at
5%
\
and
the
maintenance
period
begins
on
March
1,
year
1.
Alpha
also
incurred
$5,000
\
yearly
interest.
What
is
the
investment
worth
today?
\
of
costs
on
January
1,
year
1,
related
to
software
modification
requests
that
\
\
will
increase
the
functionality
of
the
software.
Alpha
depreciates
and
amortizes
\
(A)
$13,000
(B)
$600
(C)
$15,000
(D)
$28,000'
\
its
computer
and
software
assets
over
five
years
using
the
straight-line
method.
\
target
:
Let's think step by step. We refer to Wikipedia articles on accounting
\
What
amount
is
the
total
expense
that
Alpha
should
recognize
related
to
the
maintenance
\
for help. A $100 investment at 5% yearly interest is worth 100*(1.05)^100=13150
\
agreement
and
the
software
modifications
for
the
year
ended
December
31,
year
\
after 100 years, which is around $13,000. The answer is (A).
\
1?
\n
(A)
$5,000
(B)
$13,500
(C)
$16,000
(D)
$20,000
\n
A:
Let's
think
step
by
step.
\
-
question
:
'
On
January
1,
year
1,
Alpha
Co.
signed
an
annual
maintenance
agreement
\
We
refer
to
Wikipedia
articles
on
accounting
for
help.
The
maintenance
period
\
with
a
software
provider
for
$15,000
and
the
maintenance
period
begins
on
March
\
begins
on
March
1,
so
only
10
months
of
expenses
should
be
recognized,
which
is
\
1,
year
1.
Alpha
also
incurred
$5,000
of
costs
on
January
1,
year
1,
related
\
$15,000/12*10=$12,500.
The
software
modification
cost
is
amortized
over
5
years,
\
to
software
modification
requests
that
will
increase
the
functionality
of
the
\
so
each
year
is
$5,000/5=$1,000.
So
the
total
expense
is
$12,500+$1,000=$13,500.
\
software.
Alpha
depreciates
and
amortizes
its
computer
and
software
assets
over
\
The
answer
is
(B).
\n\n
Q:
Krete
is
an
unmarried
taxpayer
with
income
exclusively
\
five
years
using
the
straight-line
method.
What
amount
is
the
total
expense
\
from
wages.
By
December
31,
year
1,
Krete's
employer
has
withheld
$16,000
in
federal
\
that
Alpha
should
recognize
related
to
the
maintenance
agreement
and
the
software
\
income
taxes
and
Krete
has
made
no
estimated
tax
payments.
On
April
15,
year
2,
\
modifications
for
the
year
ended
December
31,
year
1?
\
Krete
timely
filed
for
an
extension
request
to
file
her
individual
tax
return,
\
\
and
paid
$300
of
additional
taxes.
Krete's
year
1
tax
liability
was
$16,500
when
\
(A)
$5,000
(B)
$13,500
(C)
$16,000
(D)
$20,000'
\
she
timely
filed
her
return
on
April
30,
year
2,
and
paid
the
remaining
tax
liability
\
target
:
Let's think step by step. We refer to Wikipedia articles on accounting
\
balance.
What
amount
would
be
subject
to
the
penalty
for
underpayment
of
estimated
\
for help. The maintenance period begins on March 1, so only 10 months of expenses
\
taxes?
\n
(A)
$0
(B)
$500
(C)
$1,650
(D)
$16,500
\n
A:
Let's
think
step
by
step.
We
\
should be recognized, which is $15,000/12*10=$12,500. The software modification
\
refer
to
Wikipedia
articles
on
accounting
for
help.
The
tax
due
after
withholding
\
cost is amortized over 5 years, so each year is $5,000/5=$1,000. So the total
\
is
$16,500-$16,000=$500,
which
is
less
than
$1000,
hence
there
is
no
underpayment
\
expense is $12,500+$1,000=$13,500. The answer is (B).
\
penalty
of
estimated
taxes.
The
answer
is
(A).
\n\n
Q:
Box
a
nongovernmental
not-for-profit
\
-
question
:
'
Krete
is
an
unmarried
taxpayer
with
income
exclusively
from
wages.
By
\
organization
had
the
following
transactions
during
the
year:
Proceeds
from
sale
\
December
31,
year
1,
Krete'
'
s
employer
has
withheld
$16,000
in
federal
income
\
of
investments
$80000
Purchase
of
property
plant
and
equipment
$10000
Proceeds
\
taxes
and
Krete
has
made
no
estimated
tax
payments.
On
April
15,
year
2,
Krete
\
from
long-term
debt
$100000
Loss
on
sale
of
investment
$5000
What
amount
should
\
timely
filed
for
an
extension
request
to
file
her
individual
tax
return,
and
\
be
reported
as
net
cash
provided
by
financing
activities
in
Box's
statement
of
\
paid
$300
of
additional
taxes.
Krete'
'
s
year
1
tax
liability
was
$16,500
when
\
cash
flows?
\n
(A)
$70,000
(B)
$75,000
(C)
$80,000
(D)
100000
\n
A:
Let's
think
step
\
she
timely
filed
her
return
on
April
30,
year
2,
and
paid
the
remaining
tax
\
by
step.
We
refer
to
Wikipedia
articles
on
accounting
for
help.
Among
the
four
\
liability
balance.
What
amount
would
be
subject
to
the
penalty
for
underpayment
\
transactions,
only
Proceeds
from
long-term
debt
belongs
to
the
financing
activities
\
of
estimated
taxes?
\
section
of
cashflow,
hence
the
amount
reported
should
be
$100000.
The
answer
is
\
\
(D).
\n\n
"
(A)
$0
(B)
$500
(C)
$1,650
(D)
$16,500'
"
group"
:
"
mmlu_flan_cot_fewshot_other"
target
:
Let's think step by step. We refer to Wikipedia articles on accounting
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
for help. The tax due after withholding is $16,500-$16,000=$500, which is less
"
task"
:
"
mmlu_flan_cot_fewshot_professional_accounting"
than $1000, hence there is no underpayment penalty of estimated taxes. The answer
is (A).
  - question: 'Box a nongovernmental not-for-profit organization had the following
      transactions during the year: Proceeds from sale of investments $80000 Purchase
      of property plant and equipment $10000 Proceeds from long-term debt $100000
      Loss on sale of investment $5000 What amount should be reported as net cash
      provided by financing activities in Box''s statement of cash flows?

      (A) $70,000 (B) $75,000 (C) $80,000 (D) 100000'
    target: 'Let''s think step by step. We refer to Wikipedia articles on accounting
      for help. Among the four transactions, only Proceeds from long-term debt belongs
      to the financing activities section of cashflow, hence the amount reported should
      be $100000. The answer is (D).'
group: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_accounting
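Several accounting targets above carry their arithmetic inline (`100*(1.05)^100=13150`, `$15,000/12*10=$12,500`, `$5,000/5=$1,000`, `$16,500-$16,000=$500`). The short check below recomputes those figures as a sanity test of the worked answers. The function and variable names are illustrative only and do not come from the repository.

```python
def compound_value(principal: float, rate: float, years: int) -> float:
    """Future value with yearly compounding: principal * (1 + rate) ** years."""
    return principal * (1 + rate) ** years


# $100 at 5% yearly interest for 100 years; the target rounds this to $13,000.
investment_value = compound_value(100, 0.05, 100)

# Maintenance runs March through December, so 10 of 12 months are expensed.
maintenance_expense = 15_000 / 12 * 10

# Software modification cost amortized straight-line over five years.
modification_amortization = 5_000 / 5

total_expense = maintenance_expense + modification_amortization

# Tax still due after withholding; the target compares this with $1,000.
balance_due = 16_500 - 16_000

if __name__ == "__main__":
    print(f"investment value: {investment_value:.2f}")
    print(f"total expense:    {total_expense:.2f}")
    print(f"balance due:      {balance_due}")
```

The recomputed values agree with the answers given in the targets: option (A) for the investment, (B) for Alpha's total expense, and (A) for Krete's penalty base.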
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_law.yaml
"
dataset_name"
:
"
professional_law"
dataset_name
:
professional_law
"
description"
:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
description
:
The following are multiple choice questions (with answers) about professional
\
law.
\n\n
Q:
A
son
owed
a
creditor
$5,000.
The
son's
father
contacted
the
creditor
\
law.
\
and
told
him
that
he
wanted
to
pay
the
son's
debt.
The
father
signed
a
document
\
fewshot_config
:
\
that
stated
the
father
would
pay
the
son's
debt
at
a
rate
of
$500
a
month
for
\
sampler
:
first_n
\
10
months.
The
creditor
made
no
written
or
oral
commitment
to
forbear
to
sue
the
\
samples
:
\
son
to
collect
the
$5,000
debt,
and
the
father
made
no
oral
or
written
request
\
-
question
:
'
A
son
owed
a
creditor
$5,000.
The
son'
'
s
father
contacted
the
creditor
\
for
any
such
forbearance.
For
the
next
five
months,
the
father
made
and
the
creditor
\
and
told
him
that
he
wanted
to
pay
the
son'
'
s
debt.
The
father
signed
a
document
\
accepted
the
$500
monthly
payments
as
agreed.
During
that
period,
the
creditor,
\
that
stated
the
father
would
pay
the
son'
'
s
debt
at
a
rate
of
$500
a
month
for
\
in
fact,
did
forbear
to
take
any
legal
action
against
the
son.
However,
the
father
\
10
months.
The
creditor
made
no
written
or
oral
commitment
to
forbear
to
sue
\
then
informed
the
creditor
that
he
would
make
no
further
payments
on
the
debt.
\
the
son
to
collect
the
$5,000
debt,
and
the
father
made
no
oral
or
written
request
\
Which
of
the
following
is
the
most
persuasive
argument
that
the
father
is
liable
\
for
any
such
forbearance.
For
the
next
five
months,
the
father
made
and
the
\
to
the
creditor
under
the
terms
of
their
agreement?
\n
(A)
The
father's
promise
\
creditor
accepted
the
$500
monthly
payments
as
agreed.
During
that
period,
the
\
and
the
creditor's
reliance
thereon,
if
proved,
gave
rise
to
a
valid
claim
by
\
creditor,
in
fact,
did
forbear
to
take
any
legal
action
against
the
son.
However,
\
the
creditor
against
the
father
based
on
the
doctrine
of
promissory
estoppel.
\
the
father
then
informed
the
creditor
that
he
would
make
no
further
payments
\
(B)
Because
it
was
foreseeable
that
the
father's
promise
would
induce
the
creditor
\
on
the
debt.
Which
of
the
following
is
the
most
persuasive
argument
that
the
\
to
forbear
taking
any
action
against
the
son,
such
forbearance
was,
as
a
matter
\
father
is
liable
to
the
creditor
under
the
terms
of
their
agreement?
\
of
law,
a
bargained-for
consideration
for
the
father's
promise.
(C)
The
father's
\
\
five
payments
to
the
creditor
totaling
$2,500
manifested
a
serious
intent
on
the
\
(A)
The
father'
'
s
promise
and
the
creditor'
'
s
reliance
thereon,
if
proved,
gave
\
father's
part
to
be
contractually
bound,
and
such
manifestation
is
generally
recognized
\
rise
to
a
valid
claim
by
the
creditor
against
the
father
based
on
the
doctrine
\
as
an
effective
substitute
for
consideration.
(D)
By
assuming
the
antecedent
debt
\
of
promissory
estoppel.
(B)
Because
it
was
foreseeable
that
the
father'
'
s
promise
\
obligation
that
the
son
owed
to
the
creditor,
the
father
became
a
surety
whose
\
would
induce
the
creditor
to
forbear
taking
any
action
against
the
son,
such
\
promise
to
the
creditor
was
enforceable,
since
it
was
in
writing
and
supported
\
forbearance
was,
as
a
matter
of
law,
a
bargained-for
consideration
for
the
father'
'
s
\
by
adequate
consideration.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
promise.
(C)
The
father'
'
s
five
payments
to
the
creditor
totaling
$2,500
manifested
\
articles
on
law
for
help.
The
doctrine
of
promissory
estoppel
stops
a
person
from
\
a
serious
intent
on
the
father'
'
s
part
to
be
contractually
bound,
and
such
manifestation
\
going
back
on
a
promise
in
contract
law,
hence
option
(A)
should
be
the
most
persuasive
\
is
generally
recognized
as
an
effective
substitute
for
consideration.
(D)
By
\
argument.
The
answer
is
(A).
\n\n
Q:
A
state
has
recently
enacted
a
statute
prohibiting
\
assuming
the
antecedent
debt
obligation
that
the
son
owed
to
the
creditor,
the
\
the
disposal
of
any
nuclear
wastes
within
the
state.
This
law
does
not
contravene
\
father
became
a
surety
whose
promise
to
the
creditor
was
enforceable,
since
\
or
conflict
with
any
federal
statutes.
A
man
operates
a
company
in
the
state
that
\
it
was
in
writing
and
supported
by
adequate
consideration.
'
\
is
engaged
in
the
disposal
of
nuclear
wastes.
Subsequent
to
the
passage
of
the
\
target
:
Let's think step by step. We refer to Wikipedia articles on law for help.
\
state
statute,
the
man,
not
yet
aware
of
the
new
law,
entered
into
contracts
with
\
The doctrine of promissory estoppel stops a person from going back on a promise
\
many
out-of-state
firms
to
dispose
of
their
nuclear
wastes
in
the
state.
On
account
\
in contract law, hence option (A) should be the most persuasive argument. The
\
of
this
new
law,
however,
the
man
will
be
unable
to
perform
these
contracts.
Assume
\
answer is (A).
\
that
the
man
has
standing
to
challenge
this
state
law.
Which
of
the
following
\
-
question
:
'
A
state
has
recently
enacted
a
statute
prohibiting
the
disposal
of
any
\
presents
his
strongest
constitutional
grounds
to
challenge
the
state
law
prohibiting
\
nuclear
wastes
within
the
state.
This
law
does
not
contravene
or
conflict
with
\
the
disposal
of
nuclear
wastes
within
the
state?
\n
(A)
The
commerce
clause.
(B)
\
any
federal
statutes.
A
man
operates
a
company
in
the
state
that
is
engaged
\
The
equal
protection
clause
of
the
Fourteenth
Amendment.
(C)
The
privileges
and
\
in
the
disposal
of
nuclear
wastes.
Subsequent
to
the
passage
of
the
state
statute,
\
immunities
clause
of
Article
IV,
Section
2.
(D)
The
contract
clause.
\n
A:
Let's
\
the
man,
not
yet
aware
of
the
new
law,
entered
into
contracts
with
many
out-of-state
\
think
step
by
step.
We
refer
to
Wikipedia
articles
on
law
for
help.
The
commerce
\
firms
to
dispose
of
their
nuclear
wastes
in
the
state.
On
account
of
this
new
\
clause
states
that
Congress
shall
have
the
power
to
regulate
commerce
with
foreign
\
law,
however,
the
man
will
be
unable
to
perform
these
contracts.
Assume
that
\
Nations,
and
among
the
several
States,
and
with
the
Indian
Tribes.
The
statute
\
the
man
has
standing
to
challenge
this
state
law.
Which
of
the
following
presents
\
affects
inter-state
commerce
which
puts
it
into
question.
Hence
the
man's
strongest
\
his
strongest
constitutional
grounds
to
challenge
the
state
law
prohibiting
\
argument
should
be
the
commerce
clause.
The
answer
is
(A).
\n\n
Q:
On
October
1,
\
the
disposal
of
nuclear
wastes
within
the
state?
\
1980,
a
developer,
owner
of
several
hundred
acres
in
a
rural
county,
drafted
a
\
\
general
development
plan
for
the
area.
The
duly
recorded
plan
imposed
elaborate
\
(A)
The
commerce
clause.
(B)
The
equal
protection
clause
of
the
Fourteenth
Amendment.
\
limitations
and
restrictions
upon
the
land
in
the
plan,
which
was
to
be
developed
\
(C)
The
privileges
and
immunities
clause
of
Article
IV,
Section
2.
(D)
The
contract
\
as
a
residential
district.
The
restrictions
were
to
extend
to
all
persons
acquiring
\
clause.'
\
any
of
the
lots
and
to
their
heirs,
assigns,
and
lessees.
It
was
further
provided
\
target
:
Let's think step by step. We refer to Wikipedia articles on law for help.
\
that
all
subsequent
owners
would
be
charged
with
due
notice
of
the
restrictions.
\
The commerce clause states that Congress shall have the power to regulate commerce
\
Among
those
restrictions
in
the
general
plan
were
the
following:(22)
A
franchise
\
with foreign Nations, and among the several States, and with the Indian Tribes.
\
right
is
created
in
a
strip
of
land
10
feet
in
width
along
the
rear
of
each
lot
\
The statute affects inter-state commerce which puts it into question. Hence
\
for
the
use
of
public
utility
companies
with
right
of
ingress
and
egress.
(23)
\
the man's strongest argument should be the commerce clause. The answer is (A).
\
No
house
or
structure
of
any
kind
shall
be
built
on
the
aforementioned
strip
of
\
-
question
:
'
On
October
1,
1980,
a
developer,
owner
of
several
hundred
acres
in
a
rural
\
land
running
through
the
said
blocks.
In
2000,
a
retiree
purchased
one
of
the
\
county,
drafted
a
general
development
plan
for
the
area.
The
duly
recorded
plan
\
lots,
built
a
house,
and
erected
a
fence
in
the
rear
of
his
property
within
the
\
imposed
elaborate
limitations
and
restrictions
upon
the
land
in
the
plan,
which
\
restricted
area.
In
2004,
a
teacher
purchased
a
lot
adjacent
to
the
retiree's
\
was
to
be
developed
as
a
residential
district.
The
restrictions
were
to
extend
\
property
and
built
a
new
house.
Two
years
later,
a
librarian
purchased
the
lot
\
to
all
persons
acquiring
any
of
the
lots
and
to
their
heirs,
assigns,
and
lessees.
\
that
adjoined
the
teacher's
property.
The
three
deeds
to
those
properties
each
\
It
was
further
provided
that
all
subsequent
owners
would
be
charged
with
due
\
contained
references
to
the
deed
book
where
the
general
plan
was
recorded.
In
\
notice
of
the
restrictions.
Among
those
restrictions
in
the
general
plan
were
\
2008,
the
librarian
began
the
construction
of
a
seven-foot
post-and-rail
fence
\
the
following:(22)
A
franchise
right
is
created
in
a
strip
of
land
10
feet
in
\
along
the
line
dividing
his
lot
with
the
teacher's,
and
along
the
center
of
the
\
width
along
the
rear
of
each
lot
for
the
use
of
public
utility
companies
with
\
area
subject
to
the
franchise
right.
Although
the
teacher
objected
to
its
construction,
\
right
of
ingress
and
egress.
(23)
No
house
or
structure
of
any
kind
shall
be
\
the
fence
was
completed.
If
the
teacher
seeks
a
mandatory
injunction
to
compel
\
built
on
the
aforementioned
strip
of
land
running
through
the
said
blocks.
In
\
removal
of
the
librarian's
fence,
the
court
will
most
likely
\n
(A)
grant
relief,
\
2000,
a
retiree
purchased
one
of
the
lots,
built
a
house,
and
erected
a
fence
\
because
the
fence
was
in
violation
of
the
easement
restriction.
(B)
grant
relief,
\
in
the
rear
of
his
property
within
the
restricted
area.
In
2004,
a
teacher
purchased
\
because
the
encroachment
of
the
fence
violated
the
restriction
in
the
original
\
a
lot
adjacent
to
the
retiree'
'
s
property
and
built
a
new
house.
Two
years
later,
\
plan.
(C)
deny
relief,
because
the
teacher
failed
to
enforce
the
restriction
against
\
a
librarian
purchased
the
lot
that
adjoined
the
teacher'
'
s
property.
The
three
\
the
retiree.
(D)
deny
relief,
because
the
fence
would
not
be
construed
as
\"
a
\
deeds
to
those
properties
each
contained
references
to
the
deed
book
where
the
\
structure
\"
within
the
terms
of
the
restriction.
\n
A:
Let's
think
step
by
step.
\
general
plan
was
recorded.
In
2008,
the
librarian
began
the
construction
of
\
We
refer
to
Wikipedia
articles
on
law
for
help.
The
restrictions
in
the
original
\
a
seven-foot
post-and-rail
fence
along
the
line
dividing
his
lot
with
the
teacher'
'
s,
\
plan
say
no
house
or
structure
of
any
kind
shall
be
built
on
the
aforementioned
\
and
along
the
center
of
the
area
subject
to
the
franchise
right.
Although
the
\
strip
of
land
running
through
the
said
blocks.
Hence
the
court
will
most
likely
\
teacher
objected
to
its
construction,
the
fence
was
completed.
If
the
teacher
\
grant
relief
because
the
fence
violated
the
restriction
in
the
original
plan.
\
seeks
a
mandatory
injunction
to
compel
removal
of
the
librarian'
'
s
fence,
the
\
The
answer
is
(B).
\n\n
Q:
Judge
took
judicial
notice
of
some
facts
at
the
beginning
\
court
will
most
likely
\
of
the
trial.
Which
of
the
following
is
not
an
appropriate
kind
of
fact
for
judicial
\
\
notice?
\n
(A)
Indisputable
facts.
(B)
Facts
that
have
been
asserted
by
individual
\
(A)
grant
relief,
because
the
fence
was
in
violation
of
the
easement
restriction.
\
political
organizations.
(C)
Facts
recognized
to
be
true
by
common
knowledge.
\
(B)
grant
relief,
because
the
encroachment
of
the
fence
violated
the
restriction
\
(D)
Facts
capable
of
scientific
verification.
\n
A:
Let's
think
step
by
step.
We
\
in
the
original
plan.
(C)
deny
relief,
because
the
teacher
failed
to
enforce
\
refer
to
Wikipedia
articles
on
law
for
help.
Among
the
options,
facts
that
have
\
the
restriction
against
the
retiree.
(D)
deny
relief,
because
the
fence
would
\
been
asserted
by
individual
political
organizations
is
not
an
appropriate
kind
\
not
be
construed
as
"a
structure"
within
the
terms
of
the
restriction.
'
\
of
fact
for
judicial
notice.
The
answer
is
(B).
\n\n
Q:
A
state
legislature
has
\
target
:
Let's think step by step. We refer to Wikipedia articles on law for help.
\
recently
enacted
a
statute
making
it
a
misdemeanor
to
curse
or
revile
or
use
obscene
\
The restrictions in the original plan say no house or structure of any kind
\
or
opprobrious
language
toward
or
in
reference
to
a
police
officer
perfonning
\
shall be built on the aforementioned strip of land running through the said
\
his
duties.
A
student
at
a
state
university
organized
a
demonstration
on
campus
\
blocks. Hence the court will most likely grant relief because the fence violated
\
to
protest
the
war.
The
rally
was
attended
by
a
group
of
50
students
who
shouted
\
the restriction in the original plan. The answer is (B).
\
anti-war
messages
at
cars
passing
by.
To
show
his
contempt
for
the
United
States,
\
-
question
:
'
Judge
took
judicial
notice
of
some
facts
at
the
beginning
of
the
trial.
\
the
student
sewed
the
American
flag
to
the
rear
of
his
jeans.
When
a
police
officer
\
Which
of
the
following
is
not
an
appropriate
kind
of
fact
for
judicial
notice?
\
saw
the
flag
sown
on
the
student's
jeans,
he
approached
and
told
him
to
remove
\
\
the
flag
or
he
would
be
placed
under
arrest.
The
student
became
angered
and
shouted
\
(A)
Indisputable
facts.
(B)
Facts
that
have
been
asserted
by
individual
political
\
at
the
police
officer,
\"
Listen,
you
bastard,
I'll
wear
this
rag
anywhere
I
please.
\
organizations.
(C)
Facts
recognized
to
be
true
by
common
knowledge.
(D)
Facts
\ \"
The
student
was
subsequently
placed
under
arrest
and
charged
with
violating
\
capable
of
scientific
verification.'
\
the
state
statute.
The
student
subsequently
brings
suit
in
state
court
challenging
\
target
:
Let's think step by step. We refer to Wikipedia articles on law for help.
\
the
constitutionality
of
the
statute.
The
strongest
constitutional
argument
for
\
Among the options, facts that have been asserted by individual political organizations
\
the
student
is
that
\n
(A)
the
statute
is
void
for
vagueness
under
the
Fourteenth
\
is not an appropriate kind of fact for judicial notice. The answer is (B).
\
Amendment's
due
process
clause.
(B)
the
statute
is
invalid
because
it
violates
\
-
question
:
'
A
state
legislature
has
recently
enacted
a
statute
making
it
a
misdemeanor
\
the
petitioner's
freedom
of
speech
under
the
First
Amendment.
(C)
the
statute
\
to
curse
or
revile
or
use
obscene
or
opprobrious
language
toward
or
in
reference
\
is
an
abridgment
of
freedom
of
speech
under
the
First
Amendment
because
less
restrictive
\
to
a
police
officer
performing
his
duties.
A
student
at
a
state
university
organized
\
means
are
available
for
achieving
the
same
purpose.
(D)
the
statute
is
overbroad
\
a
demonstration
on
campus
to
protest
the
war.
The
rally
was
attended
by
a
group
\
and
consequently
invalid
under
the
First
and
Fourteenth
Amendments.
\n
A:
Let's
\
of
50
students
who
shouted
anti-war
messages
at
cars
passing
by.
To
show
his
\
think
step
by
step.
We
refer
to
Wikipedia
articles
on
law
for
help.
The
Fourteenth
\
contempt
for
the
United
States,
the
student
sewed
the
American
flag
to
the
rear
\
Amendment
further
supports
the
First
Amendment
by
establishing
a
due
process
clause.
\
of
his
jeans.
When
a
police
officer
saw
the
flag
sewn
on
the
student'
'
s
jeans,
\
Hence
the
strongest
argument
should
be
the
statute
is
overbroad
and
consequently
\
he
approached
and
told
him
to
remove
the
flag
or
he
would
be
placed
under
arrest.
\
invalid
under
the
First
and
Fourteenth
Amendments.
The
answer
is
(D).
\n\n
"
The
student
became
angered
and
shouted
at
the
police
officer,
"Listen,
you
bastard,
"
group"
:
"
mmlu_flan_cot_fewshot_humanities"
I'
'
ll
wear
this
rag
anywhere
I
please.
"
The
student
was
subsequently
placed
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
under
arrest
and
charged
with
violating
the
state
statute.
The
student
subsequently
"
task"
:
"
mmlu_flan_cot_fewshot_professional_law"
      brings suit in state court challenging the constitutionality of the statute.
      The strongest constitutional argument for the student is that

      (A) the statute is void for vagueness under the Fourteenth Amendment''s due
      process clause. (B) the statute is invalid because it violates the petitioner''s
      freedom of speech under the First Amendment. (C) the statute is an abridgment
      of freedom of speech under the First Amendment because less restrictive means
      are available for achieving the same purpose. (D) the statute is overbroad and
      consequently invalid under the First and Fourteenth Amendments.'
    target: 'Let''s think step by step. We refer to Wikipedia articles on law for
      help. The Fourteenth Amendment further supports the First Amendment by establishing
      a due process clause. Hence the strongest argument should be the statute is
      overbroad and consequently invalid under the First and Fourteenth Amendments.
      The answer is (D).'
group: mmlu_flan_cot_fewshot_humanities
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_law
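The new side of this file moves the worked examples out of the single `description` string and into a `fewshot_config` block (`sampler: first_n` plus a `samples` list of `question`/`target` pairs). The following is a minimal Python sketch of how such a list could be rendered back into a `Q: ... A: ...` prompt. The helper names `first_n` and `build_prompt` and the exact prompt layout are assumptions made for illustration; the harness's real rendering is governed by `_mmlu_flan_cot_fewshot_template_yaml`, which this diff does not show.

```python
from typing import Dict, List


def first_n(samples: List[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Return the first n samples (an illustrative stand-in for `sampler: first_n`)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return samples[:n]


def build_prompt(
    description: str,
    samples: List[Dict[str, str]],
    n: int,
    new_question: str,
) -> str:
    """Join the description, n worked examples, and the unanswered question."""
    parts = [description.strip()]
    for sample in first_n(samples, n):
        # Each worked example keeps the old inline layout: question, then target.
        parts.append(f"Q: {sample['question']}\nA: {sample['target']}")
    # The final question is left open for the model to complete.
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo_samples = [
        {"question": "First?\n(A) a (B) b", "target": "The answer is (A)."},
        {"question": "Second?\n(A) c (B) d", "target": "The answer is (B)."},
    ]
    print(build_prompt("Example description.", demo_samples, 1, "Third?\n(A) e (B) f"))
```

Keeping the examples as structured pairs means the number of shots can be changed without editing the description text.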
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_medicine.yaml
"dataset_name": "professional_medicine"
dataset_name: professional_medicine
"
description"
:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
description
:
The following are multiple choice questions (with answers) about professional
\
medicine.
\n\n
Q:
A
22-year-old
male
marathon
runner
presents
to
the
office
with
\
medicine.
\
the
complaint
of
right-sided
rib
pain
when
he
runs
long
distances.
Physical
examination
\
fewshot_config
:
\
reveals
normal
heart
and
lung
findings
and
an
exhalation
dysfunction
at
ribs 4-5
\
sampler
:
first_n
\
on
the
right.
Which
of
the
following
muscles
or
muscle
groups
will
be
most
useful
\
samples
:
\
in
correcting
this
dysfunction
utilizing
a
direct
method?
\n
(A)
anterior
scalene
\
-
question
:
"
A
22-year-old
male
marathon
runner
presents
to
the
office
with
the
complaint
\
\
(B)
latissimus
dorsi
(C)
pectoralis
minor
(D)
quadratus
lumborum
\n
A:
Let's
think
\
\
of
right-sided
rib
pain
when
he
runs
long
distances.
Physical
examination
\
\
step
by
step.
We
refer
to
Wikipedia
articles
on
medicine
for
help.
Among
the
options,
\
\
reveals
normal
heart
and
lung
findings
and
an
exhalation
dysfunction
at
ribs
\_\
\
only
pectoralis
minor
muscle
originates
from
the
outer
surfaces
of
the
3rd
to
5th
\
4-5
on
the
right.
Which
of
the
following
muscles
or
muscle
groups
will
be
most
\
\
ribs.
The
answer
is
(C).
\n\n
Q:
A
36-year-old
male
presents
to
the
office
with
\
\
useful
in
correcting
this
dysfunction
utilizing
a
direct
method?
\n
(A)
anterior
\
\
a 3-week history
of
low
back
pain.
He
denies
any
recent
trauma
but
says
that
he
\
\
scalene
(B)
latissimus
dorsi
(C)
pectoralis
minor
(D)
quadratus
lumborum"
\
climbs
in
and
out
of
his
truck
numerous
times
a
day
for
his
job.
Examination
of
\
target
:
Let's think step by step. We refer to Wikipedia articles on medicine for
\
the
patient
in
the
prone
position
reveals
a
deep
sacral
sulcus
on
the
left,
a
\
help. Among the options, only pectoralis minor muscle originates from the outer
\
posterior
inferior
lateral
angle
on
the
right,
and
a
lumbosacral
junction
that
\
surfaces of the 3rd to 5th ribs. The answer is (C).
\
springs
freely
on
compression.
The
most
likely
diagnosis
is
\n
(A)
left-on-left
\
-
question
:
"
A
36-year-old
male
presents
to
the
office
with
a
\_
3-week
\_
history
of
low
\
\
sacral
torsion
(B)
left-on-right
sacral
torsion
(C)
right
unilateral
sacral
flexion
\
\
back
pain.
He
denies
any
recent
trauma
but
says
that
he
climbs
in
and
out
\
\
(D)
right-on-right
sacral
torsion
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
\
of
his
truck
numerous
times
a
day
for
his
job.
Examination
of
the
patient
\
\
articles
on
medicine
for
help.
The
deep
sulcus
on
the
left,
a
posterior
ILA
on
\
\
in
the
prone
position
reveals
a
deep
sacral
sulcus
on
the
left,
a
posterior
\
\
the
right,
with
a
negative
spring
test
suggests
a
right-on-right
sacral
torsion.
\
\
inferior
lateral
angle
on
the
right,
and
a
lumbosacral
junction
that
springs
\
\
All
other
options
have
a
deep
sulcus
on
the
right.
The
answer
is
(D).
\n\n
Q:
A
\
\
freely
on
compression.
The
most
likely
diagnosis
is
\n
(A)
left-on-left
sacral
\
\
44-year-old
man
comes
to
the
office
because
of
a
3-day
history
of
sore
throat,
\
\
torsion
(B)
left-on-right
sacral
torsion
(C)
right
unilateral
sacral
flexion
\
\
nonproductive
cough,
runny
nose,
and
frontal
headache.
He
says
the
headache
is
\
\
(D)
right-on-right
sacral
torsion"
\
worse
in
the
morning
and
ibuprofen
does
provide
some
relief.
He
has
not
had
shortness
\
target
:
Let's think step by step. We refer to Wikipedia articles on medicine for
\
of
breath.
Medical
history
is
unremarkable.
He
takes
no
medications
other
than
\
help. The deep sulcus on the left, a posterior ILA on the right, with a negative
\
the
ibuprofen
for
pain.
Vital
signs
are
temperature
37.4°C
(99.4°F),
pulse
88/min,
\
spring test suggests a right-on-right sacral torsion. All other options have
\
respirations
18/min,
and
blood
pressure
120/84
mm
Hg.
Examination
of
the
nares
\
a deep sulcus on the right. The answer is (D).
\
shows
erythematous
mucous
membranes.
Examination
of
the
throat
shows
erythema
\
-
question
:
"
A
44-year-old
man
comes
to
the
office
because
of
a
3-day
history
of
sore
\
\
and
follicular
lymphoid
hyperplasia
on
the
posterior
oropharynx.
There
is
no
palpable
\
\
throat,
nonproductive
cough,
runny
nose,
and
frontal
headache.
He
says
the
\
\
cervical
adenopathy.
Lungs
are
clear
to
auscultation.
Which
of
the
following
is
\
\
headache
is
worse
in
the
morning
and
ibuprofen
does
provide
some
relief.
He
\
\
the
most
likely
cause
of
this
patient's
symptoms?
\n
(A)
Allergic
rhinitis
(B)
Epstein-Barr
\
\
has
not
had
shortness
of
breath.
Medical
history
is
unremarkable.
He
takes
\
\
virus
(C)
Mycoplasma
pneumonia
(D)
Rhinovirus
\n
A:
Let's
think
step
by
step.
We
\
\
no
medications
other
than
the
ibuprofen
for
pain.
Vital
signs
are
temperature
\
\
refer
to
Wikipedia
articles
on
medicine
for
help.
The
symptoms,
especially
the
\
\
37.4
\xB0
C
(99.4
\xB0
F),
pulse
88/min,
respirations
18/min,
and
blood
pressure
\
\
headache,
suggest
that
the
most
likely
cause
is
Rhinovirus.
Epstein-Barr
virus
\
\
120/84
mm
Hg.
Examination
of
the
nares
shows
erythematous
mucous
membranes.
\
\
will
cause
swollen
lymph
nodes
but
there
is
no
palpable
cervical
adenopathy.
Lungs
\
\
Examination
of
the
throat
shows
erythema
and
follicular
lymphoid
hyperplasia
\
\
are
clear
to
auscultation
suggests
it's
not
Mycoplasma
pneumonia.
The
answer
is
\
\
on
the
posterior
oropharynx.
There
is
no
palpable
cervical
adenopathy.
Lungs
\
\
(D).
\n\n
Q:
A
previously
healthy
32-year-old
woman
comes
to
the
physician
8
months
\
\
are
clear
to
auscultation.
Which
of
the
following
is
the
most
likely
cause
\
\
after
her
husband
was
killed
in
a
car
crash.
Since
that
time,
she
has
had
a
decreased
\
\
of
this
patient's
symptoms?
\n
(A)
Allergic
rhinitis
(B)
Epstein-Barr
virus
\
\
appetite
and
difficulty
falling
asleep.
She
states
that
she
is
often
sad
and
cries
\
\
(C)
Mycoplasma
pneumonia
(D)
Rhinovirus"
\
frequently.
She
has
been
rechecking
the
door
lock
five
times
before
leaving
her
\
target
:
Let's think step by step. We refer to Wikipedia articles on medicine for
\
house
and
has
to
count
exactly
five
pieces
of
toilet
paper
before
she
uses
it.
\
help. The symptoms, especially the headache, suggest that the most likely cause
\
She
says
that
she
has
always
been
a
perfectionist
but
these
urges
and
rituals
\
is Rhinovirus. Epstein-Barr virus will cause swollen lymph nodes but there is
\
are
new.
Pharmacotherapy
should
be
targeted
to
which
of
the
following
neurotransmitters?
\n\
no palpable cervical adenopathy. Lungs are clear to auscultation suggests it's
(A)
Dopamine
(B)
Glutamate
(C)
Norepinephrine
(D)
Serotonin
\n
A:
Let's
think
step
\
not Mycoplasma pneumonia. The answer is (D).
\
by
step.
We
refer
to
Wikipedia
articles
on
medicine
for
help.
The
patient
feels
\
-
question
:
'
A
previously
healthy
32-year-old
woman
comes
to
the
physician
8
months
\
sad
and
among
the
options,
only
Dopamine
and
Serotonin
can
help
increase
positive
\
after
her
husband
was
killed
in
a
car
crash.
Since
that
time,
she
has
had
a
\
emotions.
Serotonin
also
affects
digestion
and
metabolism,
which
can
help
the
\
decreased
appetite
and
difficulty
falling
asleep.
She
states
that
she
is
often
\
patient's
decreased
appetite
and
sleep
difficulty.
The
answer
is
(D).
\n\n
Q:
A
\
sad
and
cries
frequently.
She
has
been
rechecking
the
door
lock
five
times
before
\
42-year-old
man
comes
to
the
office
for
preoperative
evaluation
prior
to
undergoing
\
leaving
her
house
and
has
to
count
exactly
five
pieces
of
toilet
paper
before
\
adrenalectomy
scheduled
in
2
weeks.
One
month
ago,
he
received
care
in
the
emergency
\
she
uses
it.
She
says
that
she
has
always
been
a
perfectionist
but
these
urges
\
department
for
pain
over
his
right
flank
following
a
motor
vehicle
collision.
\
and
rituals
are
new.
Pharmacotherapy
should
be
targeted
to
which
of
the
following
\
At
that
time,
blood
pressure
was
160/100
mm
Hg
and
CT
scan
of
the
abdomen
showed
\
neurotransmitters?
\
an
incidental
10-cm
left
adrenal
mass.
Results
of
laboratory
studies,
including
\
\
complete
blood
count,
serum
electrolyte
concentrations,
and
liver
function
tests,
\
(A)
Dopamine
(B)
Glutamate
(C)
Norepinephrine
(D)
Serotonin'
\
were
within
the
reference
ranges.
The
patient
otherwise
had
been
healthy
and
had
\
target
:
Let's think step by step. We refer to Wikipedia articles on medicine for
\
never
been
told
that
he
had
elevated
blood
pressure.
He
takes
no
medications.
\
help. The patient feels sad and among the options, only Dopamine and Serotonin
\
A
follow-up
visit
in
the
office
2
weeks
ago
disclosed
elevated
urinary
normetanephrine
\
can help increase positive emotions. Serotonin also affects digestion and metabolism,
\
and
metanephrine
and
plasma
aldosterone
concentrations.
The
patient
was
referred
\
which can help the patient's decreased appetite and sleep difficulty. The answer
\
to
a
surgeon,
who
recommended
the
adrenalectomy.
Today,
vital
signs
are
temperature
\
is (D).
\
36.6°C
(97.9°F),
pulse
100/min,
respirations
14/min,
and
blood
pressure
170/95
\
-
question
:
"
A
42-year-old
man
comes
to
the
office
for
preoperative
evaluation
prior
\
\
mm
Hg.
Physical
examination
discloses
no
significant
findings.
Initial
preoperative
\
\
to
undergoing
adrenalectomy
scheduled
in
2
weeks.
One
month
ago,
he
received
\
\
preparation
should
include
treatment
with
which
of
the
following?
\n
(A)
Labetalol
\
\
care
in
the
emergency
department
for
pain
over
his
right
flank
following
a
\
\
(B)
A
loading
dose
of
potassium
chloride
(C)
Nifedipine
(D)
Phenoxybenzamine
\n\
\
motor
vehicle
collision.
At
that
time,
blood
pressure
was
160/100
mm
Hg
and
\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
medicine
for
help.
\
\
CT
scan
of
the
abdomen
showed
an
incidental
10-cm
left
adrenal
mass.
Results
\
\
The
symptoms
and
the
adrenal
mass
suggested
pheochromocytoma,
and
the
blood
pressure
\
\
of
laboratory
studies,
including
complete
blood
count,
serum
electrolyte
concentrations,
\
\
indicates
hypertension.
Phenoxybenzamine
is
used
to
treat
hypertension
caused
\
\
and
liver
function
tests,
were
within
the
reference
ranges.
The
patient
otherwise
\
\
by
pheochromocytoma.
The
answer
is
(D).
\n\n
"
\
had
been
healthy
and
had
never
been
told
that
he
had
elevated
blood
pressure.
\
"
group"
:
"
mmlu_flan_cot_fewshot_other"
\
He
takes
no
medications.
A
follow-up
visit
in
the
office
2
weeks
ago
disclosed
\
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
\
elevated
urinary
normetanephrine
and
metanephrine
and
plasma
aldosterone
concentrations.
\
"
task"
:
"
mmlu_flan_cot_fewshot_professional_medicine"
      \ The patient was referred to a surgeon, who recommended the adrenalectomy.\
      \ Today, vital signs are temperature 36.6\xB0C (97.9\xB0F), pulse 100/min, respirations\
      \ 14/min, and blood pressure 170/95 mm Hg. Physical examination discloses no\
      \ significant findings. Initial preoperative preparation should include treatment\
      \ with which of the following?\n(A) Labetalol (B) A loading dose of potassium\
      \ chloride (C) Nifedipine (D) Phenoxybenzamine"
    target: 'Let''s think step by step. We refer to Wikipedia articles on medicine
      for help. The symptoms and the adrenal mass suggested pheochromocytoma, and
      the blood pressure indicates hypertension. Phenoxybenzamine is used to treat
      hypertension caused by pheochromocytoma. The answer is (D).'
group: mmlu_flan_cot_fewshot_other
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_medicine
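On the old side of this file, every worked example lives inside one `description` string, separated by `\n\nQ: ` and `\nA: `. A small Python sketch of how such a string could be split into the `question`/`target` pairs used on the new side is given below. The function name `split_description` is hypothetical, and this is not claimed to be the script that produced the commit; it only illustrates the correspondence between the two layouts.

```python
from typing import Dict, List, Tuple


def split_description(description: str) -> Tuple[str, List[Dict[str, str]]]:
    """Split an old-style description into (header, list of question/target pairs)."""
    # The header is everything before the first "Q: " block.
    header, *blocks = description.strip().split("\n\nQ: ")
    samples = []
    for block in blocks:
        # The answer starts at the first "\nA: "; option lines use "(A)", so they do not match.
        question, sep, target = block.partition("\nA: ")
        if not sep:
            raise ValueError("block has no 'A: ' section")
        samples.append({"question": question.strip(), "target": target.strip()})
    return header.strip(), samples


if __name__ == "__main__":
    old = (
        "Example description.\n\n"
        "Q: First?\n(A) a (B) b\nA: Let's think step by step. The answer is (A).\n\n"
    )
    print(split_description(old))
```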
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_psychology.yaml
"dataset_name": "professional_psychology"
dataset_name: professional_psychology
"
description"
:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
description
:
The following are multiple choice questions (with answers) about professional
\
psychology.
\n\n
Q:
In
the
construction
of
a
multiple
regression
equation
for
purposes
\
psychology.
\
of
prediction,
the
optimal
combination
of
measures
is
one
in
which
the
predictors
\n\
fewshot_config
:
(A)
are
uncorrelated
with
each
other
but
are
moderately
correlated
with
the
criterion
\
sampler
:
first_n
\
(B)
have
low
correlations
with
each
other
and
low
correlations
with
the
criterion
\
samples
:
\
(C)
are
highly
intercorrelated
with
each
other
and
moderately
correlated
with
\
-
question
:
'
In
the
construction
of
a
multiple
regression
equation
for
purposes
of
\
the
criterion
(D)
have
low
correlations
with
the
criterion
but
are
moderately
\
prediction,
the
optimal
combination
of
measures
is
one
in
which
the
predictors
\
correlated
with
each
other
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
\
articles
on
psychology
for
help.
The
basis
of
multiple
regression
is
to
assess
\
(A)
are
uncorrelated
with
each
other
but
are
moderately
correlated
with
the
\
the
relationship
between
one
continuous
variable
and
a
set
of
independent
variables.
\
criterion
(B)
have
low
correlations
with
each
other
and
low
correlations
with
\
So
the
predictors
should
be
uncorrelated
with
each
other
but
are
moderately
correlated
\
the
criterion
(C)
are
highly
intercorrelated
with
each
other
and
moderately
\
with
the
criterion.
The
answer
is
(A).
\n\n
Q:
There
are
three
ways
to
measure
the
\
correlated
with
the
criterion
(D)
have
low
correlations
with
the
criterion
but
\
Central
Tendency:
the
Mean,
the
Median
and
the
Mode.
From
your
knowledge
about
\
are
moderately
correlated
with
each
other'
\
them,
what
is
the
mode?
\n
(A)
less
sensitive
to
extreme
scores
than
the
mean
(B)
\
target
:
Let's think step by step. We refer to Wikipedia articles on psychology
\
more
useful
for
skewed
distributions
(C)
sensitive
to
extreme
values
and
highly
\
for help. The basis of multiple regression is to assess the relationship between
\
skewed
distributions
(D)
the
most
frequently
occurring
number
\n
A:
Let's
think
\
one continuous variable and a set of independent variables. So the predictors
\
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
for
help.
The
definition
\
should be uncorrelated with each other but are moderately correlated with the
\
of
mode
is
the
most
frequently
occurring
number.
The
answer
is
(D).
\n\n
Q:
Carl
\
criterion. The answer is (A).
\
Jung
believed
that
a
client's
transference:
\n
(A)
is
a
fantasy
that
distracts
the
\
-
question
:
'
There
are
three
ways
to
measure
the
Central
Tendency:
the
Mean,
the
Median
\
client
from
reality.
(B)
represents
“mixed
feelings”
toward
the
therapist.
(C)
\
and
the
Mode.
From
your
knowledge
about
them,
what
is
the
mode?
\ \"
is
a
form
of
\"\"
acting
out.
\"\"\"
(D)
reflects
the
client’s
personal
and
collective
\
\
unconscious.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
\
(A)
less
sensitive
to
extreme
scores
than
the
mean
(B)
more
useful
for
skewed
\
for
help.
Transference
is
a
phenomenon
that
a
person's
feelings
are
unconsciously
\
distributions
(C)
sensitive
to
extreme
values
and
highly
skewed
distributions
\
redirected,
so
it
reflects
the
client's
personal
and
collective
unconscious.
The
\
(D)
the
most
frequently
occurring
number'
\
answer
is
(D).
\n\n
Q:
In
terms
of
Hofstede’s
(1980)
five
cultural
dimensions,
the
\
target
:
Let's think step by step. We refer to Wikipedia articles on psychology
\
United
States
scores
at
the
top
of
the
scale
on:
\n
(A)
individualism.
(B)
individualism
\
for help. The definition of mode is the most frequently occurring number. The
\
and
power
distance.
(C)
power
distance
and
masculinity.
(D)
uncertainty
avoidance.
\n\
answer is (D).
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
for
help.
\
-
question
:
"
Carl
Jung
believed
that
a
client's
transference:
\n
(A)
is
a
fantasy
that
\
\
US
scores
highest
on
individualism
among
the
five
cultural
dimensions.
The
answer
\
\
distracts
the
client
from
reality.
(B)
represents
\u201C
mixed
feelings
\u201D\
\
is
(A).
\n\n
Q:
One
of
your
therapy
clients
asks
your
advice
about
a
good
weight-
\
\
toward
the
therapist.
(C)
\"
is
a
form
of
\"\"
acting
out.
\"\"\"
(D)
reflects
\
\
reduction
program.
You
have
investigated
the
programs
in
the
community
and
are
\
\
the
client
\u2019
s
personal
and
collective
unconscious."
\
enrolled
in
the
one
you
consider
the
best.
This
program
offers
a
$50
bonus
to
\
target
:
Let's think step by step. We refer to Wikipedia articles on psychology
\
its
patrons
for
each
new
person
they
bring
into
the
program.
Under
these
circumstances,
\
for help. Transference is a phenomenon that a person's feelings are unconsciously
\
your
most
appropriate
response
would
be
to
\n
(A)
tell
your
client
the
pros
and
\
redirected, so it reflects the client's personal and collective unconscious.
\
cons
of
each
program
you
know
about
except
for
the
one
in
which
you
are
enrolled
\
The answer is (D).
\
(B)
recommend
to
your
client
the
program
in
which
you
are
enrolled
and
explain
\
-
question
:
"
In
terms
of
Hofstede
\u2019
s
(1980)
five
cultural
dimensions,
the
United
\
\
the
$50
bonus
you
will
receive
(C)
recommend
to
your
client
the
program
in
which
\
\
States
scores
at
the
top
of
the
scale
on:
\n
(A)
individualism.
(B)
individualism
\
\
you
are
enrolled
and
offer
to
have
the
$50
bonus
credited
to
your
client's
account
\
\
and
power
distance.
(C)
power
distance
and
masculinity.
(D)
uncertainty
avoidance."
\
in
the
program
(D)
tell
your
client
the
pros
and
cons
of
each
program
you
know
\
target
:
Let's think step by step. We refer to Wikipedia articles on psychology
\
about,
but
do
not
claim
the
$50
bonus
if
your
client
enrolls
in
your
program
\n\
for help. US scores highest on individualism among the five cultural dimensions.
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
for
help.
\
The answer is (A).
\
Based
on
the
circumstances,
you
should
tell
your
client
about
the
pros
and
cons
\
-
question
:
'
One
of
your
therapy
clients
asks
your
advice
about
a
good
weight-
reduction
\
of
each
program,
but
it
would
be
inappropriate
to
receive
the
bonus,
so
you
should
\
program.
You
have
investigated
the
programs
in
the
community
and
are
enrolled
\
not
claim
the
$50
bonus.
The
answer
is
(D).
\n\n
"
in
the
one
you
consider
the
best.
This
program
offers
a
$50
bonus
to
its
patrons
"
group"
:
"
mmlu_flan_cot_fewshot_social_sciences"
for
each
new
person
they
bring
into
the
program.
Under
these
circumstances,
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
your
most
appropriate
response
would
be
to
"
task"
:
"
mmlu_flan_cot_fewshot_professional_psychology"

      (A) tell your client the pros and cons of each program you know about except
      for the one in which you are enrolled (B) recommend to your client the program
      in which you are enrolled and explain the $50 bonus you will receive (C) recommend
      to your client the program in which you are enrolled and offer to have the $50
      bonus credited to your client''s account in the program (D) tell your client
      the pros and cons of each program you know about, but do not claim the $50 bonus
      if your client enrolls in your program'
    target: 'Let''s think step by step. We refer to Wikipedia articles on psychology
      for help. Based on the circumstances, you should tell your client about the
      pros and cons of each program, but it would be inappropriate to receive the
      bonus, so you should not claim the $50 bonus. The answer is (D).'
group: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_psychology
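Every `target` in these samples ends with the phrase `The answer is (X).`, where `X` is one of the option letters. A minimal Python sketch of pulling that letter out of a chain-of-thought completion is shown below. The regular expression and the function name `extract_answer` are assumptions for illustration; the harness's actual answer filter is defined in the shared template, which is not part of this diff.

```python
import re
from typing import Optional

# Matches the closing phrase used by the targets in these files, e.g. "The answer is (D)".
ANSWER_RE = re.compile(r"The answer is \(([A-D])\)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the last option letter announced in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    # Use the last match so that earlier mentions inside the reasoning are ignored.
    return matches[-1] if matches else None


if __name__ == "__main__":
    text = (
        "Let's think step by step. The definition of mode is the most "
        "frequently occurring number. The answer is (D)."
    )
    print(extract_answer(text))
```

Taking the last match is a design choice: a completion may restate an option while reasoning before it commits to a final answer.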
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_public_relations.yaml
"dataset_name": "public_relations"
dataset_name: public_relations
"
description"
:
"
The
following
are
multiple
choice
questions
(with
answers)
about
public
\
description
:
The following are multiple choice questions (with answers) about public
\
relations.
\n\n
Q:
Earth
Hour
was
a
campaign
launched
by
which
organization?
\n
(A)
\
relations.
\
Greenpeace
(B)
The
UN
(C)
Oxfam
(D)
World
Wildlife
Fund
\n
A:
Let's
think
step
by
\
fewshot_config
:
\
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
Earth
Hour
\
sampler
:
first_n
\
is
a
worldwide
movement
organized
and
launched
by
the
World
Wildlife
Fund.
The
answer
\
samples
:
\
is
(D).
\n\n
Q:
In
issues
management,
what
is
the
most
proactive
approach
to
addressing
\
-
question
:
'
Earth
Hour
was
a
campaign
launched
by
which
organization?
\
negative
or
misleading
information
posted
online
about
your
organization?
\n
(A)
\
\
Buy
domain
names
that
could
be
used
by
opposition
groups.
(B)
Post
anonymous
comments
\
(A)
Greenpeace
(B)
The
UN
(C)
Oxfam
(D)
World
Wildlife
Fund'
\
on
blogs
to
combat
this
information.
(C)
Prepare
a
news
release
that
discredits
\
target
:
Let's think step by step. We refer to Wikipedia articles on public relations
\
the
inaccurate
information.
(D)
Make
policy
changes
to
address
complaints
highlighted
\
for help. Earth Hour is a worldwide movement organized and launched by the World
\
on
these
sites.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
\
Wildlife Fund. The answer is (D).
\
public
relations
for
help.
In
issues
management,
the
most
proactive
approach
to
\
-
question
:
'
In
issues
management,
what
is
the
most
proactive
approach
to
addressing
\
addressing
negative
or
misleading
information
posted
online
is
to
make
policy
\
negative
or
misleading
information
posted
online
about
your
organization?
\
changes
to
address
complaints
highlighted
on
those
sites.
The
answer
is
(D).
\n\
\n
Q:
At
which
stage
in
the
planning
process
would
a
situation
analysis
be
carried
\
(A)
Buy
domain
names
that
could
be
used
by
opposition
groups.
(B)
Post
anonymous
\
out?
\n
(A)
Defining
the
program
(B)
Planning
the
program
(C)
Taking
action
and
\
comments
on
blogs
to
combat
this
information.
(C)
Prepare
a
news
release
that
\
implementing
ideas
(D)
Evaluation
of
the
program
\n
A:
Let's
think
step
by
step.
\
discredits
the
inaccurate
information.
(D)
Make
policy
changes
to
address
complaints
\
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
Situation
analyses
\
highlighted
on
these
sites.'
\
are
typically
carried
out
during
the
planning
process
stage
of
defining
the
program.
\
target
:
Let's think step by step. We refer to Wikipedia articles on public relations
\
The
answer
is
(A).
\n\n
Q:
Which
of
these
statements
is
true
of
the
Vatican
in
2010
\
for help. In issues management, the most proactive approach to addressing negative
\
at
the
time
of
the
accusations
of
child
abuse
cover-ups?
\n
(A)
There
was
a
coordinated
\
or misleading information posted online is to make policy changes to address
\
media
response.
(B)
Consistent
messages
were
communicated.
(C)
Criticisms
were
\
complaints highlighted on those sites. The answer is (D).
\
taken
as
attacks
on
the
Catholic
Church.
(D)
The
credibility
of
the
Vatican
was
\
-
question
:
'
At
which
stage
in
the
planning
process
would
a
situation
analysis
be
carried
\
upheld.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
\
out?
\
relations
for
help.
In
2010
when
there
were
accusations
of
child
abuse
cover-ups,
\
\
the
Vatican
took
those
criticisms
as
attacks
on
the
Catholic
Church.
The
answer
\
(A)
Defining
the
program
(B)
Planning
the
program
(C)
Taking
action
and
implementing
\
is
(C).
\n\n
Q:
What
should
a
public
relations
media
practitioner
do
if
she
does
\
ideas
(D)
Evaluation
of
the
program'
\
not
know
the
answer
to
a
reporter's
question?
\n
(A)
Give
the
reporter
other
information
\
target
:
Let's think step by step. We refer to Wikipedia articles on public relations
\
she
is
certain
is
correct.
(B)
Say
that
the
information
is
'off
the
record'
and
\
for help. Situation analyses are typically carried out during the planning process
\
will
be
disseminated
later.
(C)
Say
'I
don't
know'
and
promise
to
provide
the
\
stage of defining the program. The answer is (A).
\
information
later.
(D)
Say
'no
comment,'
rather
than
appear
uninformed.
\n
A:
Let's
\
-
question
:
'
Which
of
these
statements
is
true
of
the
Vatican
in
2010
at
the
time
of
\
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
\
the
accusations
of
child
abuse
cover-ups?
\
If
a
public
relations
media
practitioner
does
not
know
the
answer
to
a
reporter's
\
\
question,
they
should
say
'I
don't
know'
and
offer
to
provide
the
information
\
(A)
There
was
a
coordinated
media
response.
(B)
Consistent
messages
were
communicated.
\
later.
The
answer
is
(C).
\n\n
"
(C)
Criticisms
were
taken
as
attacks
on
the
Catholic
Church.
(D)
The
credibility
"
group"
:
"
mmlu_flan_cot_fewshot_social_sciences"
of
the
Vatican
was
upheld.'
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
target
:
Let's think step by step. We refer to Wikipedia articles on public relations
"
task"
:
"
mmlu_flan_cot_fewshot_public_relations"
for help. In 2010 when there were accusations of child abuse cover-ups, the
Vatican took those criticisms as attacks on the Catholic Church. The answer
is (C).
  - question: 'What should a public relations media practitioner do if she does not know
      the answer to a reporter''s question?

      (A) Give the reporter other information she is certain is correct. (B) Say that
      the information is ''off the record'' and will be disseminated later. (C) Say
      ''I don''t know'' and promise to provide the information later. (D) Say ''no
      comment,'' rather than appear uninformed.'
    target: 'Let''s think step by step. We refer to Wikipedia articles on public relations
      for help. If a public relations media practitioner does not know the answer
      to a reporter''s question, they should say ''I don''t know'' and offer to provide
      the information later. The answer is (C).'
group: mmlu_flan_cot_fewshot_social_sciences
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_public_relations
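The new side of these files writes many strings as single-quoted YAML scalars, which is why apostrophes appear doubled (`reporter''s`, `''I don''t know''`). In a single-quoted YAML scalar the only escape is a doubled single quote. The Python sketch below illustrates that rule for one-line scalars only; the helper names are hypothetical, and line folding across multiple lines is deliberately not handled.

```python
def quote_single(text: str) -> str:
    """Wrap text as a one-line single-quoted YAML scalar, doubling each apostrophe."""
    return "'" + text.replace("'", "''") + "'"


def unquote_single(scalar: str) -> str:
    """Reverse quote_single for a one-line single-quoted YAML scalar."""
    if len(scalar) < 2 or scalar[0] != "'" or scalar[-1] != "'":
        raise ValueError("not a single-quoted scalar")
    # Inside the quotes, a doubled apostrophe stands for one literal apostrophe.
    return scalar[1:-1].replace("''", "'")


if __name__ == "__main__":
    original = "Say 'I don't know' and promise to provide the information later."
    quoted = quote_single(original)
    print(quoted)
    print(unquote_single(quoted) == original)
```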
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_security_studies.yaml
"dataset_name": "security_studies"
dataset_name: security_studies
"
description"
:
"
The
following
are
multiple
choice
questions
(with
answers)
about
security
\
description
:
The following are multiple choice questions (with answers) about security
\
studies.
\n\n
Q:
What
are
the
frameworks
of
analysis
within
which
terrorism
has
\
studies.
\
been
considered
(as
of
2020)?
\n
(A)
Competition
between
larger
nations
has
resulted
\
fewshot_config
:
\
in
some
countries
actively
supporting
terrorist
groups
to
undermine
the
strength
\
sampler
:
first_n
\
of
rival
states.
Terrorist
networks
are
extended
patronage
clubs
maintained
and
\
samples
:
\
paid
for
by
their
donor
states
and
are
conceptualised
as
being
like
state
actors,
\
-
question
:
'
What
are
the
frameworks
of
analysis
within
which
terrorism
has
been
considered
\
to
be
dealt
with
using
military
force.
(B)
Globalization
has
enabled
the
internationalization
\
(as
of
2020)?
\
of
terrorist
activities
by
opening
up
their
operational
space,
although
coordination
\
\
is
still
managed
from
a
geographical
base.
This
suggests
that
terrorist
groups
\
(A)
Competition
between
larger
nations
has
resulted
in
some
countries
actively
\
are
nationally
structured
which
means
that
terrorism
cannot
be
considered
in
terms
\
supporting
terrorist
groups
to
undermine
the
strength
of
rival
states.
Terrorist
\
of
a
war
to
be
defeated
militarily
without
having
serious
implications
on
the
\
networks
are
extended
patronage
clubs
maintained
and
paid
for
by
their
donor
\
indigenous
population.
(C)
Terrorism
can
be
viewed
as
a
problem
to
be
resolved
\
states
and
are
conceptualised
as
being
like
state
actors,
to
be
dealt
with
using
\
by
military
means
(war
on
terrorism),
by
normal
police
techniques
(terrorism
as
\
military
force.
(B)
Globalization
has
enabled
the
internationalization
of
terrorist
\
crime),
or
as
a
medical
problem
with
underlying
causes
and
symptoms
(terrorism
\
activities
by
opening
up
their
operational
space,
although
coordination
is
still
\
as
disease).
(D)
Terrorism
is
viewed
as
a
criminal
problem.
The
criminalization
\
managed
from
a
geographical
base.
This
suggests
that
terrorist
groups
are
nationally
\
of
terrorism
has
two
important
implications.
Firstly,
it
suggests
that
terrorism
\
structured
which
means
that
terrorism
cannot
be
considered
in
terms
of
a
war
\
can
be
eradicated
-
terrorists
can
be
caught
and
brought
to
trial
by
normal
judicial
\
to
be
defeated
militarily
without
having
serious
implications
on
the
indigenous
\
proceedings
thereby
removing
the
threat
from
society
-
and
secondly,
it
suggests
\
population.
(C)
Terrorism
can
be
viewed
as
a
problem
to
be
resolved
by
military
\
that
preventative
crime
techniques
are
applicable
to
prevent
its
development.
\n\
means
(war
on
terrorism),
by
normal
police
techniques
(terrorism
as
crime),
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
security
studies
\
or
as
a
medical
problem
with
underlying
causes
and
symptoms
(terrorism
as
disease).
\
for
help.
(A)
is
wrong
because
it
is
not
competition
between
larger
nations
that
\
(D)
Terrorism
is
viewed
as
a
criminal
problem.
The
criminalization
of
terrorism
\
causes
terrorism.
\n
(B)
is
wrong
because
globalization
is
not
the
cause
of
terrorism.
\n\
has
two
important
implications.
Firstly,
it
suggests
that
terrorism
can
be
eradicated
(C)
is
correct
because
the
US
undertook
the
war
on
terrorism.
\n
(D)
is
wrong
because
\
-
terrorists
can
be
caught
and
brought
to
trial
by
normal
judicial
proceedings
\
preventative
crime
techniques
will
likely
not
end
terrorism.
The
answer
is
(C).
\n\
thereby
removing
the
threat
from
society
-
and
secondly,
it
suggests
that
preventative
\n
Q:
Which
of
the
following
is
the
best
lens
through
which
to
investigate
the
role
\
crime
techniques
are
applicable
to
prevent
its
development.'
\
of
child
soldiers?
\n
(A)
Child
soldiers
are
victims
of
combat
that
need
re-education
\
target
:
"
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
security
\
\
and
rehabilitation.
(B)
Children
and
their
mothers
are
not
active
subjects
in
\
\
studies
for
help.
(A)
is
wrong
because
it
is
not
competition
between
larger
\
\
warfare
and
are
best
considered
as
subjects
in
the
private
sphere.
(C)
Children
\
\
nations
that
causes
terrorism.
\n
(B)
is
wrong
because
globalization
is
not
\
\
are
most
often
innocent
bystanders
in
war
and
are
best
used
as
signifiers
of
peace.
\
\
the
cause
of
terrorism.
\n
(C)
is
correct
because
the
US
undertook
the
war
on
\
\
(D)
Children
have
political
subjecthood
that
is
missed
when
they
are
considered
\
\
terrorism.
\n
(D)
is
wrong
because
preventative
crime
techniques
will
likely
\
\
as
passive
victims
of
warfare.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
\
not
end
terrorism.
The
answer
is
(C)."
\
articles
on
security
studies
for
help.
Child
soliders
as
a
political
topic
can
\
-
question
:
'
Which
of
the
following
is
the
best
lens
through
which
to
investigate
the
\
be
missed
when
they
are
considered
passive
victims
of
warfare.
The
answer
is
(D).
\n\
role
of
child
soldiers?
\n
Q:
How
can
we
best
describe
the
relationship
between
the
state-centric
approach
\
\
and
the
concept
of
human
security?
\n
(A)
There
are
such
wide
divisions
within
the
\
(A)
Child
soldiers
are
victims
of
combat
that
need
re-education
and
rehabilitation.
\
human
security
framework
regarding
the
nature
of
threats
and
referent
objects
\
(B)
Children
and
their
mothers
are
not
active
subjects
in
warfare
and
are
best
\
that
no
widely
applicable
comparisons
between
state-centric
approaches
and
human
\
considered
as
subjects
in
the
private
sphere.
(C)
Children
are
most
often
innocent
\
security
can
be
drawn.
(B)
By
adopting
the
framework
of
human
security,
the
limitations
\
bystanders
in
war
and
are
best
used
as
signifiers
of
peace.
(D)
Children
have
\
of
the
realist
state-centric
approach
become
evident.
Whilst
human
security
defines
\
political
subjecthood
that
is
missed
when
they
are
considered
as
passive
victims
\
the
referent
object
as
the
person
or
population,
state-centric
approaches
prioritise
\
of
warfare.'
\
the
security
of
the
state,
de-prioritizing
the
pursuit
of
human
security.
(C)
\
target
:
Let's think step by step. We refer to Wikipedia articles on security studies
\
The
state-centric
approach
to
security
is
a
faction
of
human
security,
usually
\
for help. Child soliders as a political topic can be missed when they are considered
\
defined
within
the
broad
school
of
human
security.
By
being
state-centric
this
\
passive victims of warfare. The answer is (D).
\
approach
prioritises
the
individual
as
the
referent
object
in
security
studies.
\
-
question
:
'
How
can
we
best
describe
the
relationship
between
the
state-centric
approach
\
(D)
Both
the
state-centric
and
human-centric
approaches
to
security
are
mutually
\
and
the
concept
of
human
security?
\
exclusive
and
offer
a
sufficient
analytic
framework
with
which
to
understand
the
\
\
international
security
system.
It
is
therefore
the
role
of
security
analysts
to
\
(A)
There
are
such
wide
divisions
within
the
human
security
framework
regarding
\
determine
which
of
these
substantial
concepts
is
correct,
and
which
should
be
\
the
nature
of
threats
and
referent
objects
that
no
widely
applicable
comparisons
\
discarded.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
security
\
between
state-centric
approaches
and
human
security
can
be
drawn.
(B)
By
adopting
\
studies
for
help.
Human
security
focuses
on
a
person
or
population
whereas
state-centric
\
the
framework
of
human
security,
the
limitations
of
the
realist
state-centric
\
approaches
focus
on
the
state
while
deprioritizing
human
security.
The
answer
\
approach
become
evident.
Whilst
human
security
defines
the
referent
object
as
\
is
(B).
\n\n
Q:
In
order
to
become
securitized,
a
threat
must
be
presented
in
which
\
the
person
or
population,
state-centric
approaches
prioritise
the
security
of
\
of
these
ways?
\n
(A)
As
an
existential
threat
that
requires
immediate
and
extraordinary
\
the
state,
de-prioritizing
the
pursuit
of
human
security.
(C)
The
state-centric
\
action,
posing
a
threat
to
the
survival
of
the
state
or
to
societal
security.
\
approach
to
security
is
a
faction
of
human
security,
usually
defined
within
\
(B)
As
requiring
immediate
and
extraordinary
action
by
the
state,
threatening
\
the
broad
school
of
human
security.
By
being
state-centric
this
approach
prioritises
\
the
survival
of
a
referent
object
and
therefore
warranting
the
use
of
measures
\
the
individual
as
the
referent
object
in
security
studies.
(D)
Both
the
state-centric
\
not
normally
employed
in
the
political
realm.
(C)
As
an
urgent
threat
to
the
survival
\
and
human-centric
approaches
to
security
are
mutually
exclusive
and
offer
a
\
of
the
referent
object,
so
serious
that
it
legitimises
the
employment
of
extraordinary
\
sufficient
analytic
framework
with
which
to
understand
the
international
security
\
action
in
response.
(D)
As
an
urgent
threat
to
the
survival
of
the
audience
that
\
system.
It
is
therefore
the
role
of
security
analysts
to
determine
which
of
\
requires
extraordinary
or
emergency
measures.
\n
A:
Let's
think
step
by
step.
We
\
these
substantial
concepts
is
correct,
and
which
should
be
discarded.'
\
refer
to
Wikipedia
articles
on
security
studies
for
help.
To
be
securitized,
a
\
target
:
Let's think step by step. We refer to Wikipedia articles on security studies
\
threat
must
be
an
urgent
threat
to
the
survival
of
the
referent
object.
The
answer
\
for help. Human security focuses on a person or population whereas state-centric
\
is
(C).
\n\n
Q:
What
distinguishes
coercive
diplomacy
from
military
force?
\n
(A)
\
approaches focus on the state while deprioritizing human security. The answer
\
Compellence
is
another
term
for
coercive
diplomacy,
but
covering
a
narrower
set
\
is (B).
\
of
criteria;
compellence
covers
those
threats
aimed
at
initiating
adversary
action.
\
-
question
:
'
In
order
to
become
securitized,
a
threat
must
be
presented
in
which
of
\
A
threat
to
coerce
a
state
to
give
up
part
of
its
territory
would
count
as
coercive
\
these
ways?
\
diplomacy,
as
long
as
that
threat
proactively
initiates
action
before
reactive
\
\
diplomacy
is
taken.
(B)
Coercive
diplomacy
constitutes
the
threats
of
limited
\
(A)
As
an
existential
threat
that
requires
immediate
and
extraordinary
action,
\
force
to
induce
adversary's
incentive
to
comply
with
the
coercer's
demands.
It
\
posing
a
threat
to
the
survival
of
the
state
or
to
societal
security.
(B)
As
\
is
an
influence
strategy
that
is
intended
to
obtain
compliance:
the
use
of
force
\
requiring
immediate
and
extraordinary
action
by
the
state,
threatening
the
survival
\
to
defeat
an
opponent
first
does
not
count.
It
leaves
an
element
of
choice
with
\
of
a
referent
object
and
therefore
warranting
the
use
of
measures
not
normally
\
the
target
to
comply,
or
to
continue.
(C)
Military
force,
or
the
threat
of
military
\
employed
in
the
political
realm.
(C)
As
an
urgent
threat
to
the
survival
of
\
force,
utilises
fear
to
achieve
strategic
objectives.
Coercive
diplomacy
is
differentiated
\
the
referent
object,
so
serious
that
it
legitimises
the
employment
of
extraordinary
\
from
this
approach,
because
it
does
not
use
fear
as
a
tool
for
coercing
an
adversary.
\
action
in
response.
(D)
As
an
urgent
threat
to
the
survival
of
the
audience
\
(D)
Coercive
diplomacy
is
employed
to
use
force
but
to
limit
its
effects
on
the
\
that
requires
extraordinary
or
emergency
measures.'
\
international
community.
Coercive
diplomacy
is
an
aggressive
strategy
that
is
\
target
:
Let's think step by step. We refer to Wikipedia articles on security studies
\
intended
to
obtain
compliance
through
defeat.
It
does
not
leave
an
element
of
\
for help. To be securitized, a threat must be an urgent threat to the survival
\
choice
with
the
target,
the
target
either
being
forced
to
comply
or
engage
in
\
of the referent object. The answer is (C).
\
conflict.
It
seeks
to
control
by
imposing
compliance
by
removing
any
opportunity
\
-
question
:
'
What
distinguishes
coercive
diplomacy
from
military
force?
\
for
negotiation
or
concession.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
\
articles
on
security
studies
for
help.
Coercive
diplomacy
uses
the
threat
of
force
\
(A)
Compellence
is
another
term
for
coercive
diplomacy,
but
covering
a
narrower
\
to
induce
the
opponent
to
comply
with
demands.
The
answer
is
(B).
\n\n
"
set
of
criteria;
compellence
covers
those
threats
aimed
at
initiating
adversary
"
group"
:
"
mmlu_flan_cot_fewshot_social_sciences"
action.
A
threat
to
coerce
a
state
to
give
up
part
of
its
territory
would
count
"
include"
:
"
_mmlu_flan_cot_fewshot_template_yaml"
as
coercive
diplomacy,
as
long
as
that
threat
proactively
initiates
action
before
"
task"
:
"
mmlu_flan_cot_fewshot_security_studies"
reactive
diplomacy
is
taken.
(B)
Coercive
diplomacy
constitutes
the
threats
of
limited
force
to
induce
adversary'
'
s
incentive
to
comply
with
the
coercer'
'
s
demands.
It
is
an
influence
strategy
that
is
intended
to
obtain
compliance:
the
use
of
force
to
defeat
an
opponent
first
does
not
count.
It
leaves
an
element
of
choice
with
the
target
to
comply,
or
to
continue.
(C)
Military
force,
or
the
threat
of
military
force,
utilises
fear
to
achieve
strategic
objectives.
Coercive
diplomacy
is
differentiated
from
this
approach,
because
it
does
not
use
fear
as
a
tool
for
coercing
an
adversary.
(D)
Coercive
diplomacy
is
employed
to
use
force
but
to
limit
its
effects
on
the
international
community.
Coercive
diplomacy
is
an
aggressive
strategy
that
is
intended
to
obtain
compliance
through
defeat.
It
does
not
leave
an
element
of
choice
with
the
target,
the
target
either
being
forced
to
comply
or
engage
in
conflict.
It
seeks
to
control
by
imposing
compliance
by
removing
any
opportunity
for
negotiation
or
concession.'
target
:
'
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
security
studies
for
help.
Coercive
diplomacy
uses
the
threat
of
force
to
induce
the
opponent
to
comply
with
demands.
The
answer
is
(B).'
group
:
mmlu_flan_cot_fewshot_social_sciences
include
:
_mmlu_flan_cot_fewshot_template_yaml
task
:
mmlu_flan_cot_fewshot_security_studies
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_sociology.yaml
-"dataset_name": "sociology"
-"description": "The following are multiple choice questions (with answers) about sociology.\n\
-  \nQ: Which of the following is not a problem associated with official statistics\
-  \ on strike action?\n(A) most strikes go unnoticed by employers and the mass media\
-  \ (B) not all industrial disputes will be reported by the employer (C) the definition\
-  \ of strikes excludes those that involve fewer than ten workers or last less than\
-  \ one day (D) it is hard to compare strikes that were measured in different ways\n\
-  A: Let's think step by step. We refer to Wikipedia articles on sociology for help.\
-  \ Official statistics on strike action can be problematic because not all industrial\
-  \ disputes will be reported by employers, the definition of strikes excludes those\
-  \ that involves fewer than ten workers or last less than one day, and it is hard\
-  \ to compare strikes that were measured in different ways. Thus, (A) is not a problem\
-  \ associated with official statistics on strike action. The answer is (A).\n\nQ:\
-  \ What does Berger (1963) describe as a metaphor for social reality?\n(A) a fairground\
-  \ ride (B) a circus (C) a puppet theatre (D) a ballet\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on sociology for help. Berger describes social\
-  \ reality using the metaphor of a puppet theatre. The answer is (C).\n\nQ: The term\
-  \ 'hegemony' refers to:\n(A) the tendency for the working class not to realize their\
-  \ own interests (B) a dominant ideology that legitimates economic, political and\
-  \ cultural power (C) a form of dual consciousness based on ideology and everyday\
-  \ experiences (D) a mode of payment given for outstanding topiary\nA: Let's think\
-  \ step by step. We refer to Wikipedia articles on sociology for help. Hegemony refers\
-  \ to a dominant ideology that legitimates economic, policital, and cultural power.\
-  \ The answer is (B).\n\nQ: The shift from 'civil religion' to 'common religion'\
-  \ means that:\n(A) the increasing bureaucracy of the state has made religion only\
-  \ a marginal part of our lives (B) despite the weakening of traditional authority,\
-  \ our everyday lives and 'common sense' remain shaped by religious beliefs and values\
-  \ (C) religious participation in collective worship may have declined, but people\
-  \ still practise their faiths in private (D) people are much more likely to discuss\
-  \ their religious beliefs in public, informal settings\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on sociology for help. The shift from civil religion\
-  \ to common religion means that despite the weakening of traditional authority,\
-  \ our everyday lives and common sense remain shaped by religious beliefs and values.\
-  \ The answer is (B).\n\nQ: Which of the following did the post-war welfare state\
-  \ of 1948 not aim to provide:\n(A) free health care and education for all (B) a\
-  \ minimum wage (C) full employment (D) universal welfare\nA: Let's think step by\
-  \ step. We refer to Wikipedia articles on sociology for help. The post-war welfare\
-  \ state of 1948 aimed to provide free healthcare and education, full employment,\
-  \ and universal welfare. But it did not aim to provide a minimum wage. The answer\
-  \ is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_social_sciences"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_sociology"
+dataset_name: sociology
+description: The following are multiple choice questions (with answers) about sociology.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'Which of the following is not a problem associated with official statistics
+      on strike action?
+
+      (A) most strikes go unnoticed by employers and the mass media (B) not all industrial
+      disputes will be reported by the employer (C) the definition of strikes excludes
+      those that involve fewer than ten workers or last less than one day (D) it is
+      hard to compare strikes that were measured in different ways'
+    target: Let's think step by step. We refer to Wikipedia articles on sociology
+      for help. Official statistics on strike action can be problematic because not
+      all industrial disputes will be reported by employers, the definition of strikes
+      excludes those that involves fewer than ten workers or last less than one day,
+      and it is hard to compare strikes that were measured in different ways. Thus,
+      (A) is not a problem associated with official statistics on strike action. The
+      answer is (A).
+  - question: 'What does Berger (1963) describe as a metaphor for social reality?
+
+      (A) a fairground ride (B) a circus (C) a puppet theatre (D) a ballet'
+    target: Let's think step by step. We refer to Wikipedia articles on sociology
+      for help. Berger describes social reality using the metaphor of a puppet theatre.
+      The answer is (C).
+  - question: 'The term ''hegemony'' refers to:
+
+      (A) the tendency for the working class not to realize their own interests (B)
+      a dominant ideology that legitimates economic, political and cultural power
+      (C) a form of dual consciousness based on ideology and everyday experiences
+      (D) a mode of payment given for outstanding topiary'
+    target: Let's think step by step. We refer to Wikipedia articles on sociology
+      for help. Hegemony refers to a dominant ideology that legitimates economic,
+      policital, and cultural power. The answer is (B).
+  - question: 'The shift from ''civil religion'' to ''common religion'' means that:
+
+      (A) the increasing bureaucracy of the state has made religion only a marginal
+      part of our lives (B) despite the weakening of traditional authority, our everyday
+      lives and ''common sense'' remain shaped by religious beliefs and values (C)
+      religious participation in collective worship may have declined, but people
+      still practise their faiths in private (D) people are much more likely to discuss
+      their religious beliefs in public, informal settings'
+    target: Let's think step by step. We refer to Wikipedia articles on sociology
+      for help. The shift from civil religion to common religion means that despite
+      the weakening of traditional authority, our everyday lives and common sense
+      remain shaped by religious beliefs and values. The answer is (B).
+  - question: 'Which of the following did the post-war welfare state of 1948 not aim
+      to provide:
+
+      (A) free health care and education for all (B) a minimum wage (C) full employment
+      (D) universal welfare'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on sociology for
+      help. The post-war welfare state of 1948 aimed to provide free healthcare and
+      education, full employment, and universal welfare. But it did not aim to provide
+      a minimum wage. The answer is (B).'
+group: mmlu_flan_cot_fewshot_social_sciences
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_sociology
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml
-"dataset_name": "us_foreign_policy"
-"description": "The following are multiple choice questions (with answers) about us\
-  \ foreign policy.\n\nQ: How did Donald Trump attack globalization in the 2016 campaign?\n\
-  (A) Globalization had made men like him too rich (B) Globalization only benefited\
-  \ certain American states, such as New York (C) Liberal elites had encouraged globalization,\
-  \ while 'ordinary Americans' lost jobs because of it (D) Globalization encouraged\
-  \ damaging trade wars\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on us foreign policy for help. Trump attacked globalization because he believed\
-  \ ordinary Americans lost jobs due to it, and so he wanted to blame liberals who\
-  \ had encouraged it. The answer is (C).\n\nQ: How did NSC-68 change U.S. strategy?\n\
-  (A) It globalized containment. (B) It militarized containment. (C) It called for\
-  \ the development of the hydrogen bomb. (D) All of the above\nA: Let's think step\
-  \ by step. We refer to Wikipedia articles on us foreign policy for help. NSC-68\
-  \ outlined a variety of courses of action, including globalization of containment,\
-  \ militarization of contaiment, and the development of the hydrogen bomb. The answer\
-  \ is (D).\n\nQ: How do Defensive Realism and Offensive Realism differ in their explanation\
-  \ of state behaviour?\n(A) Defensive realists place greater emphasis on the role\
-  \ of international institutions (B) Defensive realists place less emphasis on geographical\
-  \ factors (C) Offensive realists give more priority to the national interest than\
-  \ Defensive realists. (D) Defensive realists believe states are security maximizers,\
-  \ while Offensive realists believe states to be power maximizers\nA: Let's think\
-  \ step by step. We refer to Wikipedia articles on us foreign policy for help. While\
-  \ defensive realism advocates that states are security maximizers, offensive realists\
-  \ think of states as power maximizers. The answer is (D).\n\nQ: The realm of policy\
-  \ decisions concerned primarily with relations between the United States and the\
-  \ rest of the world is known as\n(A) terrorism policy. (B) economic policy. (C)\
-  \ foreign policy. (D) international policy.\nA: Let's think step by step. We refer\
-  \ to Wikipedia articles on us foreign policy for help. The topic of policy decisions\
-  \ concerns with relations between the US and the rest of the world is known as foreign\
-  \ policy. The answer is (C).\n\nQ: How did the 2008 financial crisis affect America's\
-  \ international reputation?\n(A) It damaged support for the US model of political\
-  \ economy and capitalism (B) It created anger at the United States for exaggerating\
-  \ the crisis (C) It increased support for American global leadership under President\
-  \ Obama (D) It reduced global use of the US dollar\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on us foreign policy for help. The 2008 financial\
-  \ crisis damanged the international reputation of the American model of political\
-  \ economy and capitalism. The answer is (A).\n\n"
-"group": "mmlu_flan_cot_fewshot_social_sciences"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_us_foreign_policy"
+dataset_name: us_foreign_policy
+description: The following are multiple choice questions (with answers) about us foreign
+  policy.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'How did Donald Trump attack globalization in the 2016 campaign?
+
+      (A) Globalization had made men like him too rich (B) Globalization only benefited
+      certain American states, such as New York (C) Liberal elites had encouraged
+      globalization, while ''ordinary Americans'' lost jobs because of it (D) Globalization
+      encouraged damaging trade wars'
+    target: Let's think step by step. We refer to Wikipedia articles on us foreign
+      policy for help. Trump attacked globalization because he believed ordinary Americans
+      lost jobs due to it, and so he wanted to blame liberals who had encouraged it.
+      The answer is (C).
+  - question: 'How did NSC-68 change U.S. strategy?
+
+      (A) It globalized containment. (B) It militarized containment. (C) It called
+      for the development of the hydrogen bomb. (D) All of the above'
+    target: Let's think step by step. We refer to Wikipedia articles on us foreign
+      policy for help. NSC-68 outlined a variety of courses of action, including globalization
+      of containment, militarization of contaiment, and the development of the hydrogen
+      bomb. The answer is (D).
+  - question: 'How do Defensive Realism and Offensive Realism differ in their explanation
+      of state behaviour?
+
+      (A) Defensive realists place greater emphasis on the role of international institutions
+      (B) Defensive realists place less emphasis on geographical factors (C) Offensive
+      realists give more priority to the national interest than Defensive realists.
+      (D) Defensive realists believe states are security maximizers, while Offensive
+      realists believe states to be power maximizers'
+    target: Let's think step by step. We refer to Wikipedia articles on us foreign
+      policy for help. While defensive realism advocates that states are security
+      maximizers, offensive realists think of states as power maximizers. The answer
+      is (D).
+  - question: 'The realm of policy decisions concerned primarily with relations between
+      the United States and the rest of the world is known as
+
+      (A) terrorism policy. (B) economic policy. (C) foreign policy. (D) international policy.'
+    target: Let's think step by step. We refer to Wikipedia articles on us foreign
+      policy for help. The topic of policy decisions concerns with relations between
+      the US and the rest of the world is known as foreign policy. The answer is (C).
+  - question: 'How did the 2008 financial crisis affect America''s international reputation?
+
+      (A) It damaged support for the US model of political economy and capitalism (B)
+      It created anger at the United States for exaggerating the crisis (C) It increased
+      support for American global leadership under President Obama (D) It reduced
+      global use of the US dollar'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on us foreign policy
+      for help. The 2008 financial crisis damanged the international reputation of
+      the American model of political economy and capitalism. The answer is (A).'
+group: mmlu_flan_cot_fewshot_social_sciences
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_us_foreign_policy
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml
-"dataset_name": "virology"
-"description": "The following are multiple choice questions (with answers) about virology.\n\
-  \nQ: The median survival time to AIDS and death was established by following:\n\
-  (A) Seroprevalent HIV-infected individuals (B) Seronegatives (C) Seroconverters\
-  \ (D) High-risk seronegatives\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on virology for help. The median survival time to AIDS and death was\
-  \ established as a result of the development of seroconverters. The answer is (C).\n\
-  \nQ: Which of the following is a morphological characteristic of the paramyxoviruses.\n\
-  (A) Fragile viruses often visualised with RNA spewing from the inside (B) Elongate\
-  \ viruses (C) Icosahedral viruses with envelope (D) Very large viruses\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on virology for help. Paramyxoviruses\
-  \ are fragile viruses often visualised with RNA spewing from the inside. The answer\
-  \ is (A).\n\nQ: The most important goal of a behavioral intervention is:\n(A) Change\
-  \ in behavior (B) Comprehensive coverage (C) Effective use of behavioral theory\
-  \ (D) Sustained behavior change\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on virology for help. The prim goal of a behavioral intervention is to\
-  \ cause sustained behavior change. The answer is (D).\n\nQ: A key factor facilitating\
-  \ the application of nested case-control studies from the MACS was:\n(A) Data collection\
-  \ (B) Establishment of a repository of biologic specimens (C) Participant interest\
-  \ (D) Administration of the questionnaire by staff\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on virology for help. The Multicenter AIDS Cohort\
-  \ Study's use of nested case-control studies was facilitated by the establishment\
-  \ of a repository of biologic specimens. The answer is (B).\n\nQ: Why are parvoviruses\
-  \ a highly impactful parasite?\n(A) Because they have no nucleic acid (B) They require\
-  \ a helper virus (C) Only replicate in dividing cells (D) Can integrate into host\
-  \ chromosomes\nA: Let's think step by step. We refer to Wikipedia articles on virology\
-  \ for help. Paroviruses are highly impactful because they do not have nucleic acid.\
-  \ The answer is (A).\n\n"
-"group": "mmlu_flan_cot_fewshot_other"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_virology"
+dataset_name: virology
+description: The following are multiple choice questions (with answers) about virology.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'The median survival time to AIDS and death was established by following:
+
+      (A) Seroprevalent HIV-infected individuals (B) Seronegatives (C) Seroconverters
+      (D) High-risk seronegatives'
+    target: Let's think step by step. We refer to Wikipedia articles on virology for
+      help. The median survival time to AIDS and death was established as a result
+      of the development of seroconverters. The answer is (C).
+  - question: 'Which of the following is a morphological characteristic of the paramyxoviruses.
+
+      (A) Fragile viruses often visualised with RNA spewing from the inside (B) Elongate
+      viruses (C) Icosahedral viruses with envelope (D) Very large viruses'
+    target: Let's think step by step. We refer to Wikipedia articles on virology for
+      help. Paramyxoviruses are fragile viruses often visualised with RNA spewing
+      from the inside. The answer is (A).
+  - question: 'The most important goal of a behavioral intervention is:
+
+      (A) Change in behavior (B) Comprehensive coverage (C) Effective use of behavioral
+      theory (D) Sustained behavior change'
+    target: Let's think step by step. We refer to Wikipedia articles on virology for
+      help. The prim goal of a behavioral intervention is to cause sustained behavior
+      change. The answer is (D).
+  - question: 'A key factor facilitating the application of nested case-control studies
+      from the MACS was:
+
+      (A) Data collection (B) Establishment of a repository of biologic specimens
+      (C) Participant interest (D) Administration of the questionnaire by staff'
+    target: Let's think step by step. We refer to Wikipedia articles on virology for
+      help. The Multicenter AIDS Cohort Study's use of nested case-control studies
+      was facilitated by the establishment of a repository of biologic specimens.
+      The answer is (B).
+  - question: 'Why are parvoviruses a highly impactful parasite?
+
+      (A) Because they have no nucleic acid (B) They require a helper virus (C) Only
+      replicate in dividing cells (D) Can integrate into host chromosomes'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on virology for
+      help. Paroviruses are highly impactful because they do not have nucleic acid.
+      The answer is (A).'
+group: mmlu_flan_cot_fewshot_other
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_virology
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_world_religions.yaml
-"dataset_name": "world_religions"
-"description": "The following are multiple choice questions (with answers) about world\
-  \ religions.\n\nQ: How can the Upanishads be characterized?\n(A) Ritual texts (B)\
-  \ Philosophical texts (C) Hymns (D) Origin stories\nA: Let's think step by step.\
-  \ We refer to Wikipedia articles on world religions for help. The Upanishads are\
-  \ the most recent part of Vedas (the oldest scriptures in Hinduism) and supplied\
-  \ the basis of later Hindu philosophy. So they are philosophical texts. The answer\
-  \ is (B).\n\nQ: What is the Second Gem in Buddhism?\n(A) The Dharma (B) The Sangha\
-  \ (C) The Buddha (D) The Bodhisattva\nA: Let's think step by step. We refer to Wikipedia\
-  \ articles on world religions for help. The Second Gem in Buddhism is The Dharma.\
-  \ The answer is (A).\n\nQ: Which Japanese government promoted a kind of national\
-  \ cult based on the emperor and his associations with kami?\n(A) Honen (B) Tanaka\
-  \ (C) Tokugawa (D) Meiji\nA: Let's think step by step. We refer to Wikipedia articles\
-  \ on world religions for help. The promotion of a national cult based on the emperor\
-  \ and his associations with Kami happened during the reign of Emperor Meiji (1852-1912).\
-  \ The answer is (D).\n\nQ: In which dynasty was the \"Mandate of Heaven\" developed\
-  \ to legitimatize the new rulers?\n(A) Shang (B) Zhou (C) Han (D) Xia\nA: Let's\
-  \ think step by step. We refer to Wikipedia articles on world religions for help.\
-  \ The \"Mandate of Heaven\" was developed as an ancient Chinese philosophical concept\
-  \ during the Zhou Dynasty (1046-256 BCE). The answer is (B).\n\nQ: What is the sign\
-  \ of the covenant for Jewish males?\n(A) The rainbow (B) Circumcision (C) A son\
-  \ (D) Bar mitzvah\nA: Let's think step by step. We refer to Wikipedia articles on\
-  \ world religions for help. In Judaism, the most distinctive sign of the covenant\
-  \ is circumcision (brit milah). The answer is (B).\n\n"
-"group": "mmlu_flan_cot_fewshot_humanities"
-"include": "_mmlu_flan_cot_fewshot_template_yaml"
-"task": "mmlu_flan_cot_fewshot_world_religions"
+dataset_name: world_religions
+description: The following are multiple choice questions (with answers) about world
+  religions.
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: 'How can the Upanishads be characterized?
+
+      (A) Ritual texts (B) Philosophical texts (C) Hymns (D) Origin stories'
+    target: Let's think step by step. We refer to Wikipedia articles on world religions
+      for help. The Upanishads are the most recent part of Vedas (the oldest scriptures
+      in Hinduism) and supplied the basis of later Hindu philosophy. So they are philosophical
+      texts. The answer is (B).
+  - question: 'What is the Second Gem in Buddhism?
+
+      (A) The Dharma (B) The Sangha (C) The Buddha (D) The Bodhisattva'
+    target: Let's think step by step. We refer to Wikipedia articles on world religions
+      for help. The Second Gem in Buddhism is The Dharma. The answer is (A).
+  - question: 'Which Japanese government promoted a kind of national cult based on the
+      emperor and his associations with kami?
+
+      (A) Honen (B) Tanaka (C) Tokugawa (D) Meiji'
+    target: Let's think step by step. We refer to Wikipedia articles on world religions
+      for help. The promotion of a national cult based on the emperor and his associations
+      with Kami happened during the reign of Emperor Meiji (1852-1912). The answer
+      is (D).
+  - question: 'In which dynasty was the "Mandate of Heaven" developed to legitimatize
+      the new rulers?
+
+      (A) Shang (B) Zhou (C) Han (D) Xia'
+    target: Let's think step by step. We refer to Wikipedia articles on world religions
+      for help. The "Mandate of Heaven" was developed as an ancient Chinese philosophical
+      concept during the Zhou Dynasty (1046-256 BCE). The answer is (B).
+  - question: 'What is the sign of the covenant for Jewish males?
+
+      (A) The rainbow (B) Circumcision (C) A son (D) Bar mitzvah'
+    target: 'Let''s think step by step. We refer to Wikipedia articles on world religions
+      for help. In Judaism, the most distinctive sign of the covenant is circumcision
+      (brit milah). The answer is (B).'
+group: mmlu_flan_cot_fewshot_humanities
+include: _mmlu_flan_cot_fewshot_template_yaml
+task: mmlu_flan_cot_fewshot_world_religions
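The change above moves the few-shot examples out of a single `description` string and into `fewshot_config.samples`, selected with `sampler: first_n`. A minimal Python sketch of what that selection could look like follows. The helper names (`first_n`, `build_prompt`) are hypothetical, and the `Q: ...\nA: ...` layout is an assumption modeled on the removed description string, not the harness's actual code.

```python
# Two entries copied from the new `samples` list (targets shortened with "...").
SAMPLES = [
    {
        "question": "How can the Upanishads be characterized?\n"
                    "(A) Ritual texts (B) Philosophical texts (C) Hymns (D) Origin stories",
        "target": "Let's think step by step. ... The answer is (B).",
    },
    {
        "question": "What is the Second Gem in Buddhism?\n"
                    "(A) The Dharma (B) The Sangha (C) The Buddha (D) The Bodhisattva",
        "target": "Let's think step by step. ... The answer is (A).",
    },
]


def first_n(samples, n):
    """Hypothetical sampler: take the first n samples in file order, no shuffling."""
    if n > len(samples):
        raise ValueError(f"requested {n} shots but only {len(samples)} samples exist")
    return samples[:n]


def build_prompt(samples, n, new_question):
    """Join n worked examples and the new question (assumed 'Q:/A:' layout)."""
    shots = [f"Q: {s['question']}\nA: {s['target']}" for s in first_n(samples, n)]
    return "\n\n".join(shots + [f"Q: {new_question}\nA:"])


if __name__ == "__main__":
    print(build_prompt(SAMPLES, 1, "What is the sign of the covenant for Jewish males?"))
```

Because the selection is deterministic, every run with the same shot count sees the same examples in the same order.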
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml
@@ -8,7 +8,7 @@ filter_list:
   - name: "strict-match"
     filter:
       - function: "regex"
-        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
       - function: "take_first"
   - name: "flexible-extract"
     filter:
@@ -33,4 +33,6 @@ metric_list:
     ignore_case: true
     ignore_punctuation: true
 metadata:
-  version: 1.0
+  version: 2.0
+dataset_kwargs:
+  trust_remote_code: true
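The only change to `regex_pattern` is that the second alternative drops the leading `the`, so any text ending in `answer is ` now qualifies as a prefix. The sketch below compares the two patterns with Python's `re.search`. The `extract` helper is hypothetical, and how the harness applies the pattern internally is not shown in this diff.

```python
import re

# Patterns copied from the removed and added lines of the diff.
OLD_PATTERN = (r"((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)"
               r"|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))")
NEW_PATTERN = (r"((?<=The answer is )(.*)(?=.)|(?<=answer is )(.*)(?=.)"
               r"|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))")


def extract(pattern, text):
    """Hypothetical helper: return the first captured answer, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for text in ["The answer is (B).", "The final answer is (C)."]:
        print(repr(text), extract(OLD_PATTERN, text), extract(NEW_PATTERN, text))
```

The trailing `(?=.)` lookahead forces the greedy `(.*)` to give back one character, which is why the final period is not part of the captured answer.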
lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml
@@ -2,8 +2,10 @@ group: mmlu_flan_n_shot_generative
 dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
 test_split: test
 fewshot_split: dev
+fewshot_config:
+  sampler: first_n
 output_type: generate_until
-doc_to_text: "Q: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA: "
+doc_to_text: "Q: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA:"
 doc_to_target: "{{['(A)', '(B)', '(C)', '(D)'][answer]}}"
 filter_list:
   - name: "strict-match"
@@ -22,9 +24,12 @@ generation_kwargs:
     - "</s>"
     - "Q:"
     - "<|im_end|>"
+    - "\n"
 metric_list:
   - metric: exact_match
     aggregation: mean
     higher_is_better: true
 metadata:
-  version: 1.0
+  version: 2.0
+dataset_kwargs:
+  trust_remote_code: true
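The second hunk adds `"\n"` to the stop sequences under `generation_kwargs`. The sketch below illustrates the likely effect: generation is cut at the earliest stop sequence, so a newline now ends the answer before any following text. `truncate_at_stop` is a hypothetical helper, not the harness's implementation.

```python
OLD_STOPS = ["</s>", "Q:", "<|im_end|>"]
NEW_STOPS = OLD_STOPS + ["\n"]


def truncate_at_stop(text, stops):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


if __name__ == "__main__":
    raw = " (B)\n\nQ: What is the Second Gem in Buddhism?"
    print(repr(truncate_at_stop(raw, OLD_STOPS)))  # keeps the trailing newlines
    print(repr(truncate_at_stop(raw, NEW_STOPS)))  # stops at the first newline
```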
lm_eval/tasks/mmlu/flan_n_shot/loglikelihood/_mmlu_flan_loglikelihood_template_yaml
 dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
 test_split: test
 fewshot_split: dev
+fewshot_config:
+  sampler: first_n
 output_type: multiple_choice
-doc_to_text: "Q: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA: "
+doc_to_text: "Q: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA:"
 doc_to_choice: ["(A)", "(B)", "(C)", "(D)"]
 doc_to_target: answer
 metric_list:
   - metric: acc
     aggregation: mean
     higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
 metadata:
-  version: 0.0
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
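This file now reports `acc_norm` alongside `acc`. The sketch below assumes `acc_norm` divides each choice's log-likelihood by the character length of that choice before taking the argmax. That normalization rule is an assumption for illustration and is not stated in this diff. The `pick` helper and all scores are made up.

```python
def pick(loglikelihoods, choices, normalize=False):
    """Return the index of the best-scoring choice, optionally length-normalized."""
    scores = [
        ll / len(choice) if normalize else ll
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Choices from `doc_to_choice`; the scores are illustrative only.
    choices = ["(A)", "(B)", "(C)", "(D)"]
    lls = [-2.3, -0.4, -1.9, -3.0]
    print(pick(lls, choices), pick(lls, choices, normalize=True))

    # Hypothetical unequal-length choices, where normalization changes the pick.
    print(pick([-3.0, -6.5], ["yes", "certainly not"]),
          pick([-3.0, -6.5], ["yes", "certainly not"], normalize=True))
```

Under that assumption, the four choices here are all three characters long, so dividing by length cannot change the argmax and the two metrics would agree on every document for this template.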
lm_eval/tasks/mmlu/generative/_default_template_yaml
0 → 100644
+dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
+test_split: test
+fewshot_split: dev
+fewshot_config:
+  sampler: first_n
+output_type: generate_until
+doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
+doc_to_target: "{{['A', 'B', 'C', 'D'][answer]}}"
+generation_kwargs:
+  until:
+    - "</s>"
+    - "\n"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
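The new default template builds its prompt and target from Jinja expressions. The sketch below is a plain-Python equivalent of `doc_to_text` and `doc_to_target`, plus a simple exact-match check. The example document is hypothetical (its content is borrowed from the world religions samples). Stripping whitespace before comparison is an assumption, not confirmed harness behavior.

```python
def doc_to_text(doc):
    """Plain-Python equivalent of the template's `doc_to_text` string."""
    a, b, c, d = doc["choices"]
    return f"{doc['question'].strip()}\nA. {a}\nB. {b}\nC. {c}\nD. {d}\nAnswer:"


def doc_to_target(doc):
    """Plain-Python equivalent of `{{['A', 'B', 'C', 'D'][answer]}}`."""
    return ["A", "B", "C", "D"][doc["answer"]]


def exact_match(generation, doc):
    """1.0 if the generation equals the target letter (whitespace stripped: assumption)."""
    return float(generation.strip() == doc_to_target(doc))


if __name__ == "__main__":
    doc = {
        "question": " What is the Second Gem in Buddhism? ",
        "choices": ["The Dharma", "The Sangha", "The Buddha", "The Bodhisattva"],
        "answer": 0,
    }
    print(doc_to_text(doc))
    print(doc_to_target(doc), exact_match(" A", doc), exact_match("B", doc))
```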