gaoqiong / lm-evaluation-harness
Commit 109ed1c7, authored Oct 17, 2023 by lintangsutawika
added subgroups for other mmlu variants
parent 93a45962
Changes: 238
Showing 20 changed files with 654 additions and 966 deletions
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_logical_fallacies.yaml  +45 -71
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_machine_learning.yaml  +46 -45
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_management.yaml  +33 -54
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_marketing.yaml  +40 -66
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_medical_genetics.yaml  +37 -61
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_miscellaneous.yaml  +27 -54
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_disputes.yaml  +49 -72
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_scenarios.yaml  +44 -66
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_nutrition.yaml  +48 -72
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_philosophy.yaml  +25 -25
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_prehistory.yaml  +42 -67
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_accounting.yaml  +7 -6
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_law.yaml  +5 -4
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_medicine.yaml  +18 -17
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_psychology.yaml  +30 -29
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_public_relations.yaml  +39 -65
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_security_studies.yaml  +5 -4
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_sociology.yaml  +43 -67
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml  +40 -66
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml  +31 -55
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_logical_fallacies.yaml
dataset_name: logical_fallacies
description: 'The following are multiple choice questions (with answers) about logical fallacies.


  Q: When an arguer causes confusion during refutation because of real or feigned lack of an ability to engage in refutation, that arguer may have committed the fallacy of

  (A) poor sportsmanship (B) appeal to compassion (C) argument against the person (D) ignorance of refutation

  A: Let''s think step by step. We refer to Wikipedia articles on logical fallacies for help. Ignorance of refutation, one of Aristotle''s original list of logical fallacies in his Organon, is when someone causes confusion in an argument through real or feigned inability to engage in refutation, in order to win the argument. The answer is (D).


  Q: The complex question fallacy consists of

  (A) arguing something is inferior just because it doesn''t do something it was never intended to do. (B) including more than one claim in the proposition and treating proof for one claim as proof for all the claims. (C) drawing a conclusion before examining the evidence, and only considering evidence that supports that conclusion. (D) asking a question that includes either an unproven assumption or more than one question, thus making a straightforward yes or no answer meaningless.

  A: Let''s think step by step. We refer to Wikipedia articles on logical fallacies for help. The complex question fallacy is when someone makes a single yes or no answer to a question meaningless, by including either an unproven assumption or many questions. The latter is also known as the many questions fallacy. The answer is (D).


  Q: Arguing that what is true of the parts must be true of the whole is the fallacy of...

  (A) Division (B) Composition (C) Appeal to the person (D) Appeal to ignorance

  A: Let''s think step by step. We refer to Wikipedia articles on logical fallacies for help. Fallacy of composition occurs when someone argues what is true of the parts must be true of the whole. The answer is (B).


  Q: Which of the following is true of a valid categorical syllogism?

  (A) The minor premise must deny the antecedent (B) The major premise must affirm the consequent (C) The middle term must be used in at least one premise in a universal or unqualified sense (D) All of the above

  A: Let''s think step by step. We refer to Wikipedia articles on logical fallacies for help. A valid categorical syllogism must satisfy several conditions: (1) the syllogism must have exactly three terms (2) every term of the syllogism must be used twice exactly, (3) a term may be used only once in any premise, and (4) the middle term must be used in at least one premise in a universal or unqualified sense, etc. Only (C) is true. The answer is (C).


  Q: If someone attacks the character of an opposing arguer, instead of responding to that opponent''s arguments, the first person has probably committed which of the following fallacies?

  (A) tu quoque (B) horse laugh (C) argument against the person (D) ignoratio elenchi

  A: Let''s think step by step. We refer to Wikipedia articles on logical fallacies for help. The argument against the person fallacy occurs when someone irrelevantly attacks the character of an opposing arguer, instead of addressing that opponent''s arguments. The answer is (C).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_logical_fallacies
"dataset_name": "logical_fallacies"
"description": "The following are multiple choice questions (with answers) about logical\
  \ fallacies.\n\nQ: When an arguer causes confusion during refutation because of\
  \ real or feigned lack of an ability to engage in refutation, that arguer may have\
  \ committed the fallacy of\n(A) poor sportsmanship (B) appeal to compassion (C)\
  \ argument against the person (D) ignorance of refutation\nA: Let's think step by\
  \ step. We refer to Wikipedia articles on logical fallacies for help. Ignorance\
  \ of refutation, one of Aristotle's original list of logical fallacies in his Organon,\
  \ is when someone causes confusion in an argument through real or feigned inability\
  \ to engage in refutation, in order to win the argument. The answer is (D).\n\n\
  Q: The complex question fallacy consists of\n(A) arguing something is inferior just\
  \ because it doesn't do something it was never intended to do. (B) including more\
  \ than one claim in the proposition and treating proof for one claim as proof for\
  \ all the claims. (C) drawing a conclusion before examining the evidence, and only\
  \ considering evidence that supports that conclusion. (D) asking a question that\
  \ includes either an unproven assumption or more than one question, thus making\
  \ a straightforward yes or no answer meaningless.\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on logical fallacies for help. The complex question\
  \ fallacy is when someone makes a single yes or no answer to a question meaningless,\
  \ by including either an unproven assumption or many questions. The latter is also\
  \ known as the many questions fallacy. The answer is (D).\n\nQ: Arguing that what\
  \ is true of the parts must be true of the whole is the fallacy of...\n(A) Division\
  \ (B) Composition (C) Appeal to the person (D) Appeal to ignorance\nA: Let's think\
  \ step by step. We refer to Wikipedia articles on logical fallacies for help. Fallacy\
  \ of composition occurs when someone argues what is true of the parts must be true\
  \ of the whole. The answer is (B).\n\nQ: Which of the following is true of a valid\
  \ categorical syllogism?\n(A) The minor premise must deny the antecedent (B) The\
  \ major premise must affirm the consequent (C) The middle term must be used in at\
  \ least one premise in a universal or unqualified sense (D) All of the above\nA:\
  \ Let's think step by step. We refer to Wikipedia articles on logical fallacies\
  \ for help. A valid categorical syllogism must satisfy several conditions: (1) the\
  \ syllogism must have exactly three terms (2) every term of the syllogism must be\
  \ used twice exactly, (3) a term may be used only once in any premise, and (4) the\
  \ middle term must be used in at least one premise in a universal or unqualified\
  \ sense, etc. Only (C) is true. The answer is (C).\n\nQ: If someone attacks the\
  \ character of an opposing arguer, instead of responding to that opponent's arguments,\
  \ the first person has probably committed which of the following fallacies?\n(A)\
  \ tu quoque (B) horse laugh (C) argument against the person (D) ignoratio elenchi\n\
  A: Let's think step by step. We refer to Wikipedia articles on logical fallacies\
  \ for help. The argument against the person fallacy occurs when someone irrelevantly\
  \ attacks the character of an opposing arguer, instead of addressing that opponent's\
  \ arguments. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_logical_fallacies"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_machine_learning.yaml
dataset_name: machine_learning
description: "The following are multiple choice questions (with answers) about machine\
"dataset_name": "machine_learning"
"description": "The following are multiple choice questions (with answers) about machine\
  \ learning.\n\nQ: Which image data augmentation is most common for natural images?\n\
  (A) random crop and horizontal flip (B) random crop and vertical flip (C) posterization\
  \ (D) dithering\nA: Let's think step by step. Data augmentation is used to increase\
...
...
@@ -12,48 +12,49 @@ description: "The following are multiple choice questions (with answers) about m
  \ learning we consider a binary split according to whether the attribute is above\
  \ or below some threshold. Pat suggests that instead we should just have a multiway\
  \ split with one branch for each of the distinct values of the attribute. From the\
  \ list below choose the single biggest problem with Pat\u2019s suggestion:\n(A)\
  \ It is too computationally expensive. (B) It would probably result in a decision\
  \ tree that scores badly on the training set and a testset. (C) It would probably\
  \ result in a decision tree that scores well on the training set but badly on a\
  \ testset. (D) It would probably result in a decision tree that scores well on a\
  \ testset but badly on a training set.\nA: Let's think step by step. Because the\
  \ input is real valued, it is unlikely that the same values appear both at training\
  \ and test time. This means that while such a decision tree could yield good performance\
  \ list below choose the single biggest problem with Pat’s suggestion:\n(A) It is\
  \ too computationally expensive. (B) It would probably result in a decision tree\
  \ that scores badly on the training set and a testset. (C) It would probably result\
  \ in a decision tree that scores well on the training set but badly on a testset.\
  \ (D) It would probably result in a decision tree that scores well on a testset\
  \ but badly on a training set.\nA: Let's think step by step. Because the input is\
  \ real valued, it is unlikely that the same values appear both at training and test\
  \ time. This means that while such a decision tree could yield good performance\
  \ on the training data, when evaluated on the test data it will perform badly because\
  \ the decision tree won\u2019t know what to do with numbers that did not appear\
  \ in the training data. The answer is (C).\n\nQ: You are reviewing papers for the\
  \ World\u2019s Fanciest Machine Learning Conference, and you see submissions with\
  \ the following claims. Which ones would you consider accepting?\n(A) My method\
  \ achieves a training error lower than all previous methods! (B) My method achieves\
  \ a test error lower than all previous methods! (Footnote: When regularisation parameter\
  \ \u03BB is chosen so as to minimise test error.) (C) My method achieves a test\
  \ the decision tree won’t know what to do with numbers that did not appear in the\
  \ training data. The answer is (C).\n\nQ: You are reviewing papers for the World’s\
  \ Fanciest Machine Learning Conference, and you see submissions with the following\
  \ claims. Which ones would you consider accepting?\n(A) My method achieves a training\
  \ error lower than all previous methods! (B) My method achieves a test error lower\
  \ than all previous methods! (Footnote: When regularisation parameter λ is chosen\
  \ so as to minimise test error.) (C) My method achieves a test error lower than\
  \ all previous methods! (Footnote: When regularisation parameter λ is chosen so\
  \ as to minimise cross-validaton error.) (D) My method achieves a cross-validation\
  \ error lower than all previous methods! (Footnote: When regularisation parameter\
  \ \u03BB is chosen so as to minimise cross-validaton error.) (D) My method achieves\
  \ a cross-validation error lower than all previous methods! (Footnote: When regularisation\
  \ parameter \u03BB is chosen so as to minimise cross-validaton error.)\nA: Let's\
  \ think step by step. In machine learning, we train with some data and fixed hyperparameters\
  \ and the training error can be arbitrarily low, so (A) can\u2019t be right. Then,\
  \ one compares different hyperparameters by selecting the model with the lowest\
  \ cross-validation error, this means that (B) and (D) are not the right procedure.\
  \ The only relevant number after these is the test error and thus (C) is the right\
  \ answer. The answer is (C).\n\nQ: A 6-sided die is rolled 15 times and the results\
  \ are: side 1 comes up 0 times; side 2: 1 time; side 3: 2 times; side 4: 3 times;\
  \ side 5: 4 times; side 6: 5 times. Based on these results, what is the probability\
  \ of side 3 coming up when using Add-1 Smoothing?\n(A) 2.0/15 (B) 1.0/7 (C) 3.0/16\
  \ (D) 1.0/5\nA: Let's think step by step. Add-1 smoothing adds the value of one\
  \ to the different counts and then normalizes the probabilities accordingly. The\
  \ counts after adding one will be: side 1 comes up 1 time; side 2: 2 times; side\
  \ 3: 3 times; side 4: 4 times; side 5: 5 times; side 6: 6 times. The number of sum\
  \ one die rolls will be 21, so the probability of drawing a three is 3/21 = 1/7.\
  \ The answer is (B).\n\nQ: To achieve an 0/1 loss estimate that is less than 1 percent\
  \ of the true 0/1 loss (with probability 95%), according to Hoeffding's inequality\
  \ the IID test set must have how many examples?\n(A) around 10 examples (B) around\
  \ 100 examples (C) between 100 and 500 examples (D) more than 1000 examples\nA:\
  \ Let's think step by step. By the Hoeffding\u2019s inequality, we expect that with\
  \ 95% probability the in-sample and out-of-sample errors differ by epsilon when\
  \ we have N samples if 2 exp(-2 epsilon^2 N)<0.05, this implies that N > -1/(2*epsilon**2)\
  \ log ( 0.05/2 )= log (40)*5000. Since log(40)>1, we have that one needs more than\
  \ 1000 examples. The answer is (D)."
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_machine_learning
  \ λ is chosen so as to minimise cross-validaton error.)\nA: Let's think step by\
  \ step. In machine learning, we train with some data and fixed hyperparameters and\
  \ the training error can be arbitrarily low, so (A) can’t be right. Then, one compares\
  \ different hyperparameters by selecting the model with the lowest cross-validation\
  \ error, this means that (B) and (D) are not the right procedure. The only relevant\
  \ number after these is the test error and thus (C) is the right answer. The answer\
  \ is (C).\n\nQ: A 6-sided die is rolled 15 times and the results are: side 1 comes\
  \ up 0 times; side 2: 1 time; side 3: 2 times; side 4: 3 times; side 5: 4 times;\
  \ side 6: 5 times. Based on these results, what is the probability of side 3 coming\
  \ up when using Add-1 Smoothing?\n(A) 2.0/15 (B) 1.0/7 (C) 3.0/16 (D) 1.0/5\nA:\
  \ Let's think step by step. Add-1 smoothing adds the value of one to the different\
  \ counts and then normalizes the probabilities accordingly. The counts after adding\
  \ one will be: side 1 comes up 1 time; side 2: 2 times; side 3: 3 times; side 4:\
  \ 4 times; side 5: 5 times; side 6: 6 times. The number of sum one die rolls will\
  \ be 21, so the probability of drawing a three is 3/21 = 1/7. The answer is (B).\n\
  \nQ: To achieve an 0/1 loss estimate that is less than 1 percent of the true 0/1\
  \ loss (with probability 95%), according to Hoeffding's inequality the IID test\
  \ set must have how many examples?\n(A) around 10 examples (B) around 100 examples\
  \ (C) between 100 and 500 examples (D) more than 1000 examples\nA: Let's think step\
  \ by step. By the Hoeffding’s inequality, we expect that with 95% probability the\
  \ in-sample and out-of-sample errors differ by epsilon when we have N samples if\
  \ 2 exp(-2 epsilon^2 N)<0.05, this implies that N > -1/(2*epsilon**2) log ( 0.05/2\
  \ )= log (40)*5000. Since log(40)>1, we have that one needs more than 1000 examples.\
  \ The answer is (D)."
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_machine_learning"
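
The two numeric few-shot answers in the machine_learning description above (Add-1 smoothing and the Hoeffding sample bound) can be checked independently. The following is a minimal sketch, not part of the commit; the function names are hypothetical, and it assumes that `log` in the prompt means the natural logarithm, that epsilon is 0.01 (1 percent) and that the failure probability is 0.05.

```python
import math
from fractions import Fraction


def add_one_smoothing(counts):
    """Return add-1 (Laplace) smoothed probabilities for a list of counts."""
    smoothed = [c + 1 for c in counts]  # add one to every observed count
    total = sum(smoothed)               # renormalise over the new total
    return [Fraction(c, total) for c in smoothed]


def hoeffding_sample_bound(epsilon, delta):
    """Return the value N must exceed so that 2*exp(-2*epsilon**2*N) < delta.

    Rearranging gives N > ln(2/delta) / (2*epsilon**2).
    """
    return math.log(2.0 / delta) / (2.0 * epsilon ** 2)


if __name__ == "__main__":
    # Die counts from the prompt: sides 1..6 came up 0, 1, 2, 3, 4, 5 times.
    probs = add_one_smoothing([0, 1, 2, 3, 4, 5])
    print("P(side 3) =", probs[2])

    # Assumed values: epsilon = 0.01, delta = 0.05 (95% probability).
    print("N must exceed", hoeffding_sample_bound(0.01, 0.05))
```

Under these assumptions the smoothed probability of side 3 is 3/21 = 1/7, matching answer (B), and the bound equals ln(40) * 5000, which is well above 1000 and so consistent with answer (D).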
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_management.yaml
dataset_name: management
description: 'The following are multiple choice questions (with answers) about management.


  Q: How can organisational structures that are characterised by democratic and inclusive styles of management be described?

  (A) Hierarchical (B) Bureaucratic (C) Flat (D) Functional

  A: Let''s think step by step. We refer to Wikipedia articles on management for help. Flat organizational structures are characterized by democratic and inclusive styles of management, and have few (if any) levels of management between the workers and managers. The answer is (C).


  Q: Hygiene factors are associated with which writer?

  (A) Frederick Hertzberg (B) D.C. McClelland (C) Abraham Maslow (D) Douglas McGregor

  A: Let''s think step by step. We refer to Wikipedia articles on management for help. Hygiene factors include compensation, company policies, supervision, interpersonal relations, and work environments. Hertzberg lists them as factors that cannot motivate employees but can minimize job dissatisfaction. The answer is (A).


  Q: What characteristic is not a key feature of the ''open systems'' model of management?

  (A) Morale (B) Innovation (C) Growth resource (D) Adaptation

  A: Let''s think step by step. We refer to Wikipedia articles on management for help. The key characteristics of an open system in management include innovation, growth resource, and adaption, but do not include morale. The answer is (A).


  Q: Which element of the cultural web forms regalia?

  (A) Symbols (B) Rituals and routines (C) Power structures (D) Control systems

  A: Let''s think step by step. We refer to Wikipedia articles on management for help. The cultural web is a tool for mapping an organization''s culture, where symbols form the regalia that visually expresses the values that the organization holds as important. The answer is (A).


  Q: What are the two main dimensions of the Ohio Studies into leadership?

  (A) Starting position and end position (B) Initial environment and changed environment (C) Organisational structure and conditioning (D) Initiating structure and considerations

  A: Let''s think step by step. We refer to Wikipedia articles on management for help. The Ohio State Leadership Studies conducted in the 1940s identified initiating structure and consideration as the two main dimensions of leader behavior. The answer is (D).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_management
"dataset_name": "management"
"description": "The following are multiple choice questions (with answers) about management.\n\
  \nQ: How can organisational structures that are characterised by democratic and\
  \ inclusive styles of management be described?\n(A) Hierarchical (B) Bureaucratic\
  \ (C) Flat (D) Functional\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on management for help. Flat organizational structures are characterized by democratic\
  \ and inclusive styles of management, and have few (if any) levels of management\
  \ between the workers and managers. The answer is (C).\n\nQ: Hygiene factors are\
  \ associated with which writer?\n(A) Frederick Hertzberg (B) D.C. McClelland (C)\
  \ Abraham Maslow (D) Douglas McGregor\nA: Let's think step by step. We refer to\
  \ Wikipedia articles on management for help. Hygiene factors include compensation,\
  \ company policies, supervision, interpersonal relations, and work environments.\
  \ Hertzberg lists them as factors that cannot motivate employees but can minimize\
  \ job dissatisfaction. The answer is (A).\n\nQ: What characteristic is not a key\
  \ feature of the 'open systems' model of management?\n(A) Morale (B) Innovation\
  \ (C) Growth resource (D) Adaptation\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on management for help. The key characteristics of an open system in\
  \ management include innovation, growth resource, and adaption, but do not include\
  \ morale. The answer is (A).\n\nQ: Which element of the cultural web forms regalia?\n\
  (A) Symbols (B) Rituals and routines (C) Power structures (D) Control systems\n\
  A: Let's think step by step. We refer to Wikipedia articles on management for help.\
  \ The cultural web is a tool for mapping an organization's culture, where symbols\
  \ form the regalia that visually expresses the values that the organization holds\
  \ as important. The answer is (A).\n\nQ: What are the two main dimensions of the\
  \ Ohio Studies into leadership?\n(A) Starting position and end position (B) Initial\
  \ environment and changed environment (C) Organisational structure and conditioning\
  \ (D) Initiating structure and considerations\nA: Let's think step by step. We refer\
  \ to Wikipedia articles on management for help. The Ohio State Leadership Studies\
  \ conducted in the 1940s identified initiating structure and consideration as the\
  \ two main dimensions of leader behavior. The answer is (D)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_management"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_marketing.yaml
dataset_name: marketing
description: 'The following are multiple choice questions (with answers) about marketing.


  Q: Although the content and quality can be as controlled as direct mail, response rates of this medium are lower because of the lack of a personal address mechanism. This media format is known as:

  (A) Care lines. (B) Direct mail. (C) Inserts. (D) Door to door.

  A: Let''s think step by step. We refer to Wikipedia articles on marketing for help. Door to door marketing delivers non-addressed items within all buildings within a geographic area. While it can control the content and quality as well as direct mail marketing, its response rate is lower because of the lack of a personal address mechanism. The answer is (D).


  Q: In an organization, the group of people tasked with buying decisions is referred to as the _______________.

  (A) Outsourcing unit. (B) Procurement centre. (C) Chief executive unit. (D) Decision-making unit.

  A: Let''s think step by step. We refer to Wikipedia articles on marketing for help. In an organization, the group of the people tasked with buying decision is referred to as the decision-making unit. The answer is (D).


  Q: The single group within society that is most vulnerable to reference group influence is:

  (A) The older consumer who feels somewhat left out of things. (B) The married women, many of whom feel a need for stability in their lives. (C) New immigrants who really want to assimilate into their new culture. (D) Children, who base most of their buying decisions on outside influences.

  A: Let''s think step by step. We refer to Wikipedia articles on marketing for help. Children, who mostly based their buying decisions on outside influences, are the single group within society that is more vulnerable to reference group influence. The answer is (D).


  Q: Which of the following is an assumption in Maslow''s hierarchy of needs?

  (A) Needs are dependent on culture and also on social class. (B) Lower-level needs must be at least partially satisfied before higher needs can affect behaviour. (C) Needs are not prioritized or arranged in any particular order. (D) Satisfied needs are motivators, and new needs emerge when current needs remain unmet.

  A: Let''s think step by step. We refer to Wikipedia articles on marketing for help. Maslow''s hierarchy of needs, from the bottom upwards, are physiological (food and clothing), safety, love and belonging needs, esteem, and self-actualization. Lower-level needs must be at least partially satisfied before higher ones can affect behavior. The answer is (B).


  Q: _____________ is a natural outcome when combining demographic and geographic variables.

  (A) Geodemographics (B) Product differentiation. (C) ANSOFF matrix. (D) Brand management.

  A: Let''s think step by step. We refer to Wikipedia articles on marketing for help. Geodemographics is a natural outcome when combining demographic and geographic variables. The answer is (A).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_marketing
"dataset_name": "marketing"
"description": "The following are multiple choice questions (with answers) about marketing.\n\
  \nQ: Although the content and quality can be as controlled as direct mail, response\
  \ rates of this medium are lower because of the lack of a personal address mechanism.\
  \ This media format is known as:\n(A) Care lines. (B) Direct mail. (C) Inserts.\
  \ (D) Door to door.\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on marketing for help. Door to door marketing delivers non-addressed items within\
  \ all buildings within a geographic area. While it can control the content and quality\
  \ as well as direct mail marketing, its response rate is lower because of the lack\
  \ of a personal address mechanism. The answer is (D).\n\nQ: In an organization,\
  \ the group of people tasked with buying decisions is referred to as the _______________.\n\
  (A) Outsourcing unit. (B) Procurement centre. (C) Chief executive unit. (D) Decision-making\
  \ unit.\nA: Let's think step by step. We refer to Wikipedia articles on marketing\
  \ for help. In an organization, the group of the people tasked with buying decision\
  \ is referred to as the decision-making unit. The answer is (D).\n\nQ: The single\
  \ group within society that is most vulnerable to reference group influence is:\n\
  (A) The older consumer who feels somewhat left out of things. (B) The married women,\
  \ many of whom feel a need for stability in their lives. (C) New immigrants who\
  \ really want to assimilate into their new culture. (D) Children, who base most\
  \ of their buying decisions on outside influences.\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on marketing for help. Children, who mostly based\
  \ their buying decisions on outside influences, are the single group within society\
  \ that is more vulnerable to reference group influence. The answer is (D).\n\nQ:\
  \ Which of the following is an assumption in Maslow's hierarchy of needs?\n(A) Needs\
  \ are dependent on culture and also on social class. (B) Lower-level needs must\
  \ be at least partially satisfied before higher needs can affect behaviour. (C)\
  \ Needs are not prioritized or arranged in any particular order. (D) Satisfied needs\
  \ are motivators, and new needs emerge when current needs remain unmet.\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on marketing for help. Maslow's\
  \ hierarchy of needs, from the bottom upwards, are physiological (food and clothing),\
  \ safety, love and belonging needs, esteem, and self-actualization. Lower-level\
  \ needs must be at least partially satisfied before higher ones can affect behavior.\
  \ The answer is (B).\n\nQ: _____________ is a natural outcome when combining demographic\
  \ and geographic variables.\n(A) Geodemographics (B) Product differentiation. (C)\
  \ ANSOFF matrix. (D) Brand management.\nA: Let's think step by step. We refer to\
  \ Wikipedia articles on marketing for help. Geodemographics is a natural outcome\
  \ when combining demographic and geographic variables. The answer is (A)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_marketing"
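
The `group` key added in each new file version assigns the task to a subject subgroup. As a rough illustration only (this is not lm-evaluation-harness code, and `TASK_GROUPS` and `tasks_by_group` are hypothetical names), the task-to-group pairs visible in the four complete files above can be collected and inverted like this:

```python
from collections import defaultdict

# Task -> group pairs copied from the "task" and "group" keys in the diffs above.
TASK_GROUPS = {
    "mmlu_flan_cot_fewshot_logical_fallacies": "mmlu_flan_cot_fewshot_humanities",
    "mmlu_flan_cot_fewshot_machine_learning": "mmlu_flan_cot_fewshot_stem",
    "mmlu_flan_cot_fewshot_management": "mmlu_flan_cot_fewshot_other",
    "mmlu_flan_cot_fewshot_marketing": "mmlu_flan_cot_fewshot_other",
}


def tasks_by_group(task_groups):
    """Invert a task -> group mapping into group -> sorted list of tasks."""
    grouped = defaultdict(list)
    for task, group in task_groups.items():
        grouped[group].append(task)
    return {group: sorted(tasks) for group, tasks in grouped.items()}


if __name__ == "__main__":
    for group, tasks in sorted(tasks_by_group(TASK_GROUPS).items()):
        print(group)
        for task in tasks:
            print("  " + task)
```

Only the groups shown on this page are listed; the remaining changed files are not expanded here, so their group assignments are left out rather than guessed.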
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_medical_genetics.yaml
dataset_name: medical_genetics
description: 'The following are multiple choice questions (with answers) about medical genetics.


  Q: The stage of meiosis in which chromosomes pair and cross over is:

  (A) prophase I (B) metaphase I (C) prophase II (D) metaphase II

  A: Let''s think step by step. We refer to Wikipedia articles on medical genetics for help. Prophase I is the stage of meiosis where homologous chromosomes pair with each other and exchange genetic material. The answer is (A).


  Q: DNA ligase is

  (A) an enzyme that joins fragments in normal DNA replication (B) an enzyme of bacterial origin which cuts DNA at defined base sequences (C) an enzyme that facilitates transcription of specific genes (D) an enzyme which limits the level to which a particular nutrient reaches

  A: Let''s think step by step. We refer to Wikipedia articles on medical genetics for help. DNA ligase is a type of enzyme (EC 6.5.1.1) responsible for joining DNA strands together by catalyzing a phosphodiester bond. The answer is (A).


  Q: Which of the following conditions does not show multifactorial inheritance?

  (A) Pyloric stenosis (B) Schizophrenia (C) Spina bifida (neural tube defects) (D) Marfan syndrome

  A: Let''s think step by step. We refer to Wikipedia articles on medical genetics for help. Multifactorial inheritance is when more than a single factor is responsible for causing a given trait or health problem. Genes cannot be the only factor. Marfan syndrome, on the other hand, requires only one abnormal copy of the of the Marfan gene, from one parent, to inherit the trait. The answer is (D).


  Q: A gene showing codominance

  (A) has both alleles independently expressed in the heterozygote (B) has one allele dominant to the other (C) has alleles tightly linked on the same chromosome (D) has alleles expressed at the same time in development

  A: Let''s think step by step. We refer to Wikipedia articles on medical genetics for help. Codominance, as it relates to genetics, refers to a type of genetic inheritance where the phenotype of both the parents is easily observed in the offspring. A heterozygote is an individual having two different alleles of a gene. The answer is (A).


  Q: Large triplet repeat expansions can be detected by:

  (A) polymerase chain reaction. (B) single strand conformational polymorphism analysis. (C) Southern blotting. (D) Western blotting.

  A: Let''s think step by step. We refer to Wikipedia articles on medical genetics for help. A Southern blot is a method in molecular biology for detecting specific DNA sequences in a sample. Large triplet repeat expansions are usually detected with this method. The answer is (C).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_medical_genetics
"dataset_name": "medical_genetics"
"description": "The following are multiple choice questions (with answers) about medical\
  \ genetics.\n\nQ: The stage of meiosis in which chromosomes pair and cross over\
  \ is:\n(A) prophase I (B) metaphase I (C) prophase II (D) metaphase II\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on medical genetics for help.\
  \ Prophase I is the stage of meiosis where homologous chromosomes pair with each\
  \ other and exchange genetic material. The answer is (A).\n\nQ: DNA ligase is\n\
  (A) an enzyme that joins fragments in normal DNA replication (B) an enzyme of bacterial\
  \ origin which cuts DNA at defined base sequences (C) an enzyme that facilitates\
  \ transcription of specific genes (D) an enzyme which limits the level to which\
  \ a particular nutrient reaches\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on medical genetics for help. DNA ligase is a type of enzyme (EC 6.5.1.1)\
  \ responsible for joining DNA strands together by catalyzing a phosphodiester bond.\
  \ The answer is (A).\n\nQ: Which of the following conditions does not show multifactorial\
  \ inheritance?\n(A) Pyloric stenosis (B) Schizophrenia (C) Spina bifida (neural\
  \ tube defects) (D) Marfan syndrome\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on medical genetics for help. Multifactorial inheritance is when more\
  \ than a single factor is responsible for causing a given trait or health problem.\
  \ Genes cannot be the only factor. Marfan syndrome, on the other hand, requires\
  \ only one abnormal copy of the of the Marfan gene, from one parent, to inherit\
  \ the trait. The answer is (D).\n\nQ: A gene showing codominance\n(A) has both alleles\
  \ independently expressed in the heterozygote (B) has one allele dominant to the\
  \ other (C) has alleles tightly linked on the same chromosome (D) has alleles expressed\
  \ at the same time in development\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on medical genetics for help. Codominance, as it relates to genetics,\
  \ refers to a type of genetic inheritance where the phenotype of both the parents\
  \ is easily observed in the offspring. A heterozygote is an individual having two\
  \ different alleles of a gene. The answer is (A).\n\nQ: Large triplet repeat expansions\
  \ can be detected by:\n(A) polymerase chain reaction. (B) single strand conformational\
  \ polymorphism analysis. (C) Southern blotting. (D) Western blotting.\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on medical genetics for help.\
  \ A Southern blot is a method in molecular biology for detecting specific DNA sequences\
  \ in a sample. Large triplet repeat expansions are usually detected with this method.\
  \ The answer is (C)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_medical_genetics"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_miscellaneous.yaml
dataset_name: miscellaneous
description: 'The following are multiple choice questions (with answers) about miscellaneous.


  Q: Which of these songs was a Top 10 hit for the rock band The Police?

  (A) ''Radio Ga-Ga'' (B) ''Ob-la-di Ob-la-da'' (C) ''De Do Do Do De Da Da Da'' (D)
  ''In-a-Gadda-Da-Vida''

  A: Let''s think step by step. We refer to Wikipedia for help. Radio Ga-Ga is by Queen.
  Ob-la-di Ob-la-da is by The Beatles. And In-a-Gadda-Da-Vida is by Iron Butterfly.
  Leaving ''De Do Do Do De Da Da Da'' as the only song by The Police, and also a Top
  10 hit. The answer is (C).


  Q: What place is named in the title of the 1979 live album by rock legends Cheap
  Trick?

  (A) Budapest (B) Budokan (C) Bhutan (D) Britain

  A: Let''s think step by step. We refer to Wikipedia for help. Nippon Budokan is an
  indoor arena in Tokyo, Japan renowned for hosting rock music concerts including Cheap
  Trick in 1978. ''Cheap Trick at Budokan'' became the name of their album. The answer
  is (B).


  Q: What is produced during photosynthesis?

  (A) hydrogen (B) nylon (C) oxygen (D) light

  A: Let''s think step by step. We refer to Wikipedia for help. Photosynthesis is the
  process in which green plants use the green pigment chlorophyll to synthesize foods
  with water and carbon dioxide. Oxygen is the byproduct of this process. The answer
  is (C).


  Q: Who is the shortest man to ever win an NBA slam dunk competition?

  (A) Anthony ''Spud'' Webb (B) Michael ''Air'' Jordan (C) Tyrone ''Muggsy'' Bogues
  (D) Julius ''Dr J'' Erving

  A: Let''s think step by step. We refer to Wikipedia for help. In 1986, Spud Webb,
  standing only 5''7" became the shortest NBA player in history to win an official
  slam dunk contest. The answer is (A).


  Q: How many axles does a standard automobile have?

  (A) one (B) two (C) four (D) eight

  A: Let''s think step by step. We refer to Wikipedia for help. Most cars have two
  axles to rotate the wheels.. The answer is (B).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_miscellaneous
"dataset_name": "miscellaneous"
"description": "The following are multiple choice questions (with answers) about miscellaneous.\n\
  \nQ: Which of these songs was a Top 10 hit for the rock band The Police?\n(A) 'Radio\
  \ Ga-Ga' (B) 'Ob-la-di Ob-la-da' (C) 'De Do Do Do De Da Da Da' (D) 'In-a-Gadda-Da-Vida'\n\
  A: Let's think step by step. We refer to Wikipedia for help. Radio Ga-Ga is by Queen.\
  \ Ob-la-di Ob-la-da is by The Beatles. And In-a-Gadda-Da-Vida is by Iron Butterfly.\
  \ Leaving 'De Do Do Do De Da Da Da' as the only song by The Police, and also a Top\
  \ 10 hit. The answer is (C).\n\nQ: What place is named in the title of the 1979\
  \ live album by rock legends Cheap Trick?\n(A) Budapest (B) Budokan (C) Bhutan (D)\
  \ Britain\nA: Let's think step by step. We refer to Wikipedia for help. Nippon Budokan\
  \ is an indoor arena in Tokyo, Japan renowned for hosting rock music concerts including\
  \ Cheap Trick in 1978. 'Cheap Trick at Budokan' became the name of their album.\
  \ The answer is (B).\n\nQ: What is produced during photosynthesis?\n(A) hydrogen\
  \ (B) nylon (C) oxygen (D) light\nA: Let's think step by step. We refer to Wikipedia\
  \ for help. Photosynthesis is the process in which green plants use the green pigment\
  \ chlorophyll to synthesize foods with water and carbon dioxide. Oxygen is the byproduct\
  \ of this process. The answer is (C).\n\nQ: Who is the shortest man to ever win\
  \ an NBA slam dunk competition?\n(A) Anthony 'Spud' Webb (B) Michael 'Air' Jordan\
  \ (C) Tyrone 'Muggsy' Bogues (D) Julius 'Dr J' Erving\nA: Let's think step by step.\
  \ We refer to Wikipedia for help. In 1986, Spud Webb, standing only 5'7\" became\
  \ the shortest NBA player in history to win an official slam dunk contest. The answer\
  \ is (A).\n\nQ: How many axles does a standard automobile have?\n(A) one (B) two\
  \ (C) four (D) eight\nA: Let's think step by step. We refer to Wikipedia for help.\
  \ Most cars have two axles to rotate the wheels.. The answer is (B)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_miscellaneous"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_disputes.yaml
dataset_name: moral_disputes
description: 'The following are multiple choice questions (with answers) about moral
  disputes.


  Q: Baron admits that the versions of the ticking bomb hypothetical she discusses
  are "stunningly stupid," but she claims this is actually evidence of

  (A) the stupidity of most traditional philosophical examples. (B) a general lack
  of intelligence among people with advanced degrees. (C) the wrongness of torture.
  (D) the readiness on the part of many intelligent people to see torture as the best
  solution to deal with terrorism.

  A: Let''s think step by step. We refer to Wikipedia articles on moral disputes for
  help. The ticking bomb hypothetical poses a problem where many people will die to
  an exploding bomb, if the hypothetical terrorist does not disclose how to defuse
  it. Baron sees this hypothetical as silly, but its prevalence does suggest intelligent
  people, particularly utilitarians, see torture as justifiable to save the lives in
  this scenario. The answer is (D).


  Q: A fertilized ovum is also known as

  (A) a zygote. (B) an embryo. (C) a viability. (D) a blastocyst.

  A: Let''s think step by step. We refer to Wikipedia articles on moral disputes for
  help. Once a single sperm penetrates the layers of an egg to form a new cell, that
  cell is called a zygote. The answer is (A).


  Q: Pence compares six different cases of reproduction, from natural twinning to SCNT.
  What conclusion does he draw from this comparison?

  (A) SCNT is not a different kind of reproduction because there are no morally relevant
  differences between it and other permissible means of reproduction. (B) Because there
  is a low risk of harm for natural twinning, there will be a low risk of harm for
  SCNT. (C) both A and B (D) neither A nor B

  A: Let''s think step by step. We refer to Wikipedia articles on moral disputes for
  help. SCNT, known as ''somatic cell nuclear transfer'', is a technique in which the
  nucleus of a somatic (body) cell is transferred to the cytoplasm of an enucleated
  egg. This is used in farm animal cloning. Pence may compare this to other processes
  of reproduction, and see little moral separation. The answer is (A).


  Q: Which of the following is an intrinsic good, according to Singer?

  (A) being in the upper class (B) the ability to learn and solve problems (C) being
  more beautiful than most of the population (D) being shorter than average

  A: Let''s think step by step. We refer to Wikipedia articles on moral disputes for
  help. Bioethicist Peter Singer sees intrinsic value as innate values conferred by
  oneself, for oneself. Innanimate objects can be beautiful, short, or have some valuable
  criteria, but capabilities are intrinsically good. The answer is (B).


  Q: According to Metz, what is wrong with consequentialist arguments against capital
  punishment based on African values?

  (A) It is unclear as of yet whether or not capital punishment deters harm to the
  community. (B) It is unclear as of yet whether or not capital punishment deters harm
  to any individuals. (C) Consequentialism is not supported by African values. (D)
  Even though consequentialism is supported by African values, no consequentialist
  arguments framed in terms of African values have been offered.

  A: Let''s think step by step. We refer to Wikipedia articles on moral disputes for
  help. Thaddeus Metz is a humanities research professor focusing on ethical philosophy
  in South Africa. Metz has written the death penalty is unjustified as it treats individuals
  as incapable of communal relations. It is unclear that capital punishment is to the
  benefit of, or a deterrent of harm to the community. The answer is (A).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_moral_disputes
"dataset_name": "moral_disputes"
"description": "The following are multiple choice questions (with answers) about moral\
  \ disputes.\n\nQ: Baron admits that the versions of the ticking bomb hypothetical\
  \ she discusses are \"stunningly stupid,\" but she claims this is actually evidence\
  \ of\n(A) the stupidity of most traditional philosophical examples. (B) a general\
  \ lack of intelligence among people with advanced degrees. (C) the wrongness of\
  \ torture. (D) the readiness on the part of many intelligent people to see torture\
  \ as the best solution to deal with terrorism.\nA: Let's think step by step. We\
  \ refer to Wikipedia articles on moral disputes for help. The ticking bomb hypothetical\
  \ poses a problem where many people will die to an exploding bomb, if the hypothetical\
  \ terrorist does not disclose how to defuse it. Baron sees this hypothetical as\
  \ silly, but its prevalence does suggest intelligent people, particularly utilitarians,\
  \ see torture as justifiable to save the lives in this scenario. The answer is (D).\n\
  \nQ: A fertilized ovum is also known as\n(A) a zygote. (B) an embryo. (C) a viability.\
  \ (D) a blastocyst.\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on moral disputes for help. Once a single sperm penetrates the layers of an egg\
  \ to form a new cell, that cell is called a zygote. The answer is (A).\n\nQ: Pence\
  \ compares six different cases of reproduction, from natural twinning to SCNT. What\
  \ conclusion does he draw from this comparison?\n(A) SCNT is not a different kind\
  \ of reproduction because there are no morally relevant differences between it and\
  \ other permissible means of reproduction. (B) Because there is a low risk of harm\
  \ for natural twinning, there will be a low risk of harm for SCNT. (C) both A and\
  \ B (D) neither A nor B\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on moral disputes for help. SCNT, known as 'somatic cell nuclear transfer', is\
  \ a technique in which the nucleus of a somatic (body) cell is transferred to the\
  \ cytoplasm of an enucleated egg. This is used in farm animal cloning. Pence may\
  \ compare this to other processes of reproduction, and see little moral separation.\
  \ The answer is (A).\n\nQ: Which of the following is an intrinsic good, according\
  \ to Singer?\n(A) being in the upper class (B) the ability to learn and solve problems\
  \ (C) being more beautiful than most of the population (D) being shorter than average\n\
  A: Let's think step by step. We refer to Wikipedia articles on moral disputes for\
  \ help. Bioethicist Peter Singer sees intrinsic value as innate values conferred\
  \ by oneself, for oneself. Innanimate objects can be beautiful, short, or have some\
  \ valuable criteria, but capabilities are intrinsically good. The answer is (B).\n\
  \nQ: According to Metz, what is wrong with consequentialist arguments against capital\
  \ punishment based on African values?\n(A) It is unclear as of yet whether or not\
  \ capital punishment deters harm to the community. (B) It is unclear as of yet whether\
  \ or not capital punishment deters harm to any individuals. (C) Consequentialism\
  \ is not supported by African values. (D) Even though consequentialism is supported\
  \ by African values, no consequentialist arguments framed in terms of African values\
  \ have been offered.\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on moral disputes for help. Thaddeus Metz is a humanities research professor focusing\
  \ on ethical philosophy in South Africa. Metz has written the death penalty is unjustified\
  \ as it treats individuals as incapable of communal relations. It is unclear that\
  \ capital punishment is to the benefit of, or a deterrent of harm to the community.\
  \ The answer is (A)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_moral_disputes"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_moral_scenarios.yaml
dataset_name: moral_scenarios
description: 'The following are multiple choice questions (with answers) about moral
  scenarios.


  Q: For which of these two scenarios does the main character (who uses I/me/my) do
  something clearly morally wrong, according to ordinary moral standards in the US
  as of 2020? Scenario 1 | I took his wallet after he harassed me at the bar. Scenario
  2 | I made love to my newborn baby boy.

  (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong

  A: Let''s think step by step. We refer to Wikipedia articles on moral scenarios for
  help. Stealing from someone is not considered an ethical response to harrassment.
  And having sex with a minor is egregiously unethical. The answer is (A).


  Q: For which of these two scenarios does the main character (who uses I/me/my) do
  something clearly morally wrong, according to ordinary moral standards in the US
  as of 2020? Scenario 1 | I asked for a cigarette from the guy outside the store.
  Scenario 2 | When the security guard left his post, I pocketed a rare coin.

  (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong

  A: Let''s think step by step. We refer to Wikipedia articles on moral scenarios for
  help. Asking for something is not morally wrong. Stealing is morally wrong. The answer
  is (C).


  Q: For which of these two scenarios does the main character (who uses I/me/my) do
  something clearly morally wrong, according to ordinary moral standards in the US
  as of 2020? Scenario 1 | During the house''s demolition, I pushed him into a wall.
  Scenario 2 | When I choose a babysitter for my twin nieces, I don''t care about her
  racial background.

  (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong

  A: Let''s think step by step. We refer to Wikipedia articles on moral scenarios for
  help. Physical violence is morally wrong. Being open to hiring people from different
  racial backgrounds is an expected moral stance. The answer is (B).


  Q: For which of these two scenarios does the main character (who uses I/me/my) do
  something clearly morally wrong, according to ordinary moral standards in the US
  as of 2020? Scenario 1 | I bought an old pick-up truck for work. Scenario 2 | I looked
  up at the gorgeous waterfall and threw a rock at it.

  (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong

  A: Let''s think step by step. We refer to Wikipedia articles on moral scenarios for
  help. Neither of these are wrong as no one is wronged by buing a pick-up truck or
  throwing a rock into a stream. The answer is (D).


  Q: For which of these two scenarios does the main character (who uses I/me/my) do
  something clearly morally wrong, according to ordinary moral standards in the US
  as of 2020? Scenario 1 | I really loved my newborn baby boy. Scenario 2 | The new
  girl is embarrassed to live in a trailer park, so I invite the class to her house.

  (A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong

  A: Let''s think step by step. We refer to Wikipedia articles on moral scenarios for
  help. Loving someone is not wrong. However, exposing something that someone is embarrassed
  about could be considered quite mean. The answer is (C).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_moral_scenarios
"dataset_name": "moral_scenarios"
"description": "The following are multiple choice questions (with answers) about moral\
  \ scenarios.\n\nQ: For which of these two scenarios does the main character (who\
  \ uses I/me/my) do something clearly morally wrong, according to ordinary moral\
  \ standards in the US as of 2020? Scenario 1 | I took his wallet after he harassed\
  \ me at the bar. Scenario 2 | I made love to my newborn baby boy.\n(A) Wrong, Wrong\
  \ (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong\nA: Let's think\
  \ step by step. We refer to Wikipedia articles on moral scenarios for help. Stealing\
  \ from someone is not considered an ethical response to harrassment. And having\
  \ sex with a minor is egregiously unethical. The answer is (A).\n\nQ: For which\
  \ of these two scenarios does the main character (who uses I/me/my) do something\
  \ clearly morally wrong, according to ordinary moral standards in the US as of 2020?\
  \ Scenario 1 | I asked for a cigarette from the guy outside the store. Scenario\
  \ 2 | When the security guard left his post, I pocketed a rare coin.\n(A) Wrong,\
  \ Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong, Not wrong\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on moral scenarios for help.\
  \ Asking for something is not morally wrong. Stealing is morally wrong. The answer\
  \ is (C).\n\nQ: For which of these two scenarios does the main character (who uses\
  \ I/me/my) do something clearly morally wrong, according to ordinary moral standards\
  \ in the US as of 2020? Scenario 1 | During the house's demolition, I pushed him\
  \ into a wall. Scenario 2 | When I choose a babysitter for my twin nieces, I don't\
  \ care about her racial background.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not\
  \ wrong, Wrong (D) Not wrong, Not wrong\nA: Let's think step by step. We refer to\
  \ Wikipedia articles on moral scenarios for help. Physical violence is morally wrong.\
  \ Being open to hiring people from different racial backgrounds is an expected moral\
  \ stance. The answer is (B).\n\nQ: For which of these two scenarios does the main\
  \ character (who uses I/me/my) do something clearly morally wrong, according to\
  \ ordinary moral standards in the US as of 2020? Scenario 1 | I bought an old pick-up\
  \ truck for work. Scenario 2 | I looked up at the gorgeous waterfall and threw a\
  \ rock at it.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not\
  \ wrong, Not wrong\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on moral scenarios for help. Neither of these are wrong as no one is wronged by\
  \ buing a pick-up truck or throwing a rock into a stream. The answer is (D).\n\n\
  Q: For which of these two scenarios does the main character (who uses I/me/my) do\
  \ something clearly morally wrong, according to ordinary moral standards in the\
  \ US as of 2020? Scenario 1 | I really loved my newborn baby boy. Scenario 2 | The\
  \ new girl is embarrassed to live in a trailer park, so I invite the class to her\
  \ house.\n(A) Wrong, Wrong (B) Wrong, Not wrong (C) Not wrong, Wrong (D) Not wrong,\
  \ Not wrong\nA: Let's think step by step. We refer to Wikipedia articles on moral\
  \ scenarios for help. Loving someone is not wrong. However, exposing something that\
  \ someone is embarrassed about could be considered quite mean. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_moral_scenarios"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_nutrition.yaml
dataset_name: nutrition
description: 'The following are multiple choice questions (with answers) about nutrition.


  Q: What is the first-line drug for patients with type 2 diabetes and obesity, as
  of 2020?

  (A) Acarbose (B) Metformin (C) Sulphonylureas (D) Insulin

  A: Let''s think step by step. We refer to Wikipedia articles on nutrition for help.
  Metformin (Fortamet, Glumetza, or others) is usually the first medication prescribed
  for type 2 diabetes, as well as obesity. It works by lowering glucose production
  in the liver and improving the body''s sensitivity to insulin. The answer is (B).


  Q: Which of the following statements is correct (according to knowledge in 2020)?

  (A) Consumers with phenylketonuria must avoid the consumption of the sweetener aspartame
  (B) Consumers with phenylketonuria must avoid the consumption of the sweetener saccharin
  (C) Consumers with phenylketonuria must avoid the consumption of the sweetener sucralose
  (D) Consumers with phenylketonuria must avoid the consumption of the sweetener acesulfame
  K

  A: Let''s think step by step. We refer to Wikipedia articles on nutrition for help.
  People with phenylketonuria (PKU) cannot break down the amino acid phenylalanine.
  As it builds up in the blood and brain it can lead to brain damage. People with PKU
  should avoid foods that are converted to phenylalanine in the body, such as aspartame.
  The answer is (A).


  Q: Which of the following statements about iodine is correct, as of 2020?

  (A) 50% of adults consume iodine at levels below the RNI (B) Dairy products are a
  poor source of iodine (C) The iodine content of organic milk is generally lower that
  the level in non-organic milk (D) UK dietary reference values recommend an increase
  in iodine intake in pregnancy

  A: Let''s think step by step. We refer to Wikipedia articles on nutrition for help.
  Organic milk usually has less iodine content than non-organic milk. The answer is
  (C).


  Q: Which of the following is the most plausible explanation for the protective effect
  of dietary fibre against cancer of the colon, as of 2020?

  (A) Propionic acid, formed during colonic fibre fermentation inhibits liver fatty
  acid synthesis (B) Butyric acid, formed during colonic fibre fermentation stimulates
  "silencing" of the SLC5A8 tumour suppressor gene (C) None of these options are correct
  (D) Butyric acid, formed during colonic fibre fermentation stimulates anti-oxidant
  defences in the colon

  A: Let''s think step by step. We refer to Wikipedia articles on nutrition for help.
  Dietary fibre is inversely proportional to the risk of colorectal cancer. This is
  presumed because butyric acid (BA) stimulates antioxidants which help protect the
  colon from cancerous tumors. The answer is (D).


  Q: In a cohort study, the risk ratio of developing diabetes was 0.86 when comparing
  consumers of tea (the exposed) to those who did not drink tea (the unexposed). Which
  one statement is correct (according to knowledge in 2020)?

  (A) The tea drinkers have lower risk of developing diabetes. (B) The tea drinkers
  have higher risk of developing diabetes. (C) Based on the information given we cannot
  tell if the observed difference in disease risk is the result of chance. (D) The
  risk ratio is close to the value one, so there is no difference in disease risk between
  the two groups.

  A: Let''s think step by step. We refer to Wikipedia articles on nutrition for help.
  The risk ratio is not sufficiently reduced that it could not be explained by random
  chance given the studies sample size. The answer is (C).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_nutrition
"dataset_name": "nutrition"
"description": "The following are multiple choice questions (with answers) about nutrition.\n\
  \nQ: What is the first-line drug for patients with type 2 diabetes and obesity,\
  \ as of 2020?\n(A) Acarbose (B) Metformin (C) Sulphonylureas (D) Insulin\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on nutrition for help. Metformin\
  \ (Fortamet, Glumetza, or others) is usually the first medication prescribed for\
  \ type 2 diabetes, as well as obesity. It works by lowering glucose production in\
  \ the liver and improving the body's sensitivity to insulin. The answer is (B).\n\
  \nQ: Which of the following statements is correct (according to knowledge in 2020)?\n\
  (A) Consumers with phenylketonuria must avoid the consumption of the sweetener aspartame\
  \ (B) Consumers with phenylketonuria must avoid the consumption of the sweetener\
  \ saccharin (C) Consumers with phenylketonuria must avoid the consumption of the\
  \ sweetener sucralose (D) Consumers with phenylketonuria must avoid the consumption\
  \ of the sweetener acesulfame K\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on nutrition for help. People with phenylketonuria (PKU) cannot break\
  \ down the amino acid phenylalanine. As it builds up in the blood and brain it can\
  \ lead to brain damage. People with PKU should avoid foods that are converted to\
  \ phenylalanine in the body, such as aspartame. The answer is (A).\n\nQ: Which of\
  \ the following statements about iodine is correct, as of 2020?\n(A) 50% of adults\
  \ consume iodine at levels below the RNI (B) Dairy products are a poor source of\
  \ iodine (C) The iodine content of organic milk is generally lower that the level\
  \ in non-organic milk (D) UK dietary reference values recommend an increase in iodine\
  \ intake in pregnancy\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on nutrition for help. Organic milk usually has less iodine content than non-organic\
  \ milk. The answer is (C).\n\nQ: Which of the following is the most plausible explanation\
  \ for the protective effect of dietary fibre against cancer of the colon, as of\
  \ 2020?\n(A) Propionic acid, formed during colonic fibre fermentation inhibits liver\
  \ fatty acid synthesis (B) Butyric acid, formed during colonic fibre fermentation\
  \ stimulates \"silencing\" of the SLC5A8 tumour suppressor gene (C) None of these\
  \ options are correct (D) Butyric acid, formed during colonic fibre fermentation\
  \ stimulates anti-oxidant defences in the colon\nA: Let's think step by step. We\
  \ refer to Wikipedia articles on nutrition for help. Dietary fibre is inversely\
  \ proportional to the risk of colorectal cancer. This is presumed because butyric\
  \ acid (BA) stimulates antioxidants which help protect the colon from cancerous\
  \ tumors. The answer is (D).\n\nQ: In a cohort study, the risk ratio of developing\
  \ diabetes was 0.86 when comparing consumers of tea (the exposed) to those who did\
  \ not drink tea (the unexposed). Which one statement is correct (according to knowledge\
  \ in 2020)?\n(A) The tea drinkers have lower risk of developing diabetes. (B) The\
  \ tea drinkers have higher risk of developing diabetes. (C) Based on the information\
  \ given we cannot tell if the observed difference in disease risk is the result\
  \ of chance. (D) The risk ratio is close to the value one, so there is no difference\
  \ in disease risk between the two groups.\nA: Let's think step by step. We refer\
  \ to Wikipedia articles on nutrition for help. The risk ratio is not sufficiently\
  \ reduced that it could not be explained by random chance given the studies sample\
  \ size. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_nutrition"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_philosophy.yaml
dataset_name: philosophy
description: "The following are multiple choice questions (with answers) about philosophy.\n\
"dataset_name": "philosophy"
"description": "The following are multiple choice questions (with answers) about philosophy.\n\
  \nQ: The study of reality in the broadest sense, an inquiry into the elemental nature\
  \ of the universe and the things in it, is known as _____.\n(A) metaphysics (B)\
  \ epistemology (C) quantum physics (D) axiology\nA: Let's think step by step. We\
  \ refer to Wikipedia articles on philosophy for help. Among the options, only metaphysics\
  \ studies the nature of reality and existence. The answer is (A).\n\nQ: According\
  \ to Moore\u2019s \u201Cideal utilitarianism,\u201D the right action is the one\
  \ that brings about the greatest amount of:\n(A) pleasure. (B) happiness. (C) good.\
  \ (D) virtue.\nA: Let's think step by step. We refer to Wikipedia articles on philosophy\
  \ for help. Moore's \"ideal utilitarianism\" states that one's actions should maximize\
  \ intrinsic goods. The answer is (C).\n\nQ: Before Tolstoy's Christian conversion,\
  \ what was his perspective on the meaning of life?\n(A) optimist (B) satisfied (C)\
  \ nominally religious (D) pessimist\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on philosophy for help. Before his conversion, Tolstoy feels that life\
  \ was uncertain, which is a pessimist's point of view. The answer is (D).\n\nQ:\
  \ According to d'Holbach, people always act according to _____.\n(A) free choices\
  \ (B) dictates of the soul (C) necessary natural laws (D) undetermined will\nA:\
  \ Let's think step by step. We refer to Wikipedia articles on philosophy for help.\
  \ d'Holbach believes that people act according to necessary laws, and it proves\
  \ nothing about people's free will. The answer is (C).\n\nQ: Psychological egoism\
  \ is:\n(A) an ethical theory about how we ought to behave. (B) a generalization\
  \ concerning the way people tend to behave. (C) a claim about human nature and the\
  \ ways people are capable of behaving. (D) none of the above.\nA: Let's think step\
  \ by step. We refer to Wikipedia articles on philosophy for help. Psychological\
  \ egoism suggests that one behaves based on what makes one feels good, hence it\
  \ is a claim about human nature and how humans are capable of behaving. The answer\
  \ is (C)."
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_philosophy
\
to
Moore
’s
“
ideal
utilitarianism,
”
the
right
action
is
the
one
that
brings
about
\
\
the
greatest
amount
of:
\n
(A)
pleasure.
(B)
happiness.
(C)
good.
(D)
virtue.
\n
\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
philosophy
for
help.
\
\
Moore's
\"
ideal
utilitarianism
\"
states
that
one's
actions
should
maximize
intrinsic
\
\
goods.
The
answer
is
(C).
\n\n
Q:
Before
Tolstoy's
Christian
conversion,
what
was
\
\
his
perspective
on
the
meaning
of
life?
\n
(A)
optimist
(B)
satisfied
(C)
nominally
\
\
religious
(D)
pessimist
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
\
\
on
philosophy
for
help.
Before
his
conversion,
Tolstoy
feels
that
life
was
uncertain,
\
\
which
is
a
pessimist's
point
of
view.
The
answer
is
(D).
\n\n
Q:
According
to
d'Holbach,
\
\
people
always
act
according
to
_____.
\n
(A)
free
choices
(B)
dictates
of
the
soul
\
\
(C)
necessary
natural
laws
(D)
undetermined
will
\n
A:
Let's
think
step
by
step.
\
\
We
refer
to
Wikipedia
articles
on
philosophy
for
help.
d'Holbach
believes
that
\
\
people
act
according
to
necessary
laws,
and
it
proves
nothing
about
people's
free
\
\
will.
The
answer
is
(C).
\n\n
Q:
Psychological
egoism
is:
\n
(A)
an
ethical
theory
\
\
about
how
we
ought
to
behave.
(B)
a
generalization
concerning
the
way
people
tend
\
\
to
behave.
(C)
a
claim
about
human
nature
and
the
ways
people
are
capable
of
behaving.
\
\
(D)
none
of
the
above.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
\
\
on
philosophy
for
help.
Psychological
egoism
suggests
that
one
behaves
based
on
\
\
what
makes
one
feels
good,
hence
it
is
a
claim
about
human
nature
and
how
humans
\
\
are
capable
of
behaving.
The
answer
is
(C)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_philosophy"
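The hunks on this page add a `group` key to each per-subject config, placing the task in a subgroup (`other`, `humanities`, `social_sciences`) of the `mmlu_flan_cot_fewshot` variant. The following is a minimal sketch of how the non-description keys could be generated from a subject name. `SUBJECT_GROUPS`, `subject_config`, and `render` are hypothetical names used for illustration only and are not part of lm-evaluation-harness; the few-shot `description` text is deliberately left out.

```python
import json

# Subject -> subgroup pairs as they appear in the diff hunks on this page.
SUBJECT_GROUPS = {
    "nutrition": "other",
    "philosophy": "humanities",
    "prehistory": "humanities",
    "professional_accounting": "other",
    "professional_law": "humanities",
    "professional_medicine": "other",
    "professional_psychology": "social_sciences",
    "public_relations": "social_sciences",
    "security_studies": "social_sciences",
}

PREFIX = "mmlu_flan_cot_fewshot"


def subject_config(subject):
    """Build the non-description keys of one per-subject config (hypothetical helper)."""
    return {
        "dataset_name": subject,
        "group": f"{PREFIX}_{SUBJECT_GROUPS[subject]}",
        "include": f"_{PREFIX}_template_yaml",
        "task": f"{PREFIX}_{subject}",
    }


def render(config):
    """Emit one double-quoted "key": "value" pair per line, like the new files."""
    return "\n".join(f"{json.dumps(k)}: {json.dumps(v)}" for k, v in config.items())


print(render(subject_config("philosophy")))
```

JSON string quoting is used here only because it resembles the quoted style of the new files; the tool that actually produced them is not shown in this diff.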
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_prehistory.yaml
dataset_name: prehistory
description:
'
The
following
are
multiple
choice
questions
(with
answers)
about
prehistory.
Q:
What
is
the
approximate
mean
cranial
capacity
of
Homo
erectus?
(A)
under
650
cc
(B)
about
800
cc
(C)
just
under
1000
cc
(D)
1200
cc
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
The
average
cranium
capacity
of
Homo
erectus
is
less
than
1000
cubic
cm.
The
answer
is
(C).
Q:
According
to
Timothy
Pauketat,
the
evidence
for
social
stratification
and
political
power
at
Cahokia
suggests:
(A)
a
center
of
Mississippian
civilization
with
conditions
similar
to
the
rise
of
early
states.
(B)
the
limitations
of
authority
in
a
Native
American
society
of
egalitarian
foragers.
(C)
a
simple
chiefdom
or
perhaps
a
complex
chiefdom
had
evolved
by
A.D.
1500.
(D)
a
center
of
Mississippian
civilization
with
conditions
similar
to
societies
on
the
Northwest
Coast
of
North
America.
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
Timothy
Pauketat
is
known
for
his
research
on
Cahokia,
the
center
of
the
Mississippian
culture,
where
he
found
similar
conditions
to
the
rise
of
early
states.
The
answer
is
(A).
Q:
Recent
research
on
hominid
species
dating
from
the
Middle
Pliocene
indicates
there
was
(as
of
2020):
(A)
a
great
amount
of
species
diversity,
or
a
single
species
that
exhibited
a
lot
of
diversity.
(B)
very
little
species
diversity
during
this
period
and
very
few
hominids.
(C)
decreased
species
diversity
due
to
a
prolonged
ice
age
followed
by
a
severe
drought.
(D)
decreased
species
diversity
but
increased
numbers
of
hammerstones
and
flakes,
indicating
stone
tool
manufacture.
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
Recent
research
has
recognized
multiple
hominid
species
from
the
Middle
Pliocene,
meaning
that
there
is
a
great
amount
of
species
diversity
or
diversity
in
a
single
species.
The
answer
is
(A).
Q:
Researchers
now
believe
that
the
decline
of
the
Maya
was
caused
chiefly
by:
(A)
a
cataclysm
of
some
kind,
such
as
an
earthquake,
volcano,
or
tsunami.
(B)
ecological
degradation
resulting
from
slash-and-burn
farming
techniques.
(C)
endless
wars
between
neighboring
Mayan
city-states.
(D)
practices
of
interbreeding
that
led
to
a
steep
rise
in
congenital
disorders.
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
Researchers
believe
that
the
Maya
collapse
was
mainly
caused
by
over-exploitation
of
natural
resources
like
the
slash-and-burn
farming
techniques.
The
answer
is
(B).
Q:
The
great
Mayan
king
Pacal
built
temples
in
the
city
of
Palenque
in
order
to:
(A)
satisfy
the
powerful
Mayan
astronomer
priests.
(B)
display
his
generosity
to
the
common
people,
since
they
were
allowed
to
live
in
the
temples.
(C)
frighten
away
enemies,
in
particular
the
Spaniards.
(D)
legitimize
his
kingship,
since
his
father
was
not
royal.
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
Pacal
built
the
temples
as
the
funerary
monument
to
legitimize
his
kingship.
The
answer
is
(D).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_prehistory
"dataset_name": "prehistory"
"description":
"
The
following
are
multiple
choice
questions
(with
answers)
about
prehistory.
\n\
\n
Q:
What
is
the
approximate
mean
cranial
capacity
of
Homo
erectus?
\n
(A)
under
650
\
\
cc
(B)
about
800
cc
(C)
just
under
1000
cc
(D)
1200
cc
\n
A:
Let's
think
step
by
\
\
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
The
average
cranium
\
\
capacity
of
Homo
erectus
is
less
than
1000
cubic
cm.
The
answer
is
(C).
\n\n
Q:
\
\
According
to
Timothy
Pauketat,
the
evidence
for
social
stratification
and
political
\
\
power
at
Cahokia
suggests:
\n
(A)
a
center
of
Mississippian
civilization
with
conditions
\
\
similar
to
the
rise
of
early
states.
(B)
the
limitations
of
authority
in
a
Native
\
\
American
society
of
egalitarian
foragers.
(C)
a
simple
chiefdom
or
perhaps
a
complex
\
\
chiefdom
had
evolved
by
A.D.
1500.
(D)
a
center
of
Mississippian
civilization
\
\
with
conditions
similar
to
societies
on
the
Northwest
Coast
of
North
America.
\n\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
\
\
Timothy
Pauketat
is
known
for
his
research
on
Cahokia,
the
center
of
the
Mississippian
\
\
culture,
where
he
found
similar
conditions
to
the
rise
of
early
states.
The
answer
\
\
is
(A).
\n\n
Q:
Recent
research
on
hominid
species
dating
from
the
Middle
Pliocene
\
\
indicates
there
was
(as
of
2020):
\n
(A)
a
great
amount
of
species
diversity,
or
\
\
a
single
species
that
exhibited
a
lot
of
diversity.
(B)
very
little
species
diversity
\
\
during
this
period
and
very
few
hominids.
(C)
decreased
species
diversity
due
\
\
to
a
prolonged
ice
age
followed
by
a
severe
drought.
(D)
decreased
species
diversity
\
\
but
increased
numbers
of
hammerstones
and
flakes,
indicating
stone
tool
manufacture.
\n\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
\
\
Recent
research
has
recognized
multiple
hominid
species
from
the
Middle
Pliocene,
\
\
meaning
that
there
is
a
great
amount
of
species
diversity
or
diversity
in
a
single
\
\
species.
The
answer
is
(A).
\n\n
Q:
Researchers
now
believe
that
the
decline
of
\
\
the
Maya
was
caused
chiefly
by:
\n
(A)
a
cataclysm
of
some
kind,
such
as
an
earthquake,
\
\
volcano,
or
tsunami.
(B)
ecological
degradation
resulting
from
slash-and-burn
\
\
farming
techniques.
(C)
endless
wars
between
neighboring
Mayan
city-states.
(D)
\
\
practices
of
interbreeding
that
led
to
a
steep
rise
in
congenital
disorders.
\n\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
prehistory
for
help.
\
\
Researchers
believe
that
the
Maya
collapse
was
mainly
caused
by
over-exploitation
\
\
of
natural
resources
like
the
slash-and-burn
farming
techniques.
The
answer
is
\
\
(B).
\n\n
Q:
The
great
Mayan
king
Pacal
built
temples
in
the
city
of
Palenque
in
\
\
order
to:
\n
(A)
satisfy
the
powerful
Mayan
astronomer
priests.
(B)
display
his
\
\
generosity
to
the
common
people,
since
they
were
allowed
to
live
in
the
temples.
\
\
(C)
frighten
away
enemies,
in
particular
the
Spaniards.
(D)
legitimize
his
kingship,
\
\
since
his
father
was
not
royal.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
\
articles
on
prehistory
for
help.
Pacal
built
the
temples
as
the
funerary
monument
\
\
to
legitimize
his
kingship.
The
answer
is
(D)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_prehistory"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_accounting.yaml
dataset_name: professional_accounting
description:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
\
accounting.
\n\n
Q:
An
auditor
traces
the
serial
numbers
on
equipment
to
a
nonissuer
\u2019
\
s
subledger.
Which
of
the
following
management
assertions
is
supported
by
this
test?
\n\
"dataset_name": "professional_accounting"
"description":
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
\
accounting.
\n\n
Q:
An
auditor
traces
the
serial
numbers
on
equipment
to
a
nonissuer
’s
\
\
subledger.
Which
of
the
following
management
assertions
is
supported
by
this
test?
\n\
(A)
Valuation
and
allocation
(B)
Completeness
(C)
Rights
and
obligations
(D)
Presentation
\
\
and
disclosure
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
\
\
accounting
for
help.
The
completeness
assertion
is
tested
by
tracing
supporting
\
...
...
@@ -43,5 +43,6 @@ description: "The following are multiple choice questions (with answers) about p
\
transactions,
only
Proceeds
from
long-term
debt
belongs
to
the
financing
activities
\
\
section
of
cashflow,
hence
the
amount
reported
should
be
$100000.
The
answer
is
\
\
(D)."
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_accounting
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_accounting"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_law.yaml
dataset_name: professional_law
description:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
"dataset_name": "professional_law"
"description":
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
\
law.
\n\n
Q:
A
son
owed
a
creditor
$5,000.
The
son's
father
contacted
the
creditor
\
\
and
told
him
that
he
wanted
to
pay
the
son's
debt.
The
father
signed
a
document
\
\
that
stated
the
father
would
pay
the
son's
debt
at
a
rate
of
$500
a
month
for
\
...
...
@@ -101,5 +101,6 @@ description: "The following are multiple choice questions (with answers) about p
\
Amendment
further
supports
the
First
Amendment
by
establishing
a
due
process
clause.
\
\
Hence
the
strongest
argument
should
be
the
statute
is
overbroad
and
consequently
\
\
invalid
under
the
First
and
Fourteenth
Amendments.
The
answer
is
(D)."
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_law
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_law"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_medicine.yaml
dataset_name: professional_medicine
description:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
"dataset_name": "professional_medicine"
"description":
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
\
medicine.
\n\n
Q:
A
22-year-old
male
marathon
runner
presents
to
the
office
with
\
\
the
complaint
of
right-sided
rib
pain
when
he
runs
long
distances.
Physical
examination
\
\
reveals
normal
heart
and
lung
findings
and
an
exhalation
dysfunction
at
ribs
\_
\
4-5
on
the
right.
Which
of
the
following
muscles
or
muscle
groups
will
be
most
useful
\
\
reveals
normal
heart
and
lung
findings
and
an
exhalation
dysfunction
at
ribs
4-5
\
\
on
the
right.
Which
of
the
following
muscles
or
muscle
groups
will
be
most
useful
\
\
in
correcting
this
dysfunction
utilizing
a
direct
method?
\n
(A)
anterior
scalene
\
\
(B)
latissimus
dorsi
(C)
pectoralis
minor
(D)
quadratus
lumborum
\n
A:
Let's
think
\
\
step
by
step.
We
refer
to
Wikipedia
articles
on
medicine
for
help.
Among
the
options,
\
\
only
pectoralis
minor
muscle
origins
from
the
outer
surfaces
of
the
3rd
to
5th
\
\
ribs.
The
answer
is
(C).
\n\n
Q:
A
36-year-old
male
presents
to
the
office
with
\
\
a
\_
3-week
\_
history
of
low
back
pain.
He
denies
any
recent
trauma
but
says
that
\
\
he
climbs
in
and
out
of
his
truck
numerous
times
a
day
for
his
job.
Examination
\
\
of
the
patient
in
the
prone
position
reveals
a
deep
sacral
sulcus
on
the
left,
\
\
a
posterior
inferior
lateral
angle
on
the
right,
and
a
lumbosacral
junction
that
\
\
a
3-week
history
of
low
back
pain.
He
denies
any
recent
trauma
but
says
that
he
\
\
climbs
in
and
out
of
his
truck
numerous
times
a
day
for
his
job.
Examination
of
\
\
the
patient
in
the
prone
position
reveals
a
deep
sacral
sulcus
on
the
left,
a
\
\
posterior
inferior
lateral
angle
on
the
right,
and
a
lumbosacral
junction
that
\
\
springs
freely
on
compression.
The
most
likely
diagnosis
is
\n
(A)
left-on-left
\
\
sacral
torsion
(B)
left-on-right
sacral
torsion
(C)
right
unilateral
sacral
flexion
\
\
(D)
right-on-right
sacral
torsion
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
...
...
@@ -23,9 +23,9 @@ description: "The following are multiple choice questions (with answers) about p
\
nonproductive
cough,
runny
nose,
and
frontal
headache.
He
says
the
headache
is
\
\
worse
in
the
morning
and
ibuprofen
does
provide
some
relief.
He
has
not
had
shortness
\
\
of
breath.
Medical
history
is
unremarkable.
He
takes
no
medications
other
than
\
\
the
ibuprofen
for
pain.
Vital
signs
are
temperature
37.4
\xB0
C
(99.4
\xB0
F),
pulse
\
\
88/min,
respirations
18/min,
and
blood
pressure
120/84
mm
Hg.
Examination
of
the
\
\
nares
shows
erythematous
mucous
membranes.
Examination
of
the
throat
shows
erythema
\
\
the
ibuprofen
for
pain.
Vital
signs
are
temperature
37.4
°
C
(99.4
°
F),
pulse
88/min,
\
\
respirations
18/min,
and
blood
pressure
120/84
mm
Hg.
Examination
of
the
nares
\
\
shows
erythematous
mucous
membranes.
Examination
of
the
throat
shows
erythema
\
\
and
follicular
lymphoid
hyperplasia
on
the
posterior
oropharynx.
There
is
no
palpable
\
\
cervical
adenopathy.
Lungs
are
clear
to
auscultation.
Which
of
the
following
is
\
\
the
most
likely
cause
of
this
patient's
symptoms?
\n
(A)
Allergic
rhinitis
(B)
Epstein-Barr
\
...
...
@@ -57,13 +57,14 @@ description: "The following are multiple choice questions (with answers) about p
\
A
follow-up
visit
in
the
office
2
weeks
ago
disclosed
elevated
urinary
normetanephrine
\
\
and
metanephrine
and
plasma
aldosterone
concentrations.
The
patient
was
referred
\
\
to
a
surgeon,
who
recommended
the
adrenalectomy.
Today,
vital
signs
are
temperature
\
\
36.6
\xB0
C
(97.9
\xB0
F),
pulse
100/min,
respirations
14/min,
and
blood
pressure
\
\
170/95
mm
Hg.
Physical
examination
discloses
no
significant
findings.
Initial
\
\
preoperative
preparation
should
include
treatment
with
which
of
the
following?
\n\
(A)
Labetalol
(B)
A
loading
dose
of
potassium
chloride
(C)
Nifedipine
(D)
Phenoxybenzamine
\n\
\
36.6
°
C
(97.9
°
F),
pulse
100/min,
respirations
14/min,
and
blood
pressure
170/95
\
\
mm
Hg.
Physical
examination
discloses
no
significant
findings.
Initial
preoperative
\
\
preparation
should
include
treatment
with
which
of
the
following?
\n
(A)
Labetalol
\
\
(B)
A
loading
dose
of
potassium
chloride
(C)
Nifedipine
(D)
Phenoxybenzamine
\n\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
medicine
for
help.
\
\
The
symptoms
and
the
adrenal
mass
suggested
pheochromocytoma,
and
the
blood
pressure
\
\
indicates
hypertension.
Phenoxybenzamine
is
used
to
treat
hypertension
caused
\
\
by
pheochromocytoma.
The
answer
is
(D)."
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_medicine
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_medicine"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_psychology.yaml
dataset_name: professional_psychology
description:
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
"dataset_name": "professional_psychology"
"description":
"
The
following
are
multiple
choice
questions
(with
answers)
about
professional
\
\
psychology.
\n\n
Q:
In
the
construction
of
a
multiple
regression
equation
for
purposes
\
\
of
prediction,
the
optimal
combination
of
measures
is
one
in
which
the
predictors
\n\
(A)
are
uncorrelated
with
each
other
but
are
moderately
correlated
with
the
criterion
\
...
...
@@ -18,30 +18,31 @@ description: "The following are multiple choice questions (with answers) about p
\
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
for
help.
The
definition
\
\
of
mode
is
the
most
frequently
occurring
number.
The
answer
is
(D).
\n\n
Q:
Carl
\
\
Jung
believed
that
a
client's
transference:
\n
(A)
is
a
fantasy
that
distracts
the
\
\
client
from
reality.
(B)
represents
\u201C
mixed
feelings
\u201D
toward
the
therapist.
\
\
(C)
\"
is
a
form
of
\"\"
acting
out.
\"\"\"
(D)
reflects
the
client
\u2019
s
personal
\
\
and
collective
unconscious.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
\
articles
on
psychology
for
help.
Transference
is
a
phenomenon
that
a
person's
\
\
feelings
are
unconsciously
redirected,
so
it
reflects
the
client's
personal
and
\
\
collective
unconscious.
The
answer
is
(D).
\n\n
Q:
In
terms
of
Hofstede
\u2019
s
(1980)
\
\
five
cultural
dimensions,
the
United
States
scores
at
the
top
of
the
scale
on:
\n\
(A)
individualism.
(B)
individualism
and
power
distance.
(C)
power
distance
and
\
\
masculinity.
(D)
uncertainty
avoidance.
\n
A:
Let's
think
step
by
step.
We
refer
\
\
to
Wikipedia
articles
on
psychology
for
help.
US
scores
highest
on
individualism
\
\
among
the
five
cultural
dimensions.
The
answer
is
(A).
\n\n
Q:
One
of
your
therapy
\
\
clients
asks
your
advice
about
a
good
weight-
reduction
program.
You
have
investigated
\
\
the
programs
in
the
community
and
are
enrolled
in
the
one
you
consider
the
best.
\
\
This
program
offers
a
$50
bonus
to
its
patrons
for
each
new
person
they
bring
\
\
into
the
program.
Under
these
circumstances,
your
most
appropriate
response
would
\
\
be
to
\n
(A)
tell
your
client
the
pros
and
cons
of
each
program
you
know
about
except
\
\
for
the
one
in
which
you
are
enrolled
(B)
recommend
to
your
client
the
program
\
\
in
which
you
are
enrolled
and
explain
the
$50
bonus
you
will
receive
(C)
recommend
\
\
to
your
client
the
program
in
which
you
are
enrolled
and
offer
to
have
the
$50
\
\
bonus
credited
to
your
client's
account
in
the
program
(D)
tell
your
client
the
\
\
pros
and
cons
of
each
program
you
know
about,
but
do
not
claim
the
$50
bonus
if
\
\
your
client
enrolls
in
your
program
\n
A:
Let's
think
step
by
step.
We
refer
to
\
\
Wikipedia
articles
on
psychology
for
help.
Based
on
the
circumstances,
you
should
\
\
tell
your
client
about
the
pros
and
cons
of
each
program,
but
it
would
be
inappropriate
\
\
to
receive
the
bonus,
so
you
should
not
claim
the
$50
bonus.
The
answer
is
(D)."
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_professional_psychology
\
client
from
reality.
(B)
represents
“mixed
feelings”
toward
the
therapist.
(C)
\
\ \"
is
a
form
of
\"\"
acting
out.
\"\"\"
(D)
reflects
the
client’s
personal
and
collective
\
\
unconscious.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
\
\
for
help.
Transference
is
a
phenomenon
that
a
person's
feelings
are
unconsciously
\
\
redirected,
so
it
reflects
the
client's
personal
and
collective
unconscious.
The
\
\
answer
is
(D).
\n\n
Q:
In
terms
of
Hofstede’s
(1980)
five
cultural
dimensions,
the
\
\
United
States
scores
at
the
top
of
the
scale
on:
\n
(A)
individualism.
(B)
individualism
\
\
and
power
distance.
(C)
power
distance
and
masculinity.
(D)
uncertainty
avoidance.
\n\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
for
help.
\
\
US
scores
highest
on
individualism
among
the
five
cultural
dimensions.
The
answer
\
\
is
(A).
\n\n
Q:
One
of
your
therapy
clients
asks
your
advice
about
a
good
weight-
\
\
reduction
program.
You
have
investigated
the
programs
in
the
community
and
are
\
\
enrolled
in
the
one
you
consider
the
best.
This
program
offers
a
$50
bonus
to
\
\
its
patrons
for
each
new
person
they
bring
into
the
program.
Under
these
circumstances,
\
\
your
most
appropriate
response
would
be
to
\n
(A)
tell
your
client
the
pros
and
\
\
cons
of
each
program
you
know
about
except
for
the
one
in
which
you
are
enrolled
\
\
(B)
recommend
to
your
client
the
program
in
which
you
are
enrolled
and
explain
\
\
the
$50
bonus
you
will
receive
(C)
recommend
to
your
client
the
program
in
which
\
\
you
are
enrolled
and
offer
to
have
the
$50
bonus
credited
to
your
client's
account
\
\
in
the
program
(D)
tell
your
client
the
pros
and
cons
of
each
program
you
know
\
\
about,
but
do
not
claim
the
$50
bonus
if
your
client
enrolls
in
your
program
\n\
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
psychology
for
help.
\
\
Based
on
the
circumstances,
you
should
tell
your
client
about
the
pros
and
cons
\
\
of
each
program,
but
it
would
be
inappropriate
to
receive
the
bonus,
so
you
should
\
\
not
claim
the
$50
bonus.
The
answer
is
(D)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_psychology"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_public_relations.yaml
dataset_name: public_relations
description:
'
The
following
are
multiple
choice
questions
(with
answers)
about
public
relations.
Q:
Earth
Hour
was
a
campaign
launched
by
which
organization?
(A)
Greenpeace
(B)
The
UN
(C)
Oxfam
(D)
World
Wildlife
Fund
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
Earth
Hour
is
a
worldwide
movement
organized
launched
by
the
World
Wildlife
Fund.
The
answer
is
(D).
Q:
In
issues
management,
what
is
the
most
proactive
approach
to
addressing
negative
or
misleading
information
posted
online
about
your
organization?
(A)
Buy
domain
names
that
could
be
used
by
opposition
groups.
(B)
Post
anonymous
comments
on
blogs
to
combat
this
information.
(C)
Prepare
a
news
release
that
discredits
the
inaccurate
information.
(D)
Make
policy
changes
to
address
complaints
highlighted
on
these
sites.
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
In
issues
management,
the
most
proactive
approach
to
addressing
negative
or
misleading
information
posted
online
is
to
make
policy
changes
to
address
complaints
highlighted
on
those
sites.
The
answer
is
(D).
Q:
At
which
stage
in
the
planning
process
would
a
situation
analysis
be
carried
out?
(A)
Defining
the
program
(B)
Planning
the
program
(C)
Taking
action
and
implementing
ideas
(D)
Evaluation
of
the
program
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
Situation
analyses
are
typically
carried
out
during
the
planning
process
stage
of
defining
the
program.
The
answer
is
(A).
Q:
Which
of
these
statements
is
true
of
the
Vatican
in
2010
at
the
time
of
the
accusations
of
child
abuse
cover-ups?
(A)
There
was
a
coordinated
media
response.
(B)
Consistent
messages
were
communicated.
(C)
Criticisms
were
taken
as
attacks
on
the
Catholic
Church.
(D)
The
credibility
of
the
Vatican
was
upheld.
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
In
2010
when
there
were
accusations
of
child
abuse
cover-ups,
the
Vatican
took
those
criticisms
as
attacks
on
the
Catholic
Church.
The
answer
is
(C).
Q:
What
should
a
public
relations
media
practitioner
do
if
she
does
not
know
the
answer
to
a
reporter'
'
s
question?
(A)
Give
the
reporter
other
information
she
is
certain
is
correct.
(B)
Say
that
the
information
is
'
'
off
the
record'
'
and
will
be
disseminated
later.
(C)
Say
'
'
I
don'
'
t
know'
'
and
promise
to
provide
the
information
later.
(D)
Say
'
'
no
comment,'
'
rather
than
appear
uninformed.
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
If
a
public
relations
media
practitioner
does
not
know
the
answer
to
a
reporter'
'
s
question,
they
should
say
'
'
I
don'
'
t
know'
'
and
offer
to
provide
the
information
later.
The
answer
is
(C).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_public_relations
"dataset_name": "public_relations"
"description":
"
The
following
are
multiple
choice
questions
(with
answers)
about
public
\
\
relations.
\n\n
Q:
Earth
Hour
was
a
campaign
launched
by
which
organization?
\n
(A)
\
\
Greenpeace
(B)
The
UN
(C)
Oxfam
(D)
World
Wildlife
Fund
\n
A:
Let's
think
step
by
\
\
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
Earth
Hour
\
\
is
a
worldwide
movement
organized
launched
by
the
World
Wildlife
Fund.
The
answer
\
\
is
(D).
\n\n
Q:
In
issues
management,
what
is
the
most
proactive
approach
to
addressing
\
\
negative
or
misleading
information
posted
online
about
your
organization?
\n
(A)
\
\
Buy
domain
names
that
could
be
used
by
opposition
groups.
(B)
Post
anonymous
comments
\
\
on
blogs
to
combat
this
information.
(C)
Prepare
a
news
release
that
discredits
\
\
the
inaccurate
information.
(D)
Make
policy
changes
to
address
complaints
highlighted
\
\
on
these
sites.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
\
\
public
relations
for
help.
In
issues
management,
the
most
proactive
approach
to
\
\
addressing
negative
or
misleading
information
posted
online
is
to
make
policy
\
\
changes
to
address
complaints
highlighted
on
those
sites.
The
answer
is
(D).
\n\
\n
Q:
At
which
stage
in
the
planning
process
would
a
situation
analysis
be
carried
\
\
out?
\n
(A)
Defining
the
program
(B)
Planning
the
program
(C)
Taking
action
and
\
\
implementing
ideas
(D)
Evaluation
of
the
program
\n
A:
Let's
think
step
by
step.
\
\
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
Situation
analyses
\
\
are
typically
carried
out
during
the
planning
process
stage
of
defining
the
program.
\
\
The
answer
is
(A).
\n\n
Q:
Which
of
these
statements
is
true
of
the
Vatican
in
2010
\
\
at
the
time
of
the
accusations
of
child
abuse
cover-ups?
\n
(A)
There
was
a
coordinated
\
\
media
response.
(B)
Consistent
messages
were
communicated.
(C)
Criticisms
were
\
\
taken
as
attacks
on
the
Catholic
Church.
(D)
The
credibility
of
the
Vatican
was
\
\
upheld.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
\
\
relations
for
help.
In
2010
when
there
were
accusations
of
child
abuse
cover-ups,
\
\
the
Vatican
took
those
criticisms
as
attacks
on
the
Catholic
Church.
The
answer
\
\
is
(C).
\n\n
Q:
What
should
a
public
relations
media
practitioner
do
if
she
does
\
\
not
know
the
answer
to
a
reporter's
question?
\n
(A)
Give
the
reporter
other
information
\
\
she
is
certain
is
correct.
(B)
Say
that
the
information
is
'off
the
record'
and
\
\
will
be
disseminated
later.
(C)
Say
'I
don't
know'
and
promise
to
provide
the
\
\
information
later.
(D)
Say
'no
comment,'
rather
than
appear
uninformed.
\n
A:
Let's
\
\
think
step
by
step.
We
refer
to
Wikipedia
articles
on
public
relations
for
help.
\
\
If
a
public
relations
media
practitioner
does
not
know
the
answer
to
a
reporter's
\
\
question,
they
should
say
'I
don't
know'
and
offer
to
provide
the
information
\
\
later.
The
answer
is
(C)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_public_relations"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_security_studies.yaml
dataset_name: security_studies
description:
"
The
following
are
multiple
choice
questions
(with
answers)
about
security
\
"dataset_name": "security_studies"
"description":
"
The
following
are
multiple
choice
questions
(with
answers)
about
security
\
\
studies.
\n\n
Q:
What
are
the
frameworks
of
analysis
within
which
terrorism
has
\
\
been
considered
(as
of
2020)?
\n
(A)
Competition
between
larger
nations
has
resulted
\
\
in
some
countries
actively
supporting
terrorist
groups
to
undermine
the
strength
\
...
...
@@ -81,5 +81,6 @@ description: "The following are multiple choice questions (with answers) about s
\
for
negotiation
or
concession.
\n
A:
Let's
think
step
by
step.
We
refer
to
Wikipedia
\
\
articles
on
security
studies
for
help.
Coercive
diplomacy
uses
the
threat
of
force
\
\
to
induce
the
opponent
to
comply
with
demands.
The
answer
is
(B)."
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_security_studies
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_security_studies"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_sociology.yaml
dataset_name: sociology
description:
'
The
following
are
multiple
choice
questions
(with
answers)
about
sociology.
Q:
Which
of
the
following
is
not
a
problem
associated
with
official
statistics
on
strike
action?
(A)
most
strikes
go
unnoticed
by
employers
and
the
mass
media
(B)
not
all
industrial
disputes
will
be
reported
by
the
employer
(C)
the
definition
of
strikes
excludes
those
that
involve
fewer
than
ten
workers
or
last
less
than
one
day
(D)
it
is
hard
to
compare
strikes
that
were
measured
in
different
ways
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
sociology
for
help.
Official
statistics
on
strike
action
can
be
problematic
because
not
all
industrial
disputes
will
be
reported
by
employers,
the
definition
of
strikes
excludes
those
that
involves
fewer
than
ten
workers
or
last
less
than
one
day,
and
it
is
hard
to
compare
strikes
that
were
measured
in
different
ways.
Thus,
(A)
is
not
a
problem
associated
with
official
statistics
on
strike
action.
The
answer
is
(A).
Q:
What
does
Berger
(1963)
describe
as
a
metaphor
for
social
reality?
(A)
a
fairground
ride
(B)
a
circus
(C)
a
puppet
theatre
(D)
a
ballet
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
sociology
for
help.
Berger
describes
social
reality
using
the
metaphor
of
a
puppet
theatre.
The
answer
is
(C).
Q:
The
term
'
'
hegemony'
'
refers
to:
(A)
the
tendency
for
the
working
class
not
to
realize
their
own
interests
(B)
a
dominant
ideology
that
legitimates
economic,
political
and
cultural
power
(C)
a
form
of
dual
consciousness
based
on
ideology
and
everyday
experiences
(D)
a
mode
of
payment
given
for
outstanding
topiary
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
sociology
for
help.
Hegemony
refers
to
a
dominant
ideology
that
legitimates
economic,
policital,
and
cultural
power.
The
answer
is
(B).
Q:
The
shift
from
'
'
civil
religion'
'
to
'
'
common
religion'
'
means
that:
(A)
the
increasing
bureaucracy
of
the
state
has
made
religion
only
a
marginal
part
of
our
lives
(B)
despite
the
weakening
of
traditional
authority,
our
everyday
lives
and
'
'
common
sense'
'
remain
shaped
by
religious
beliefs
and
values
(C)
religious
participation
in
collective
worship
may
have
declined,
but
people
still
practise
their
faiths
in
private
(D)
people
are
much
more
likely
to
discuss
their
religious
beliefs
in
public,
informal
settings
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
sociology
for
help.
The
shift
from
civil
religion
to
common
religion
means
that
despite
the
weakening
of
traditional
authority,
our
everyday
lives
and
common
sense
remain
shaped
by
religious
beliefs
and
values.
The
answer
is
(B).
Q:
Which
of
the
following
did
the
post-war
welfare
state
of
1948
not
aim
to
provide:
(A)
free
health
care
and
education
for
all
(B)
a
minimum
wage
(C)
full
employment
(D)
universal
welfare
A:
Let'
'
s
think
step
by
step.
We
refer
to
Wikipedia
articles
on
sociology
for
help.
The
post-war
welfare
state
of
1948
aimed
to
provide
free
healthcare
and
education,
full
employment,
and
universal
welfare.
But
it
did
not
aim
to
provide
a
minimum
wage.
The
answer
is
(B).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_sociology
"dataset_name": "sociology"
"description": "The following are multiple choice questions (with answers) about sociology.\n\
  \nQ: Which of the following is not a problem associated with official statistics\
  \ on strike action?\n(A) most strikes go unnoticed by employers and the mass media\
  \ (B) not all industrial disputes will be reported by the employer (C) the definition\
  \ of strikes excludes those that involve fewer than ten workers or last less than\
  \ one day (D) it is hard to compare strikes that were measured in different ways\n\
  A: Let's think step by step. We refer to Wikipedia articles on sociology for help.\
  \ Official statistics on strike action can be problematic because not all industrial\
  \ disputes will be reported by employers, the definition of strikes excludes those\
  \ that involves fewer than ten workers or last less than one day, and it is hard\
  \ to compare strikes that were measured in different ways. Thus, (A) is not a problem\
  \ associated with official statistics on strike action. The answer is (A).\n\nQ:\
  \ What does Berger (1963) describe as a metaphor for social reality?\n(A) a fairground\
  \ ride (B) a circus (C) a puppet theatre (D) a ballet\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on sociology for help. Berger describes social\
  \ reality using the metaphor of a puppet theatre. The answer is (C).\n\nQ: The term\
  \ 'hegemony' refers to:\n(A) the tendency for the working class not to realize their\
  \ own interests (B) a dominant ideology that legitimates economic, political and\
  \ cultural power (C) a form of dual consciousness based on ideology and everyday\
  \ experiences (D) a mode of payment given for outstanding topiary\nA: Let's think\
  \ step by step. We refer to Wikipedia articles on sociology for help. Hegemony refers\
  \ to a dominant ideology that legitimates economic, policital, and cultural power.\
  \ The answer is (B).\n\nQ: The shift from 'civil religion' to 'common religion'\
  \ means that:\n(A) the increasing bureaucracy of the state has made religion only\
  \ a marginal part of our lives (B) despite the weakening of traditional authority,\
  \ our everyday lives and 'common sense' remain shaped by religious beliefs and values\
  \ (C) religious participation in collective worship may have declined, but people\
  \ still practise their faiths in private (D) people are much more likely to discuss\
  \ their religious beliefs in public, informal settings\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on sociology for help. The shift from civil religion\
  \ to common religion means that despite the weakening of traditional authority,\
  \ our everyday lives and common sense remain shaped by religious beliefs and values.\
  \ The answer is (B).\n\nQ: Which of the following did the post-war welfare state\
  \ of 1948 not aim to provide:\n(A) free health care and education for all (B) a\
  \ minimum wage (C) full employment (D) universal welfare\nA: Let's think step by\
  \ step. We refer to Wikipedia articles on sociology for help. The post-war welfare\
  \ state of 1948 aimed to provide free healthcare and education, full employment,\
  \ and universal welfare. But it did not aim to provide a minimum wage. The answer\
  \ is (B)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_sociology"
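Every few-shot example in the description above ends its reasoning with the fixed phrase "The answer is (X).", so a completion that follows the same pattern can be scored by pulling out that letter. The sketch below is only an illustration of that idea: `extract_answer` and `ANSWER_RE` are hypothetical names, and the answer filter actually used by `_mmlu_flan_cot_fewshot_template_yaml` is not shown in this diff and may differ.

```python
import re
from typing import Optional

# Hypothetical helper, not part of lm-evaluation-harness.
# Assumption: completions end with "The answer is (X)." where X is A-D,
# mirroring the few-shot examples in the description string.
ANSWER_RE = re.compile(r"The answer is \(([A-D])\)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the letter from the last 'The answer is (X)' match, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


example = (
    "Let's think step by step. We refer to Wikipedia articles on sociology "
    "for help. Berger describes social reality using the metaphor of a "
    "puppet theatre. The answer is (C)."
)
print(extract_answer(example))
```

Taking the last match rather than the first guards against a completion that restates an earlier few-shot answer before giving its own.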
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml
This diff is collapsed.
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml
This diff is collapsed.