gaoqiong / lm-evaluation-harness
Commit 470059f6
authored Nov 24, 2023 by lintangsutawika
merge conflict
parents b8d7d6c3 9d030712
Showing 20 changed files with 409 additions and 0 deletions
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_psychology.yaml +48 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_public_relations.yaml +39 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_security_studies.yaml +86 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_sociology.yaml +43 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml +40 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml +31 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_world_religions.yaml +27 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml +23 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_abstract_algebra.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_anatomy.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_astronomy.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_business_ethics.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_clinical_knowledge.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_biology.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_chemistry.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_computer_science.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_mathematics.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_medicine.yaml +6 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_physics.yaml +6 -0
Too many changes to show. To preserve performance, only 1000 of 1000+ files are displayed.
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_professional_psychology.yaml
0 → 100644
"dataset_name": "professional_psychology"
"description": "The following are multiple choice questions (with answers) about professional\
  \ psychology.\n\nQ: In the construction of a multiple regression equation for purposes\
  \ of prediction, the optimal combination of measures is one in which the predictors\n\
  (A) are uncorrelated with each other but are moderately correlated with the criterion\
  \ (B) have low correlations with each other and low correlations with the criterion\
  \ (C) are highly intercorrelated with each other and moderately correlated with\
  \ the criterion (D) have low correlations with the criterion bur are moderately\
  \ correlated with each other\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on psychology for help. The basis of multiple regression is to assess\
  \ the relationship between one continuous variable and a set of independent variables.\
  \ So the predictors should be uncorrelated with each other but are moderately correlated\
  \ with the criterion. The answer is (A).\n\nQ: There are three ways to measure the\
  \ Central Tendency: the Mean, the Median and the Mode. From your knowledge about\
  \ them, what is the mode?\n(A) less sensitive to extreme scores than the mean (B)\
  \ more useful for skewed distributions (C) sensitive to extreme values and highly\
  \ skewed distributions (D) the most frequently occurring number\nA: Let's think\
  \ step by step. We refer to Wikipedia articles on psychology for help. The definition\
  \ of mode is the most frequently occurring number. The answer is (D).\n\nQ: Carl\
  \ Jung believed that a client's transference:\n(A) is a fantasy that distracts the\
  \ client from reality. (B) represents “mixed feelings” toward the therapist. (C)\
  \ \"is a form of \"\"acting out.\"\"\" (D) reflects the client’s personal and collective\
  \ unconscious.\nA: Let's think step by step. We refer to Wikipedia articles on psychology\
  \ for help. Transference is a phenomenon that a person's feelings are unconsciously\
  \ redirected, so it reflects the client's personal and collective unconscious. The\
  \ answer is (D).\n\nQ: In terms of Hofstede’s (1980) five cultural dimensions, the\
  \ United States scores at the top of the scale on:\n(A) individualism. (B) individualism\
  \ and power distance. (C) power distance and masculinity. (D) uncertainty avoidance.\n\
  A: Let's think step by step. We refer to Wikipedia articles on psychology for help.\
  \ US scores highest on individualism among the five cultural dimensions. The answer\
  \ is (A).\n\nQ: One of your therapy clients asks your advice about a good weight-\
  \ reduction program. You have investigated the programs in the community and are\
  \ enrolled in the one you consider the best. This program offers a $50 bonus to\
  \ its patrons for each new person they bring into the program. Under these circumstances,\
  \ your most appropriate response would be to\n(A) tell your client the pros and\
  \ cons of each program you know about except for the one in which you are enrolled\
  \ (B) recommend to your client the program in which you are enrolled and explain\
  \ the $50 bonus you will receive (C) recommend to your client the program in which\
  \ you are enrolled and offer to have the $50 bonus credited to your client's account\
  \ in the program (D) tell your client the pros and cons of each program you know\
  \ about, but do not claim the $50 bonus if your client enrolls in your program\n\
  A: Let's think step by step. We refer to Wikipedia articles on psychology for help.\
  \ Based on the circumstances, you should tell your client about the pros and cons\
  \ of each program, but it would be inappropriate to receive the bonus, so you should\
  \ not claim the $50 bonus. The answer is (D)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_professional_psychology"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_public_relations.yaml
0 → 100644
"dataset_name": "public_relations"
"description": "The following are multiple choice questions (with answers) about public\
  \ relations.\n\nQ: Earth Hour was a campaign launched by which organization?\n(A)\
  \ Greenpeace (B) The UN (C) Oxfam (D) World Wildlife Fund\nA: Let's think step by\
  \ step. We refer to Wikipedia articles on public relations for help. Earth Hour\
  \ is a worldwide movement oragnized launched by the World Wildlife Fund. The answer\
  \ is (D).\n\nQ: In issues management, what is the most proactive approach to addressing\
  \ negative or misleading information posted online about your organization?\n(A)\
  \ Buy domain names that could be used by opposition groups. (B) Post anonymous comments\
  \ on blogs to combat this information. (C) Prepare a news release that discredits\
  \ the inaccurate information. (D) Make policy changes to address complaints highlighted\
  \ on these sites.\nA: Let's think step by step. We refer to Wikipedia articles on\
  \ public relations for help. In issues management, the most proactive approach to\
  \ addressing negative or misleading information posted online is to make policy\
  \ changes to address complaints highlighted on those sites. The answer is (D).\n\
  \nQ: At which stage in the planning process would a situation analysis be carried\
  \ out?\n(A) Defining the program (B) Planning the program (C) Taking action and\
  \ implementing ideas (D) Evaluation of the program\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on public relations for help. Situation analyses\
  \ are typically carried out during the planning process stage of defining the program.\
  \ The answer is (A).\n\nQ: Which of these statements is true of the Vatican in 2010\
  \ at the time of the accusations of child abuse cover-ups?\n(A) There was a coordinated\
  \ media response. (B) Consistent messages were communicated. (C) Criticisms were\
  \ taken as attacks on the Catholic Church. (D) The credibility of the Vatican was\
  \ upheld.\nA: Let's think step by step. We refer to Wikipedia articles on public\
  \ relations for help. In 2010 when there were accusations of child abuse cover-ups,\
  \ the Vatican took those criticisms as attacks on the Catholic Church. The answer\
  \ is (C).\n\nQ: What should a public relations media practitioner do if she does\
  \ not know the answer to a reporter's question?\n(A) Give the reporter other information\
  \ she is certain is correct. (B) Say that the information is 'off the record' and\
  \ will be disseminated later. (C) Say 'I don't know' and promise to provide the\
  \ information later. (D) Say 'no comment,' rather than appear uninformed.\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on public relations for help.\
  \ If a public relations media practitioner does not know the answer to a reporter's\
  \ question, they should say 'I don't know' and offer to provide the information\
  \ later. The answer is (C)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_public_relations"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_security_studies.yaml
0 → 100644
"dataset_name": "security_studies"
"description": "The following are multiple choice questions (with answers) about security\
  \ studies.\n\nQ: What are the frameworks of analysis within which terrorism has\
  \ been considered (as of 2020)?\n(A) Competition between larger nations has resulted\
  \ in some countries actively supporting terrorist groups to undermine the strength\
  \ of rival states. Terrorist networks are extended patronage clubs maintained and\
  \ paid for by their donor states and are conceptualised as being like state actors,\
  \ to be dealt with using military force. (B) Globalization has enabled the internationalization\
  \ of terrorist activities by opening up their operational space, although coordination\
  \ is still managed from a geographical base. This suggests that terrorist groups\
  \ are nationally structured which means that terrorism cannot be considered in terms\
  \ of a war to be defeated militarily without having serious implications on the\
  \ indigenous population. (C) Terrorism can be viewed as a problem to be resolved\
  \ by military means (war on terrorism), by normal police techniques (terrorism as\
  \ crime), or as a medical problem with underlying causes and symptoms (terrorism\
  \ as disease). (D) Terrorism is viewed as a criminal problem. The criminalization\
  \ of terrorism has two important implications. Firstly, it suggests that terrorism\
  \ can be eradicated - terrorists can be caught and brought to trial by normal judicial\
  \ proceedings thereby removing the threat from society - and secondly, it suggests\
  \ that preventative crime techniques are applicable to prevent its development.\n\
  A: Let's think step by step. We refer to Wikipedia articles on security studies\
  \ for help. (A) is wrong because it is not competition between larger nations that\
  \ causes terrorism.\n(B) is wrong because globalization is not the cause of terrorism.\n\
  (C) is correct because the US undertook the war on terrorism.\n(D) is wrong because\
  \ preventative crime techniques will likely not end terrorism. The answer is (C).\n\
  \nQ: Which of the following is the best lens through which to investigate the role\
  \ of child soldiers?\n(A) Child soldiers are victims of combat that need re-education\
  \ and rehabilitation. (B) Children and their mothers are not active subjects in\
  \ warfare and are best considered as subjects in the private sphere. (C) Children\
  \ are most often innocent bystanders in war and are best used as signifiers of peace.\
  \ (D) Children have political subjecthood that is missed when they are considered\
  \ as passive victims of warfare.\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on security studies for help. Child soliders as a political topic can\
  \ be missed when they are considered passive victims of warfare. The answer is (D).\n\
  \nQ: How can we best describe the relationship between the state-centric approach\
  \ and the concept of human security?\n(A) There are such wide divisions within the\
  \ human security framework regarding the nature of threats and referent objects\
  \ that no widely applicable comparisons between state-centric approaches and human\
  \ security can be drawn. (B) By adopting the framework of human security, the limitations\
  \ of the realist state-centric approach become evident. Whilst human security defines\
  \ the referent object as the person or population, state-centric approaches prioritise\
  \ the security of the state, de-prioritizing the pursuit of human security. (C)\
  \ The state-centric approach to security is a faction of human security, usually\
  \ defined within the broad school of human security. By being state-centric this\
  \ approach prioritises the individual as the referent object in security studies.\
  \ (D) Both the state-centric and human-centric approaches to security are mutually\
  \ exclusive and offer a sufficient analytic framework with which to understand the\
  \ international security system. It is therefore the role of security analysts to\
  \ determine which of these substantial concepts is correct, and which should be\
  \ discarded.\nA: Let's think step by step. We refer to Wikipedia articles on security\
  \ studies for help. Human security focuses on a person or population whereas state-centric\
  \ approaches focus on the state while deprioritizing human security. The answer\
  \ is (B).\n\nQ: In order to become securitized, a threat must be presented in which\
  \ of these ways?\n(A) As an existential threat that requires immediate and extraordinary\
  \ action, posing a threat to the survival of the state or to societal security.\
  \ (B) As requiring immediate and extraordinary action by the state, threatening\
  \ the survival of a referent object and therefore warranting the use of measures\
  \ not normally employed in the political realm. (C) As an urgent threat to the survival\
  \ of the referent object, so serious that it legitimises the employment of extraordinary\
  \ action in response. (D) As an urgent threat to the survival of the audience that\
  \ requires extraordinary or emergency measures.\nA: Let's think step by step. We\
  \ refer to Wikipedia articles on security studies for help. To be securitized, a\
  \ threat must be an urgent threat to the survival of the referent object. The answer\
  \ is (C).\n\nQ: What distinguishes coercive diplomacy from military force?\n(A)\
  \ Compellence is another term for coercive diplomacy, but covering a narrower set\
  \ of criteria; compellence covers those threats aimed at initiating adversary action.\
  \ A threat to coerce a state to give up part of its territory would count as coercive\
  \ diplomacy, as long as that threat proactively initiates action before reactive\
  \ diplomacy is taken. (B) Coercive diplomacy constitutes the threats of limited\
  \ force to induce adversary's incentive to comply with the coercer's demands. It\
  \ is an influence strategy that is intended to obtain compliance: the use of force\
  \ to defeat an opponent first does not count. It leaves an element of choice with\
  \ the target to comply, or to continue. (C) Military force, or the threat of military\
  \ force, utilises fear to achieve strategic objectives. Coercive diplomacy is differentiated\
  \ from this approach, because it does not use fear as a tool for coercing an adversary.\
  \ (D) Coercive diplomacy is employed to use force but to limit its effects on the\
  \ international community. Coercive diplomacy is an aggressive strategy that is\
  \ intended to obtain compliance through defeat. It does not leave an element of\
  \ choice with the target, the target either being forced to comply or engage in\
  \ conflict. It seeks to control by imposing compliance by removing any opportunity\
  \ for negotiation or concession.\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on security studies for help. Coercive diplomacy uses the threat of force\
  \ to induce the opponent to comply with demands. The answer is (B)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_security_studies"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_sociology.yaml
0 → 100644
"dataset_name": "sociology"
"description": "The following are multiple choice questions (with answers) about sociology.\n\
  \nQ: Which of the following is not a problem associated with official statistics\
  \ on strike action?\n(A) most strikes go unnoticed by employers and the mass media\
  \ (B) not all industrial disputes will be reported by the employer (C) the definition\
  \ of strikes excludes those that involve fewer than ten workers or last less than\
  \ one day (D) it is hard to compare strikes that were measured in different ways\n\
  A: Let's think step by step. We refer to Wikipedia articles on sociology for help.\
  \ Official statistics on strike action can be problematic because not all industrial\
  \ disputes will be reported by employers, the definition of strikes excludes those\
  \ that involves fewer than ten workers or last less than one day, and it is hard\
  \ to compare strikes that were measured in different ways. Thus, (A) is not a problem\
  \ associated with official statistics on strike action. The answer is (A).\n\nQ:\
  \ What does Berger (1963) describe as a metaphor for social reality?\n(A) a fairground\
  \ ride (B) a circus (C) a puppet theatre (D) a ballet\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on sociology for help. Berger describes social\
  \ reality using the metaphor of a puppet theatre. The answer is (C).\n\nQ: The term\
  \ 'hegemony' refers to:\n(A) the tendency for the working class not to realize their\
  \ own interests (B) a dominant ideology that legitimates economic, political and\
  \ cultural power (C) a form of dual consciousness based on ideology and everyday\
  \ experiences (D) a mode of payment given for outstanding topiary\nA: Let's think\
  \ step by step. We refer to Wikipedia articles on sociology for help. Hegemony refers\
  \ to a dominant ideology that legitimates economic, policital, and cultural power.\
  \ The answer is (B).\n\nQ: The shift from 'civil religion' to 'common religion'\
  \ means that:\n(A) the increasing bureaucracy of the state has made religion only\
  \ a marginal part of our lives (B) despite the weakening of traditional authority,\
  \ our everyday lives and 'common sense' remain shaped by religious beliefs and values\
  \ (C) religious participation in collective worship may have declined, but people\
  \ still practise their faiths in private (D) people are much more likely to discuss\
  \ their religious beliefs in public, informal settings\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on sociology for help. The shift from civil religion\
  \ to common religion means that despite the weakening of traditional authority,\
  \ our everyday lives and common sense remain shaped by religious beliefs and values.\
  \ The answer is (B).\n\nQ: Which of the following did the post-war welfare state\
  \ of 1948 not aim to provide:\n(A) free health care and education for all (B) a\
  \ minimum wage (C) full employment (D) universal welfare\nA: Let's think step by\
  \ step. We refer to Wikipedia articles on sociology for help. The post-war welfare\
  \ state of 1948 aimed to provide free healthcare and education, full employment,\
  \ and universal welfare. But it did not aim to provide a minimum wage. The answer\
  \ is (B)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_sociology"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml
0 → 100644
"dataset_name": "us_foreign_policy"
"description": "The following are multiple choice questions (with answers) about us\
  \ foreign policy.\n\nQ: How did Donald Trump attack globalization in the 2016 campaign?\n\
  (A) Globalization had made men like him too rich (B) Globalization only benefited\
  \ certain American states, such as New York (C) Liberal elites had encouraged globalization,\
  \ while 'ordinary Americans' lost jobs because of it (D) Globalization encouraged\
  \ damaging trade wars\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on us foreign policy for help. Trump attacked globalization because he believed\
  \ ordinary Americans lost jobs due to it, and so he wanted to blame liberals who\
  \ had encouraged it. The answer is (C).\n\nQ: How did NSC-68 change U.S. strategy?\n\
  (A) It globalized containment. (B) It militarized containment. (C) It called for\
  \ the development of the hydrogen bomb. (D) All of the above\nA: Let's think step\
  \ by step. We refer to Wikipedia articles on us foreign policy for help. NSC-68\
  \ outlined a variety of courses of action, including globalization of containment,\
  \ militarization of contaiment, and the development of the hydrogen bomb. The answer\
  \ is (D).\n\nQ: How do Defensive Realism and Offensive Realism differ in their explanation\
  \ of state behaviour?\n(A) Defensive realists place greater emphasis on the role\
  \ of international institutions (B) Defensive realists place less emphasis on geographical\
  \ factors (C) Offensive realists give more priority to the national interest than\
  \ Defensive realists. (D) Defensive realists believe states are security maximizers,\
  \ while Offensive realists believe states to be power maximizers\nA: Let's think\
  \ step by step. We refer to Wikipedia articles on us foreign policy for help. While\
  \ defensive realism advocates that states are security maximizers, offensive realists\
  \ think of states as power maximizers. The answer is (D).\n\nQ: The realm of policy\
  \ decisions concerned primarily with relations between the United States and the\
  \ rest of the world is known as\n(A) terrorism policy. (B) economic policy. (C)\
  \ foreign policy. (D) international policy.\nA: Let's think step by step. We refer\
  \ to Wikipedia articles on us foreign policy for help. The topic of policy decisions\
  \ concerns with relations between the US and the rest of the world is known as foreign\
  \ policy. The answer is (C).\n\nQ: How did the 2008 financial crisis affect America's\
  \ international reputation?\n(A) It damaged support for the US model of political\
  \ economy and capitalism (B) It created anger at the United States for exaggerating\
  \ the crisis (C) It increased support for American global leadership under President\
  \ Obama (D) It reduced global use of the US dollar\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on us foreign policy for help. The 2008 financial\
  \ crisis damanged the international reputation of the American model of political\
  \ economy and capitalism. The answer is (A)."
"group": "mmlu_flan_cot_fewshot_social_sciences"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_us_foreign_policy"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml
0 → 100644
"dataset_name": "virology"
"description": "The following are multiple choice questions (with answers) about virology.\n\
  \nQ: The median survival time to AIDS and death was established by following:\n\
  (A) Seroprevalent HIV-infected individuals (B) Seronegatives (C) Seroconverters\
  \ (D) High-risk seronegatives\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on virology for help. The median survival time to AIDS and death was\
  \ established as a result of the development of seroconverters. The answer is (C).\n\
  \nQ: Which of the following is a morphological characteristic of the paramyxoviruses.\n\
  (A) Fragile viruses often visualised with RNA spewing from the inside (B) Elongate\
  \ viruses (C) Icosahedral viruses with envelope (D) Very large viruses\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on virology for help. Paramyxoviruses\
  \ are fragile viruses often visualised with RNA spewing from the inside. The answer\
  \ is (A).\n\nQ: The most important goal of a behavioral intervention is:\n(A) Change\
  \ in behavior (B) Comprehensive coverage (C) Effective use of behavioral theory\
  \ (D) Sustained behavior change\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on virology for help. The prim goal of a behavioral intervention is to\
  \ cause sustained behavior change. The answer is (D).\n\nQ: A key factor facilitating\
  \ the application of nested case-control studies from the MACS was:\n(A) Data collection\
  \ (B) Establishment of a repository of biologic specimens (C) Participant interest\
  \ (D) Administration of the questionnaire by staff\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on virology for help. The Multicenter AIDS Cohort\
  \ Study's use of nested case-control studies was facilitated by the establishment\
  \ of a repository of biologic specimens. The answer is (B).\n\nQ: Why are parvoviruses\
  \ a highly impactful parasite?\n(A) Because they have no nucleic acid (B) They require\
  \ a helper virus (C) Only replicate in dividing cells (D) Can integrate into host\
  \ chromosomes\nA: Let's think step by step. We refer to Wikipedia articles on virology\
  \ for help. Paroviruses are highly impactful because they do not have nucleic acid.\
  \ The answer is (A)."
"group": "mmlu_flan_cot_fewshot_other"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_virology"
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_world_religions.yaml
0 → 100644
"dataset_name": "world_religions"
"description": "The following are multiple choice questions (with answers) about world\
  \ religions.\n\nQ: How can the Upanishads be characterized?\n(A) Ritual texts (B)\
  \ Philosophical texts (C) Hymns (D) Origin stories\nA: Let's think step by step.\
  \ We refer to Wikipedia articles on world religions for help. The Upanishads are\
  \ the most recent part of Vedas (the oldest scriptures in Hinduism) and supplied\
  \ the basis of later Hindu philosophy. So they are philosophical texts. The answer\
  \ is (B).\n\nQ: What is the Second Gem in Buddhism?\n(A) The Dharma (B) The Sangha\
  \ (C) The Buddha (D) The Bodhisattva\nA: Let's think step by step. We refer to Wikipedia\
  \ articles on world religions for help. The Second Gem in Buddhism is The Dharma.\
  \ The answer is (A).\n\nQ: Which Japanese government promoted a kind of national\
  \ cult based on the emperor and his associations with kami?\n(A) Honen (B) Tanaka\
  \ (C) Tokugawa (D) Meiji\nA: Let's think step by step. We refer to Wikipedia articles\
  \ on world religions for help. The promotion of a national cult based on the emperor\
  \ and his associations with Kami happened during the reign of Emperor Meiji (1852-1912).\
  \ The answer is (D).\n\nQ: In which dynasty was the \"Mandate of Heaven\" developed\
  \ to legitimatize the new rulers?\n(A) Shang (B) Zhou (C) Han (D) Xia\nA: Let's\
  \ think step by step. We refer to Wikipedia articles on world religions for help.\
  \ The \"Mandate of Heaven\" was developed as an ancient Chinese philosophical concept\
  \ during the Zhou Dynasty (1046-256 BCE). The answer is (B).\n\nQ: What is the sign\
  \ of the covenant for Jewish males?\n(A) The rainbow (B) Circumcision (C) A son\
  \ (D) Bar mitzvah\nA: Let's think step by step. We refer to Wikipedia articles on\
  \ world religions for help. In Judaism, the most distinctive sign of the covenant\
  \ is circumcision (brit milah). The answer is (B)."
"group": "mmlu_flan_cot_fewshot_humanities"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_world_religions"
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu.yaml
0 → 100644
group: mmlu_flan_cot_zeroshot
task:
  - mmlu_flan_cot_zeroshot_stem
  - mmlu_flan_cot_zeroshot_other
  - mmlu_flan_cot_zeroshot_social_sciences
  - mmlu_flan_cot_zeroshot_humanities
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml
0 → 100644
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
validation_split: validation
fewshot_split: dev
output_type: generate_until
doc_to_text: "Q: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA: Let's think step by step."
doc_to_target: "{{['(A)', '(B)', '(C)', '(D)'][answer]}}"
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
      - function: "take_first"
generation_kwargs:
  until:
    - "</s>"
  do_sample: false
  temperature: 0.0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
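Note on the `get-answer` filter above: the following is a minimal Python sketch of what the `regex_pattern` captures from a generated chain-of-thought answer. It uses only the standard `re` module; the helper name `extract_answer` is hypothetical and this is not the harness's own filter implementation. Because the `.` inside each `(?=.)` lookahead is unescaped, it matches any character, so the pattern keeps everything after the cue phrase up to, but not including, the last character on that line.

```python
import re

# Pattern copied verbatim from the `regex_pattern` entry in the template.
ANSWER_PATTERN = re.compile(
    r"((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)"
    r"|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
)


def extract_answer(generation):
    """Hypothetical helper: return the first span the pattern matches, or None."""
    match = ANSWER_PATTERN.search(generation)
    return match.group(0) if match else None


# A response ending in a period yields the bare option label.
with_period = extract_answer("Hegemony is a dominant ideology. The answer is (B).")

# Without a trailing period, the lookahead consumes the closing parenthesis instead.
without_period = extract_answer("The answer is (B)")

# No cue phrase means no match.
no_cue = extract_answer("I am not sure.")
```

Under these assumptions, `with_period` is `(B)`, `without_period` is `(B`, and `no_cue` is `None`. Whether the truncated form still scores as correct depends on how `exact_match` applies `ignore_punctuation`, which this sketch does not model.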
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_abstract_algebra.yaml
0 → 100644
"dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\
  \ algebra.\n\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_abstract_algebra"
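Note on how a per-subject file such as this one relates to the shared template: the sketch below is plain Python and makes two assumptions that the diff itself does not confirm, namely that `include` behaves like a shallow merge in which the including file's keys win, and that `description` is prepended to the rendered `doc_to_text`. The helpers `merge_include`, `render_doc_to_text`, and `render_doc_to_target` are hypothetical stand-ins that mirror the template's Jinja expressions with f-strings, and the sample `doc` is made up.

```python
# Hypothetical stand-ins for the two YAML files, written as plain dicts.
template = {
    "dataset_path": "hails/mmlu_no_train",
    "output_type": "generate_until",
}
task = {
    "dataset_name": "abstract_algebra",
    "description": "The following are multiple choice questions (with answers) about abstract algebra.\n\n",
    "group": "mmlu_flan_cot_zeroshot_stem",
    "include": "_mmlu_flan_cot_zeroshot_template_yaml",
    "task": "mmlu_flan_cot_zeroshot_abstract_algebra",
}


def merge_include(base, override):
    """Assumed include semantics: start from the template, let the task file override."""
    merged = dict(base)
    merged.update({k: v for k, v in override.items() if k != "include"})
    return merged


def render_doc_to_text(doc):
    """Mirror of the template's doc_to_text expression."""
    a, b, c, d = doc["choices"]
    return (
        f"Q: {doc['question'].strip()}\n"
        f"(A) {a} (B) {b} (C) {c} (D) {d}\n"
        "A: Let's think step by step."
    )


def render_doc_to_target(doc):
    """Mirror of the template's doc_to_target expression."""
    return ["(A)", "(B)", "(C)", "(D)"][doc["answer"]]


config = merge_include(template, task)

# Made-up document in the question/choices/answer shape the template expects.
doc = {"question": " What is 2 + 2? ", "choices": ["3", "4", "5", "6"], "answer": 1}
prompt = config["description"] + render_doc_to_text(doc)
target = render_doc_to_target(doc)
```

Read this way, each six-line subject file only supplies the dataset subset, the prompt preamble, and the task and group names, while prompt layout, answer extraction, and metrics come from the template.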
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_anatomy.yaml
0 → 100644
"dataset_name": "anatomy"
"description": "The following are multiple choice questions (with answers) about anatomy.\n\
  \n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_anatomy"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_astronomy.yaml
0 → 100644
"dataset_name": "astronomy"
"description": "The following are multiple choice questions (with answers) about astronomy.\n\
  \n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_astronomy"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_business_ethics.yaml
0 → 100644
"dataset_name": "business_ethics"
"description": "The following are multiple choice questions (with answers) about business\
  \ ethics.\n\n"
"group": "mmlu_flan_cot_zeroshot_other"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_business_ethics"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_clinical_knowledge.yaml
0 → 100644
"dataset_name": "clinical_knowledge"
"description": "The following are multiple choice questions (with answers) about clinical\
  \ knowledge.\n\n"
"group": "mmlu_flan_cot_zeroshot_other"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_clinical_knowledge"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_biology.yaml
0 → 100644
"dataset_name": "college_biology"
"description": "The following are multiple choice questions (with answers) about college\
  \ biology.\n\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_biology"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_chemistry.yaml
0 → 100644
"dataset_name": "college_chemistry"
"description": "The following are multiple choice questions (with answers) about college\
  \ chemistry.\n\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_chemistry"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_computer_science.yaml
0 → 100644
"dataset_name": "college_computer_science"
"description": "The following are multiple choice questions (with answers) about college\
  \ computer science.\n\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_computer_science"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_mathematics.yaml
0 → 100644
"dataset_name": "college_mathematics"
"description": "The following are multiple choice questions (with answers) about college\
  \ mathematics.\n\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_mathematics"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_medicine.yaml
0 → 100644
"dataset_name": "college_medicine"
"description": "The following are multiple choice questions (with answers) about college\
  \ medicine.\n\n"
"group": "mmlu_flan_cot_zeroshot_other"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_medicine"
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_physics.yaml
0 → 100644
"dataset_name": "college_physics"
"description": "The following are multiple choice questions (with answers) about college\
  \ physics.\n\n"
"group": "mmlu_flan_cot_zeroshot_stem"
"include": "_mmlu_flan_cot_zeroshot_template_yaml"
"task": "mmlu_flan_cot_zeroshot_college_physics"