gaoqiong
lm-evaluation-harness
Commit 6769119f (Unverified)
Authored Oct 06, 2023 by Hailey Schoelkopf
Committed by GitHub, Oct 06, 2023
Merge pull request #816 from EleutherAI/flan-benchmark
[Refactor] Flan benchmark
Parents: 4824a832, 7d5e511c
Changes: 448
Showing 20 changed files with 323 additions and 0 deletions
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml  +66 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml  +55 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_world_religions.yaml  +53 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_generative_template_yaml  +24 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_abstract_algebra.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_anatomy.yaml  +7 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_astronomy.yaml  +7 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_business_ethics.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_clinical_knowledge.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_biology.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_chemistry.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_computer_science.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_mathematics.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_medicine.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_physics.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_computer_security.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_conceptual_physics.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_econometrics.yaml  +7 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_electrical_engineering.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_elementary_mathematics.yaml  +8 -0
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_us_foreign_policy.yaml
0 → 100644
dataset_name: us_foreign_policy
description: 'The following are multiple choice questions (with answers) about us foreign policy.
  Q: How did Donald Trump attack globalization in the 2016 campaign?
  (A) Globalization had made men like him too rich (B) Globalization only benefited certain American states, such as New York (C) Liberal elites had encouraged globalization, while ''ordinary Americans'' lost jobs because of it (D) Globalization encouraged damaging trade wars
  A: Let''s think step by step. We refer to Wikipedia articles on us foreign policy for help. Trump attacked globalization because he believed ordinary Americans lost jobs due to it, and so he wanted to blame liberals who had encouraged it. The answer is (C).
  Q: How did NSC-68 change U.S. strategy?
  (A) It globalized containment. (B) It militarized containment. (C) It called for the development of the hydrogen bomb. (D) All of the above
  A: Let''s think step by step. We refer to Wikipedia articles on us foreign policy for help. NSC-68 outlined a variety of courses of action, including globalization of containment, militarization of contaiment, and the development of the hydrogen bomb. The answer is (D).
  Q: How do Defensive Realism and Offensive Realism differ in their explanation of state behaviour?
  (A) Defensive realists place greater emphasis on the role of international institutions (B) Defensive realists place less emphasis on geographical factors (C) Offensive realists give more priority to the national interest than Defensive realists. (D) Defensive realists believe states are security maximizers, while Offensive realists believe states to be power maximizers
  A: Let''s think step by step. We refer to Wikipedia articles on us foreign policy for help. While defensive realism advocates that states are security maximizers, offensive realists think of states as power maximizers. The answer is (D).
  Q: The realm of policy decisions concerned primarily with relations between the United States and the rest of the world is known as
  (A) terrorism policy. (B) economic policy. (C) foreign policy. (D) international policy.
  A: Let''s think step by step. We refer to Wikipedia articles on us foreign policy for help. The topic of policy decisions concerns with relations between the US and the rest of the world is known as foreign policy. The answer is (C).
  Q: How did the 2008 financial crisis affect America''s international reputation?
  (A) It damaged support for the US model of political economy and capitalism (B) It created anger at the United States for exaggerating the crisis (C) It increased support for American global leadership under President Obama (D) It reduced global use of the US dollar
  A: Let''s think step by step. We refer to Wikipedia articles on us foreign policy for help. The 2008 financial crisis damanged the international reputation of the American model of political economy and capitalism. The answer is (A).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_us_foreign_policy
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_virology.yaml
0 → 100644
dataset_name: virology
description: 'The following are multiple choice questions (with answers) about virology.
  Q: The median survival time to AIDS and death was established by following:
  (A) Seroprevalent HIV-infected individuals (B) Seronegatives (C) Seroconverters (D) High-risk seronegatives
  A: Let''s think step by step. We refer to Wikipedia articles on virology for help. The median survival time to AIDS and death was established as a result of the development of seroconverters. The answer is (C).
  Q: Which of the following is a morphological characteristic of the paramyxoviruses.
  (A) Fragile viruses often visualised with RNA spewing from the inside (B) Elongate viruses (C) Icosahedral viruses with envelope (D) Very large viruses
  A: Let''s think step by step. We refer to Wikipedia articles on virology for help. Paramyxoviruses are fragile viruses often visualised with RNA spewing from the inside. The answer is (A).
  Q: The most important goal of a behavioral intervention is:
  (A) Change in behavior (B) Comprehensive coverage (C) Effective use of behavioral theory (D) Sustained behavior change
  A: Let''s think step by step. We refer to Wikipedia articles on virology for help. The prim goal of a behavioral intervention is to cause sustained behavior change. The answer is (D).
  Q: A key factor facilitating the application of nested case-control studies from the MACS was:
  (A) Data collection (B) Establishment of a repository of biologic specimens (C) Participant interest (D) Administration of the questionnaire by staff
  A: Let''s think step by step. We refer to Wikipedia articles on virology for help. The Multicenter AIDS Cohort Study''s use of nested case-control studies was facilitated by the establishment of a repository of biologic specimens. The answer is (B).
  Q: Why are parvoviruses a highly impactful parasite?
  (A) Because they have no nucleic acid (B) They require a helper virus (C) Only replicate in dividing cells (D) Can integrate into host chromosomes
  A: Let''s think step by step. We refer to Wikipedia articles on virology for help. Paroviruses are highly impactful because they do not have nucleic acid. The answer is (A).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_virology
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_world_religions.yaml
0 → 100644
dataset_name: world_religions
description: 'The following are multiple choice questions (with answers) about world religions.
  Q: How can the Upanishads be characterized?
  (A) Ritual texts (B) Philosophical texts (C) Hymns (D) Origin stories
  A: Let''s think step by step. We refer to Wikipedia articles on world religions for help. The Upanishads are the most recent part of Vedas (the oldest scriptures in Hinduism) and supplied the basis of later Hindu philosophy. So they are philosophical texts. The answer is (B).
  Q: What is the Second Gem in Buddhism?
  (A) The Dharma (B) The Sangha (C) The Buddha (D) The Bodhisattva
  A: Let''s think step by step. We refer to Wikipedia articles on world religions for help. The Second Gem in Buddhism is The Dharma. The answer is (A).
  Q: Which Japanese government promoted a kind of national cult based on the emperor and his associations with kami?
  (A) Honen (B) Tanaka (C) Tokugawa (D) Meiji
  A: Let''s think step by step. We refer to Wikipedia articles on world religions for help. The promotion of a national cult based on the emperor and his associations with Kami happened during the reign of Emperor Meiji (1852-1912). The answer is (D).
  Q: In which dynasty was the "Mandate of Heaven" developed to legitimatize the new rulers?
  (A) Shang (B) Zhou (C) Han (D) Xia
  A: Let''s think step by step. We refer to Wikipedia articles on world religions for help. The "Mandate of Heaven" was developed as an ancient Chinese philosophical concept during the Zhou Dynasty (1046-256 BCE). The answer is (B).
  Q: What is the sign of the covenant for Jewish males?
  (A) The rainbow (B) Circumcision (C) A son (D) Bar mitzvah
  A: Let''s think step by step. We refer to Wikipedia articles on world religions for help. In Judaism, the most distinctive sign of the covenant is circumcision (brit milah). The answer is (B).'
include: _mmlu_flan_cot_fewshot_template_yaml
task: mmlu_flan_cot_fewshot_world_religions
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_generative_template_yaml
0 → 100644
group: mmlu_flan_cot_zeroshot
dataset_path: cais/mmlu
validation_split: validation
fewshot_split: dev
output_type: greedy_until
doc_to_text: "Q: {{question.strip()}}\n(A) {{choices[0]}} (B) {{choices[1]}} (C) {{choices[2]}} (D) {{choices[3]}}\nA: Let's think step by step."
doc_to_target: "{{['(A)', '(B)', '(C)', '(D)'][answer]}}"
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
      - function: "take_first"
generation_kwargs:
  until:
    - "</s>"
  do_sample: false
  temperature: 0.0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
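
To make the `get-answer` filter easier to review, here is a minimal Python sketch of what the `regex_pattern` above captures from a chain-of-thought generation. The pattern is copied from the template. The helper names `extract_answer` and `normalize`, the `[invalid]` fallback string, and the use of `re.search` with group 1 are assumptions for illustration, not the harness's actual filter code.

```python
import re
import string

# Pattern copied verbatim from the `regex_pattern` field of the template.
ANSWER_PATTERN = re.compile(
    r"((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)"
    r"|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
)


def extract_answer(generation: str, fallback: str = "[invalid]") -> str:
    """Hypothetical helper: first regex hit, roughly 'regex' then 'take_first'."""
    match = ANSWER_PATTERN.search(generation)
    return match.group(1).strip() if match else fallback


def normalize(text: str) -> str:
    """Hypothetical stand-in for ignore_case / ignore_punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


generation = " We refer to Wikipedia articles on virology for help. The answer is (C)."
extracted = extract_answer(generation)
print(extracted)                                  # (C)
print(normalize(extracted) == normalize("(C)"))   # True
```

Note that the trailing `(?=.)` uses an unescaped dot, so the pattern drops whatever the last character of the line is, not only a period. A generation ending in `The answer is (C)` yields `(C`, which under this sketch's normalization still compares equal to the target once punctuation is ignored.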
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_abstract_algebra.yaml
0 → 100644
dataset_name: abstract_algebra
description: 'The following are multiple choice questions (with answers) about abstract algebra.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_abstract_algebra
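
Each per-subject file only sets `dataset_name`, `description`, and `task`, and pulls everything else in through `include`. The sketch below illustrates one plausible reading of that: the template is loaded first and the subject file's keys override it, then the prompt is the description followed by the rendered `doc_to_text`. The merge rule, the `"\n\n"` separator, the function names, and the sample document are all assumptions for illustration; the Jinja template is imitated with an f-string rather than rendered by the harness.

```python
# Values copied from the template and the abstract_algebra file in this commit;
# only a subset of the template keys is reproduced to keep the sketch short.
TEMPLATES = {
    "_mmlu_flan_generative_template_yaml": {
        "group": "mmlu_flan_cot_zeroshot",
        "dataset_path": "cais/mmlu",
        "fewshot_split": "dev",
        "output_type": "greedy_until",
    }
}

subject_config = {
    "dataset_name": "abstract_algebra",
    "description": "The following are multiple choice questions (with answers) about abstract algebra.",
    "include": "_mmlu_flan_generative_template_yaml",
    "task": "mmlu_flan_cot_zeroshot_abstract_algebra",
}


def resolve_include(config, templates):
    """Assumed merge rule: start from the template, let the subject file override."""
    merged = dict(templates[config["include"]])
    merged.update({k: v for k, v in config.items() if k != "include"})
    return merged


def render_prompt(config, doc, separator="\n\n"):
    """Plain-Python imitation of doc_to_text, prefixed by the description.

    The separator between description and question is an assumption.
    """
    choices = doc["choices"]
    body = (
        f"Q: {doc['question'].strip()}\n"
        f"(A) {choices[0]} (B) {choices[1]} (C) {choices[2]} (D) {choices[3]}\n"
        "A: Let's think step by step."
    )
    return config["description"] + separator + body


merged = resolve_include(subject_config, TEMPLATES)

# Made-up document for illustration, not taken from cais/mmlu.
doc = {
    "question": " What is the order of the cyclic group Z_4? ",
    "choices": ["2", "4", "8", "16"],
    "answer": 1,
}
prompt = render_prompt(merged, doc)
print(prompt)
```

With this layout, adding a subject means adding one short YAML file, while prompt format, filtering, and metrics stay defined once in the shared template.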
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_anatomy.yaml
0 → 100644
dataset_name: anatomy
description: 'The following are multiple choice questions (with answers) about anatomy.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_anatomy
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_astronomy.yaml
0 → 100644
dataset_name: astronomy
description: 'The following are multiple choice questions (with answers) about astronomy.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_astronomy
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_business_ethics.yaml
0 → 100644
dataset_name: business_ethics
description: 'The following are multiple choice questions (with answers) about business ethics.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_business_ethics
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_clinical_knowledge.yaml
0 → 100644
dataset_name: clinical_knowledge
description: 'The following are multiple choice questions (with answers) about clinical knowledge.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_clinical_knowledge
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_biology.yaml
0 → 100644
dataset_name: college_biology
description: 'The following are multiple choice questions (with answers) about college biology.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_college_biology
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_chemistry.yaml
0 → 100644
dataset_name: college_chemistry
description: 'The following are multiple choice questions (with answers) about college chemistry.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_college_chemistry
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_computer_science.yaml
0 → 100644
dataset_name: college_computer_science
description: 'The following are multiple choice questions (with answers) about college computer science.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_college_computer_science
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_mathematics.yaml
0 → 100644
dataset_name: college_mathematics
description: 'The following are multiple choice questions (with answers) about college mathematics.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_college_mathematics
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_medicine.yaml
0 → 100644
dataset_name: college_medicine
description: 'The following are multiple choice questions (with answers) about college medicine.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_college_medicine
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_college_physics.yaml
0 → 100644
dataset_name: college_physics
description: 'The following are multiple choice questions (with answers) about college physics.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_college_physics
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_computer_security.yaml
0 → 100644
dataset_name: computer_security
description: 'The following are multiple choice questions (with answers) about computer security.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_computer_security
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_conceptual_physics.yaml
0 → 100644
dataset_name: conceptual_physics
description: 'The following are multiple choice questions (with answers) about conceptual physics.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_conceptual_physics
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_econometrics.yaml
0 → 100644
dataset_name: econometrics
description: 'The following are multiple choice questions (with answers) about econometrics.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_econometrics
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_electrical_engineering.yaml
0 → 100644
dataset_name: electrical_engineering
description: 'The following are multiple choice questions (with answers) about electrical engineering.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_electrical_engineering
lm_eval/tasks/mmlu/flan_cot_zeroshot/mmlu_elementary_mathematics.yaml
0 → 100644
dataset_name: elementary_mathematics
description: 'The following are multiple choice questions (with answers) about elementary mathematics.
  '
include: _mmlu_flan_generative_template_yaml
task: mmlu_flan_cot_zeroshot_elementary_mathematics