gaoqiong / lm-evaluation-harness
Commit e4db76cb, authored Jul 09, 2024 by haileyschoelkopf
Merge branch 'main' into multimodal-prototyping
Parents: 6cc6e9cd, ad80f555
Changes: 871
Showing 20 changed files with 49 additions and 28 deletions
lm_eval/tasks/mmlu/default/mmlu_security_studies.yaml  +1 -2
lm_eval/tasks/mmlu/default/mmlu_sociology.yaml  +1 -2
lm_eval/tasks/mmlu/default/mmlu_us_foreign_policy.yaml  +1 -2
lm_eval/tasks/mmlu/default/mmlu_virology.yaml  +1 -2
lm_eval/tasks/mmlu/default/mmlu_world_religions.yaml  +1 -2
lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu.yaml  +30 -4
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_abstract_algebra.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_anatomy.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_astronomy.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_business_ethics.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_clinical_knowledge.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_biology.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_chemistry.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_computer_science.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_mathematics.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_medicine.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_physics.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_computer_security.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_conceptual_physics.yaml  +1 -1
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_econometrics.yaml  +1 -1
lm_eval/tasks/mmlu/default/mmlu_security_studies.yaml
 "dataset_name": "security_studies"
 "description": "The following are multiple choice questions (with answers) about security\
   \ studies.\n\n"
-"group": "mmlu_social_sciences"
-"group_alias": "social_sciences"
+"tag": "mmlu_social_sciences_tasks"
 "include": "_default_template_yaml"
 "task": "mmlu_security_studies"
 "task_alias": "security_studies"
lm_eval/tasks/mmlu/default/mmlu_sociology.yaml
 "dataset_name": "sociology"
 "description": "The following are multiple choice questions (with answers) about sociology.\n\
   \n"
-"group": "mmlu_social_sciences"
-"group_alias": "social_sciences"
+"tag": "mmlu_social_sciences_tasks"
 "include": "_default_template_yaml"
 "task": "mmlu_sociology"
 "task_alias": "sociology"
lm_eval/tasks/mmlu/default/mmlu_us_foreign_policy.yaml
 "dataset_name": "us_foreign_policy"
 "description": "The following are multiple choice questions (with answers) about us\
   \ foreign policy.\n\n"
-"group": "mmlu_social_sciences"
-"group_alias": "social_sciences"
+"tag": "mmlu_social_sciences_tasks"
 "include": "_default_template_yaml"
 "task": "mmlu_us_foreign_policy"
 "task_alias": "us_foreign_policy"
lm_eval/tasks/mmlu/default/mmlu_virology.yaml
 "dataset_name": "virology"
 "description": "The following are multiple choice questions (with answers) about virology.\n\
   \n"
-"group": "mmlu_other"
-"group_alias": "other"
+"tag": "mmlu_other_tasks"
 "include": "_default_template_yaml"
 "task": "mmlu_virology"
 "task_alias": "virology"
lm_eval/tasks/mmlu/default/mmlu_world_religions.yaml
 "dataset_name": "world_religions"
 "description": "The following are multiple choice questions (with answers) about world\
   \ religions.\n\n"
-"group": "mmlu_humanities"
-"group_alias": "humanities"
+"tag": "mmlu_humanities_tasks"
 "include": "_default_template_yaml"
 "task": "mmlu_world_religions"
 "task_alias": "world_religions"
lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu.yaml
 group: mmlu_flan_cot_fewshot
+group_alias: mmlu (flan style, fewshot cot)
 task:
-  - mmlu_flan_cot_fewshot_stem
-  - mmlu_flan_cot_fewshot_other
-  - mmlu_flan_cot_fewshot_social_sciences
-  - mmlu_flan_cot_fewshot_humanities
+  - group: stem
+    task:
+      - mmlu_flan_cot_fewshot_stem
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: other
+    task:
+      - mmlu_flan_cot_fewshot_other
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: social sciences
+    task:
+      - mmlu_flan_cot_fewshot_social_sciences
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+  - group: humanities
+    task:
+      - mmlu_flan_cot_fewshot_humanities
+    aggregate_metric_list:
+      - metric: acc
+        weight_by_size: True
+aggregate_metric_list:
+  - metric: acc
+    weight_by_size: True
+metadata:
+  version: 1
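The `aggregate_metric_list` entries above ask for `acc` to be aggregated with `weight_by_size: True`. As a rough illustration only, and assuming this means a size-weighted (micro) average over subtasks rather than a plain mean, the arithmetic could look like the sketch below. The function name `aggregate_acc` and the sample numbers are hypothetical and are not taken from the harness.

```python
from typing import Sequence


def aggregate_acc(
    accs: Sequence[float],
    sizes: Sequence[int],
    weight_by_size: bool = True,
) -> float:
    """Combine per-subtask accuracies into a single group score.

    Hypothetical helper: assumes weight_by_size=True means each subtask
    contributes in proportion to its number of evaluated documents.
    """
    if len(accs) != len(sizes) or not accs:
        raise ValueError("accs and sizes must be non-empty and equally long")
    if not weight_by_size:
        # Unweighted (macro) mean: every subtask counts the same.
        return sum(accs) / len(accs)
    # Size-weighted (micro) mean: larger subtasks count more.
    total = sum(sizes)
    return sum(a * n for a, n in zip(accs, sizes)) / total


# Illustrative numbers only: two subtasks with 300 and 100 documents.
micro = aggregate_acc([0.5, 1.0], [300, 100])
macro = aggregate_acc([0.5, 1.0], [300, 100], weight_by_size=False)
print(micro, macro)
```

With these made-up inputs the weighted score is pulled toward the larger subtask, which is the practical difference between the two settings.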
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_abstract_algebra.yaml
@@ -54,6 +54,6 @@ fewshot_config:
     not have any roots. For c = 2 the polynomial x^2 + 2 has two roots at x = 1 and x = 2. Hence Z_3[x]/(x^2 + c) is a field if and only if c = 1. The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_abstract_algebra
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_anatomy.yaml
@@ -70,6 +70,6 @@ fewshot_config:
     \ origin of the hyoid bone are the second and the third pharyngeal arches\u2014\
     this information is covered in the last option (D). Therefore, we conclude that\
     \ (D) must be the correct answer. The answer is (D).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_anatomy
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_astronomy.yaml
@@ -65,6 +65,6 @@ fewshot_config:
     because it explains that the surface is red due to the rusted materials on the surface and the red color comes from the rust. So the correct option is (A). The answer is (A).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_astronomy
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_business_ethics.yaml
@@ -70,6 +70,6 @@ fewshot_config:
     \ moral arguments relating to: negative *externalities*, the *power* that corporations\
     \ possess and the *mutual independence* of business and society. The answer\
     \ is (D).\n\n"
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_business_ethics
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_clinical_knowledge.yaml
@@ -43,6 +43,6 @@ fewshot_config:
     target: 'Let''s think step by step. We refer to Wikipedia articles on clinical knowledge for help. The energy for muscular contraction is provided by ATP (adenosine triphosphate), which is the powerhouse of the cell. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_clinical_knowledge
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_biology.yaml
@@ -70,6 +70,6 @@ fewshot_config:
     that have different origins, which is not the case for the human and bird forearms, which rules out (D). Humans and birds do belong to the same clade - a group of organisms composed of a common ancestor. The answer is (C).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_biology
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_chemistry.yaml
@@ -44,6 +44,6 @@ fewshot_config:
     \ into 2 lines. This will be further split into 4 lines by the interaction with\
     \ three equivalent 1H nuclei. The total number of lines is therefore $2 \\cdot\
     \ 4 = 8$. The answer is (E).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_chemistry
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_computer_science.yaml
@@ -175,6 +175,6 @@ fewshot_config:
     (1000 nanoseconds / cache miss) * (1 cache miss / 50 instructions) * (50 instructions / 27000 nanoseconds) = 1000 * (1/50) * (50/27000) = 1000/27000 = 1/27. The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_computer_science
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_mathematics.yaml
@@ -68,6 +68,6 @@ fewshot_config:
     \ Then, for all $t \\in \\mathbb{R}$, we have $(s(t))-2=K e^{-t / 25}$, and\
     \ so $s(t)=2+K e^{-t / 25}$. Then $3=s(0)=2+K e^{0}=2+K$, so $K=1$. Then $s(100)=2+K\
     \ e^{-100 / 25}=2+1 \\cdot e^{-4}=2+e^{-4}$. The answer is (D).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_mathematics
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_medicine.yaml
@@ -63,6 +63,6 @@ fewshot_config:
     for help. Glucose (also known as the blood sugar) is the main sugar found in the human body. It is transported into the muscle cell via diffusion through protein transporters called GLUT4. The answer is (A).'
-group: mmlu_flan_cot_fewshot_other
+tag: mmlu_flan_cot_fewshot_other
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_medicine
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_college_physics.yaml
@@ -56,6 +56,6 @@ fewshot_config:
     of the gas container is constant, no work will be done (since work is pressure times change in volume). So, at constant volume, all of the heat goes into the internal energy. The answer is (B).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_college_physics
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_computer_security.yaml
@@ -45,6 +45,6 @@ fewshot_config:
     of the TLS heartbeat extension. The vulnerability was classified as a buffer over-read, a situation where more data can be read than should be allowed. The answer is (C).'
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_computer_security
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_conceptual_physics.yaml
@@ -44,6 +44,6 @@ fewshot_config:
     \ orthogonal to the wind is the same as it would be in the absence of the wind.\
     \ The total speed, which is these two components added in quadrature, is thus\
     \ greater than the speed in still air. The answer is (B).\n\n"
-group: mmlu_flan_cot_fewshot_stem
+tag: mmlu_flan_cot_fewshot_stem
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_conceptual_physics
lm_eval/tasks/mmlu/flan_cot_fewshot/mmlu_econometrics.yaml
@@ -82,6 +82,6 @@ fewshot_config:
     target: 'Let''s think step by step. We refer to Wikipedia articles on econometrics for help. This is a formal logic problem about stationally process. For a stationary autoregressive process, shocks will eventually die away. The answer is (A).'
-group: mmlu_flan_cot_fewshot_social_sciences
+tag: mmlu_flan_cot_fewshot_social_sciences
 include: _mmlu_flan_cot_fewshot_template_yaml
 task: mmlu_flan_cot_fewshot_econometrics
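The flan_cot_fewshot task files on this page all carry the same one-line change: the top-level `group` key becomes `tag`, with the value left as it was. A minimal sketch of applying that rename to a config's text is shown below. The helper `group_to_tag` is hypothetical and not part of the repository; it uses only the standard library and works on raw text rather than parsing YAML. It does not cover the `default/` files, where the value also changes and `group_alias` is dropped.

```python
import re

# Top-level `group:` key, optionally double-quoted. Indented keys (such as
# nested `- group: stem` entries) and `group_alias` are deliberately skipped.
_GROUP_KEY = re.compile(r'^("?)group\1:(?=\s|$)', re.MULTILINE)


def group_to_tag(yaml_text: str) -> str:
    """Rename a top-level `group` key to `tag`, keeping its value unchanged."""
    return _GROUP_KEY.sub(r"\g<1>tag\g<1>:", yaml_text)


before = (
    "group: mmlu_flan_cot_fewshot_stem\n"
    "include: _mmlu_flan_cot_fewshot_template_yaml\n"
    "task: mmlu_flan_cot_fewshot_abstract_algebra\n"
)
after = group_to_tag(before)
print(after)
```

A text-level substitution keeps the rest of each file byte-for-byte identical, which matters here because the few-shot strings use escapes and line continuations that a YAML round trip might rewrite.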