gaoqiong / lm-evaluation-harness
Commit b58e5556, authored Jul 27, 2025 by Baber

Merge branch 'main' into tasklist

# Conflicts:
#   pyproject.toml

Parents: 6e1866f5, 4f8195f1
Changes: 340
Showing 20 changed files with 59 additions and 158 deletions (+59 -158)
lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml  +2 -1
lm_eval/tasks/bbh/cot_fewshot/causal_judgement.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/date_understanding.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/disambiguation_qa.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/dyck_languages.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/formal_fallacies.yaml  +3 -3
lm_eval/tasks/bbh/cot_fewshot/geometric_shapes.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/hyperbaton.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_five_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_seven_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_three_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/movie_recommendation.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/multistep_arithmetic_two.yaml  +3 -3
lm_eval/tasks/bbh/cot_fewshot/navigate.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/object_counting.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/penguins_in_a_table.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/reasoning_about_colored_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/ruin_names.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/salient_translation_error_detection.yaml  +3 -7
lm_eval/tasks/bbh/cot_fewshot/snarks.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml
...
...
@@ -2,6 +2,7 @@ dataset_path: SaylorTwift/bbh
output_type: generate_until
test_split: test
doc_to_target: "{{target}}"
target_delimiter: ""
metric_list:
- metric: exact_match
aggregation: mean
...
...
@@ -24,4 +25,4 @@ filter_list:
 - function: "take_first"
 num_fewshot: 3
 metadata:
-version: 3.0
+version: 4.0
lm_eval/tasks/bbh/cot_fewshot/causal_judgement.yaml
...
@@ -26,9 +26,7 @@ fewshot_config:
 - Yes
 - No'
-target: 'Let''s think step by step. Here in this question, we are told that "Frank T. had no experience with guns,
+target: 'Here in this question, we are told that "Frank T. had no experience with guns,
 his hand slipped on the barrel of the gun, and the shot went wild." A typical person would assume that this passage suggests that Frank T. had no intention of shooting and injuring someone and that the bullet accidentally hit the neighbor''s
...
@@ -50,9 +48,7 @@ fewshot_config:
 - Yes
 - No'
-target: 'Let''s think step by step. Here in this question, we are told that the boss ordered them both to arrive
+target: 'Here in this question, we are told that the boss ordered them both to arrive
 at the meeting room at the same time and that the motion detector was set up to be triggered if at least one person appeared in the room at the same time." A typical person would assume that the person probably meant to say the detector
...
@@ -82,9 +78,7 @@ fewshot_config:
 - Yes
 - No'
-target: 'Let''s think step by step. Here in this question, we are told that "He aims the dart at the low point region."
+target: 'Here in this question, we are told that "He aims the dart at the low point region."
 A typical person might therefore think George did intentionally hit the low point region, because he wanted to lift up the spirit of his sister Lena. So the answer is Yes.'
...
...
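The hunks in these task files all make the same edit: each few-shot `target` loses its leading "Let's think step by step." sentence. The sketch below illustrates that transformation on a plain string. It is not taken from the commit; `COT_PREFIX` and `strip_cot_prefix` are hypothetical names, and how the author actually produced the change is not shown on this page.

```python
# Hypothetical helper illustrating the edit seen in each hunk;
# not part of lm-evaluation-harness.
COT_PREFIX = "Let's think step by step."


def strip_cot_prefix(target: str) -> str:
    """Drop the leading chain-of-thought sentence, if present."""
    if target.startswith(COT_PREFIX):
        # The prefix is followed by a space or a newline depending on the
        # file, so trim any whitespace left at the start.
        return target[len(COT_PREFIX):].lstrip()
    return target


old_target = "Let's think step by step.\n(1) Lesley is a close friend of Fernando"
new_target = strip_cot_prefix(old_target)
print(new_target)
```

Targets that do not start with the prefix are returned unchanged, so a sentence such as "Let\u2019s recall that the order of operations" (with a typographic apostrophe) is left alone.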
lm_eval/tasks/bbh/cot_fewshot/date_understanding.yaml
...
@@ -26,9 +26,7 @@ fewshot_config:
 (E) 07/14/1938
 (F) 12/14/1988'
-target: 'Let''s think step by step. If today is Christmas Eve of 1937, then today''s date is December 24, 1937.
+target: 'If today is Christmas Eve of 1937, then today''s date is December 24, 1937.
 10 days before today is December 14, 1937, that is 12/14/1937. So the answer is (D).'
 - input: 'Tomorrow is 11/12/2019. What is the date one year ago from today in MM/DD/YYYY?
...
@@ -44,9 +42,7 @@ fewshot_config:
 (D) 11/02/2018
 (E) 11/04/2018'
-target: 'Let''s think step by step. If tomorrow is 11/12/2019, then today is 11/11/2019. The date one year ago from
+target: 'If tomorrow is 11/12/2019, then today is 11/11/2019. The date one year ago from
 today is 11/11/2018. So the answer is (B).'
 - input: 'Jane and John married on Jan 2, 1958. It is their 5-year anniversary today. What is the date tomorrow in MM/DD/YYYY?
...
@@ -64,9 +60,7 @@ fewshot_config:
 (E) 01/03/1982
 (F) 12/03/1960'
-target: 'Let''s think step by step. If Jane and John married on Jan 2, 1958, then and if it is their 5-year anniversary
+target: 'If Jane and John married on Jan 2, 1958, then and if it is their 5-year anniversary
 today, then today''s date is Jan 2, 1963. The date tomorrow is Jan 3, 1963, that is 01/03/1963. So the answer is (B).'
 include: _cot_fewshot_template_yaml
...
...
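The three few-shot targets above each walk through a small piece of date arithmetic. As a sanity check, the same steps can be sketched with Python's standard `datetime` module. The helper names `fmt` and `shift_years` are illustrative assumptions and not part of the task files.

```python
from datetime import date, timedelta


def fmt(d: date) -> str:
    """Format a date as MM/DD/YYYY, the format the task asks for."""
    return d.strftime("%m/%d/%Y")


def shift_years(d: date, years: int) -> date:
    """Move a date by whole years (would raise for Feb 29 in a non-leap year)."""
    return d.replace(year=d.year + years)


# Example 1: 10 days before Christmas Eve of 1937.
ex1 = fmt(date(1937, 12, 24) - timedelta(days=10))

# Example 2: tomorrow is 11/12/2019; one year ago from today.
today = date(2019, 11, 12) - timedelta(days=1)
ex2 = fmt(shift_years(today, -1))

# Example 3: married Jan 2, 1958; 5-year anniversary today; date tomorrow.
anniversary = shift_years(date(1958, 1, 2), 5)
ex3 = fmt(anniversary + timedelta(days=1))

print(ex1, ex2, ex3)  # 12/14/1937 11/11/2018 01/03/1963
```

The printed values match the dates stated in the targets: 12/14/1937, 11/11/2018 and 01/03/1963.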
lm_eval/tasks/bbh/cot_fewshot/disambiguation_qa.yaml
...
@@ -23,9 +23,7 @@ fewshot_config:
 (B) The counselor took the day off
 (C) Ambiguous'
-target: 'Let''s think step by step. Here we need to determine who the pronoun "they" might be referring to. There
+target: 'Here we need to determine who the pronoun "they" might be referring to. There
 are two possible referents for "they", namely the chief and the counselor. The verb "told" might be able to help us determine which one is more likely (if either). Let X be the chief and Y the counselor. The sentence is then of the
...
@@ -54,9 +52,7 @@ fewshot_config:
 (B) The manager didn''t reply yet
 (C) Ambiguous'
-target: 'Let''s think step by step. Here we need to determine who the pronoun "he" might be referring to. There
+target: 'Here we need to determine who the pronoun "he" might be referring to. There
 are two possible referents for "he", namely the manager and the secretary. The verbs "sent" and "reply" might be able to help us determine which one is more likely (if either). Let X be the manager and Y the secretary. The sentence is
...
@@ -84,9 +80,7 @@ fewshot_config:
 (B) It will be the director''s office
 (C) Ambiguous'
-target: 'Let''s think step by step. Here we need to determine who the pronoun "his" might be referring to. There
+target: 'Here we need to determine who the pronoun "his" might be referring to. There
 are two possible referents for "his", namely Bailey''s and the director''s. The verb phrase "plan to meet" might be able to help us determine which one is more likely (if either). Let X be Bailey and Y the director. The sentence
...
...
lm_eval/tasks/bbh/cot_fewshot/dyck_languages.yaml
...
@@ -13,9 +13,7 @@ fewshot_config:
 samples:
 - input: 'Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: [ { ['
-target: 'Let''s think step by step. We should process each input one by one and keep track of the stack configuration.
+target: 'We should process each input one by one and keep track of the stack configuration.
 0: empty stack
...
@@ -32,9 +30,7 @@ fewshot_config:
 So, we need "]", "}", "]". So the answer is ] } ].'
 - input: 'Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: < > ( ( [ [ ( { } ) [ < > ] ]'
-target: 'Let''s think step by step. We should process each input one by one and keep track of the stack configuration.
+target: 'We should process each input one by one and keep track of the stack configuration.
 0: empty stack
...
@@ -76,9 +72,7 @@ fewshot_config:
 - input: 'Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: < [ < [ { < [ ] < { } > > } ] > { { ( ) } { < [ < > ] > }'
-target: 'Let''s think step by step. We should process each input one by one and keep track of the stack configuration.
+target: 'We should process each input one by one and keep track of the stack configuration.
 0: empty stack
...
...
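The targets above describe processing the input one token at a time while tracking a stack. A minimal sketch of that procedure follows, assuming space-separated bracket tokens as in the inputs shown. `PAIRS` and `complete_brackets` are hypothetical names, not part of the harness.

```python
# Map each opening bracket to its closing counterpart.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def complete_brackets(sequence: str) -> str:
    """Return the closing brackets needed to balance a space-separated prefix."""
    stack = []
    for token in sequence.split():
        if token in PAIRS:
            # Opening bracket: push it onto the stack.
            stack.append(token)
        elif stack and PAIRS[stack[-1]] == token:
            # Closing bracket that matches the top of the stack: pop it.
            stack.pop()
        else:
            raise ValueError(f"unexpected token {token!r}")
    # Close whatever is still open, innermost first.
    return " ".join(PAIRS[t] for t in reversed(stack))


print(complete_brackets("[ { ["))  # ] } ]
```

For the first sample input, `[ { [`, the stack ends as `[ { [`, so the completion is `] } ]`, which agrees with the answer quoted in the hunk context.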
lm_eval/tasks/bbh/cot_fewshot/formal_fallacies.yaml
...
@@ -25,7 +25,7 @@ fewshot_config:
 - valid
 - invalid'
-target: "Let's think step by step.\n(1) Lesley is a close friend of Fernando:\
+target: "(1) Lesley is a close friend of Fernando:\
 \ Lesley = friend(Fernando).\n(2) Being a close friend of Fernando or a schoolmate\
 \ of Lowell is sufficient for being a great-grandfather of Leroy: If X = friend(Fernando)\
 \ OR SCHOOLMATE(Lowell), then X = great-grandfather(Leroy).\nHypothesis: Does\
...
@@ -49,7 +49,7 @@ fewshot_config:
 - valid
 - invalid'
-target: "Let's think step by step.\n(1) Whoever is not a great-grandfather of\
+target: "(1) Whoever is not a great-grandfather of\
 \ Clyde is a stepbrother of Brian: If X = NOT (great-grandfather(Clyde)), then\
 \ X = stepbrother(Brian).\n(2): Being an ancestor of Dana is sufficient for\
 \ not being a great-grandfather of Clyde: If X = ancestor(Dana), X = NOT (great-grandfather(Clyde)).\n\
...
@@ -78,7 +78,7 @@ fewshot_config:
 - valid
 - invalid'
-target: "Let's think step by step.\n(1) Every infrequent user of Paul Mitchell\
+target: "(1) Every infrequent user of Paul Mitchell\
 \ shampoo is either a rare consumer of Nioxin shampoo or a loyal buyer of Caress\
 \ soap, or both: If X = infrequent-user(Paul Mitchell), then X = rare-consumer(Nioxin)\
 \ OR X = loyal-buyer(Caress).\n(2): No regular consumer of Lush soap is a rare\
...
...
lm_eval/tasks/bbh/cot_fewshot/geometric_shapes.yaml
...
@@ -36,9 +36,7 @@ fewshot_config:
 (I) sector
 (J) triangle'
-target: 'Let''s think step by step. This SVG path element contains "M" and "L" commands. M takes two parameters
+target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
 (x,y) and moves the current point to the coordinates (x,y). L takes two parameters (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
...
@@ -90,9 +88,7 @@ fewshot_config:
 (I) sector
 (J) triangle'
-target: 'Let''s think step by step. This SVG path element contains "M" and "L" commands. M takes two parameters
+target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
 (x,y) and moves the current point to the coordinates (x,y). L takes two parameters (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
...
@@ -138,9 +134,7 @@ fewshot_config:
 (I) sector
 (J) triangle'
-target: 'Let''s think step by step. This SVG path element contains "M" and "L" commands. M takes two parameters
+target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
 (x,y) and moves the current point to the coordinates (x,y). L takes two parameters (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
...
...
lm_eval/tasks/bbh/cot_fewshot/hyperbaton.yaml
...
@@ -18,9 +18,7 @@ fewshot_config:
 (A) rubber terrible ship
 (B) terrible rubber ship'
-target: 'Let''s think step by step. When there is more than one adjective before a noun, the adjectives need to
+target: 'When there is more than one adjective before a noun, the adjectives need to
 respect the following order before a noun: "[1. opinion] [2. size] [3. age] [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
...
@@ -39,9 +37,7 @@ fewshot_config:
 (A) repulsive small Brazilian exercise ship
 (B) Brazilian repulsive exercise small ship'
-target: 'Let''s think step by step. When there is more than one adjective before a noun, the adjectives need to
+target: 'When there is more than one adjective before a noun, the adjectives need to
 respect the following order before a noun: "[1. opinion] [2. size] [3. age] [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
...
@@ -63,9 +59,7 @@ fewshot_config:
 (A) blue gold wonderful square shoe
 (B) wonderful square blue gold shoe'
-target: 'Let''s think step by step. When there is more than one adjective before a noun, the adjectives need to
+target: 'When there is more than one adjective before a noun, the adjectives need to
 respect the following order before a noun: "[1. opinion] [2. size] [3. age] [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
...
...
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_five_objects.yaml
...
@@ -24,9 +24,7 @@ fewshot_config:
 (B) Eli finished last
 (C) Eve finished last'
-target: 'Let''s think step by step. (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
 (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
...
@@ -50,9 +48,7 @@ fewshot_config:
 (B) The green book is the leftmost
 (C) The orange book is the leftmost'
-target: 'Let''s think step by step. (1) The green book is to the right of the white book: "(left) ? white ? green
+target: '(1) The green book is to the right of the white book: "(left) ? white ? green
 ? (right)".
 (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
...
@@ -76,9 +72,7 @@ fewshot_config:
 (B) The gray book is the leftmost
 (C) The white book is the leftmost'
-target: 'Let''s think step by step. (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
 (right)".
 (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
...
...
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_seven_objects.yaml
...
@@ -24,9 +24,7 @@ fewshot_config:
 (B) Eli finished last
 (C) Eve finished last'
-target: 'Let''s think step by step. (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
 (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
...
@@ -50,9 +48,7 @@ fewshot_config:
 (B) The green book is the leftmost
 (C) The orange book is the leftmost'
-target: 'Let''s think step by step. (1) The green book is to the right of the white book: "(left) ? white ? green
+target: '(1) The green book is to the right of the white book: "(left) ? white ? green
 ? (right)".
 (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
...
@@ -76,9 +72,7 @@ fewshot_config:
 (B) The gray book is the leftmost
 (C) The white book is the leftmost'
-target: 'Let''s think step by step. (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
 (right)".
 (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
...
...
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_three_objects.yaml
...
@@ -24,9 +24,7 @@ fewshot_config:
 (B) Eli finished last
 (C) Eve finished last'
-target: 'Let''s think step by step. (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
 (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
...
@@ -50,9 +48,7 @@ fewshot_config:
 (B) The green book is the leftmost
 (C) The orange book is the leftmost'
-target: 'Let''s think step by step. (1) The green book is to the right of the white book: "(left) ? white ? green
+target: '(1) The green book is to the right of the white book: "(left) ? white ? green
 ? (right)".
 (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
...
@@ -76,9 +72,7 @@ fewshot_config:
 (B) The gray book is the leftmost
 (C) The white book is the leftmost'
-target: 'Let''s think step by step. (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
 (right)".
 (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
...
...
lm_eval/tasks/bbh/cot_fewshot/movie_recommendation.yaml
...
@@ -26,9 +26,7 @@ fewshot_config:
 (D) The Barkley Marathons The Race That Eats Its Young
 (E) Bug'
-target: 'Let''s think step by step. - Star Wars Episode IV - A New Hope (action, adventure, fantasy; 1977)
+target: '- Star Wars Episode IV - A New Hope (action, adventure, fantasy; 1977)
 - Indiana Jones and the Last Crusade (action, adventure; 1989)
...
@@ -54,9 +52,7 @@ fewshot_config:
 (D) The Salton Sea
 (E) Extreme Days'
-target: 'Let''s think step by step. - Twister (action, adventure, thriller; 1996)
+target: '- Twister (action, adventure, thriller; 1996)
 - The Silence of the Lambs (crime, drama, thriller; 1991)
...
@@ -79,9 +75,7 @@ fewshot_config:
 (C) Catwoman
 (D) Edge of Tomorrow'
-target: 'Let''s think step by step. - Minority Report (action, crime, mystery; 2002)
+target: '- Minority Report (action, crime, mystery; 2002)
 - Total Recall (action, adventure, science-fiction; 2012)
...
...
lm_eval/tasks/bbh/cot_fewshot/multistep_arithmetic_two.yaml
...
@@ -12,7 +12,7 @@ fewshot_config:
 sampler: first_n
 samples:
 - input: ((-5 + 9 * -4 - 0) * (4 + -7 + 0 * -5)) =
-target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+target: "Let\u2019s recall that the order of operations\
 \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
 \ and division (from left to right), (4) addition and multiplication (from left\
 \ to right). So, remember to always compute the expressions inside parentheses\
...
@@ -23,7 +23,7 @@ fewshot_config:
 \ + 0) = (4 + -7) = (4 - 7) = -3.\nThen, the final equation is A * B = -41 *\
 \ -3 = (-61) * (-3) = 123. So the answer is 123."
 - input: ((-9 * 7 * 7 * -9) + (4 * -9 - 8 - -4)) =
-target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+target: "Let\u2019s recall that the order of operations\
 \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
 \ and division (from left to right), (4) addition and multiplication (from left\
 \ to right). So, remember to always compute the expressions inside parentheses\
...
@@ -34,7 +34,7 @@ fewshot_config:
 \ - 8) - (-4)) = (-44 - (-4)) = -40.\nThen, the final equation is A + B = 3969\
 \ + -40 = 3969 - 40 = 3929. So the answer is 3929."
 - input: ((-3 + 5 * 8 * -4) - (9 - 8 * -7 + -9)) =
-target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+target: "Let\u2019s recall that the order of operations\
 \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
 \ and division (from left to right), (4) addition and multiplication (from left\
 \ to right). So, remember to always compute the expressions inside parentheses\
...
...
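The few-shot targets above split each expression into two parenthesised parts, A and B, and then combine them. As a quick check of the quoted results, the same decomposition can be written directly in Python, whose operator precedence matches the order of operations described. The variable names are illustrative only.

```python
# Sample 1: ((-5 + 9 * -4 - 0) * (4 + -7 + 0 * -5))
a1 = -5 + 9 * -4 - 0      # A = -41
b1 = 4 + -7 + 0 * -5      # B = -3
result1 = a1 * b1         # 123

# Sample 2: ((-9 * 7 * 7 * -9) + (4 * -9 - 8 - -4))
a2 = -9 * 7 * 7 * -9      # A = 3969
b2 = 4 * -9 - 8 - -4      # B = -40
result2 = a2 + b2         # 3929

# Sample 3: ((-3 + 5 * 8 * -4) - (9 - 8 * -7 + -9))
a3 = -3 + 5 * 8 * -4
b3 = 9 - 8 * -7 + -9
result3 = a3 - b3

print(result1, result2, result3)
```

The first two results agree with the answers stated in the hunk context (123 and 3929). The intermediate "(-61) * (-3)" shown in the first sample's context is quoted from the task file as it stands; the value of A computed here is -41.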
lm_eval/tasks/bbh/cot_fewshot/navigate.yaml
...
@@ -21,9 +21,7 @@ fewshot_config:
 - Yes
 - No'
-target: 'Let''s think step by step. We start at the origin (0, 0), facing the positive y-axis.
+target: 'We start at the origin (0, 0), facing the positive y-axis.
 (1) Turn left: (0, 0), facing the negative x-axis.
...
@@ -49,9 +47,7 @@ fewshot_config:
 - Yes
 - No'
-target: 'Let''s think step by step. We start at the origin (0, 0), facing the positive y-axis.
+target: 'We start at the origin (0, 0), facing the positive y-axis.
 (1) Turn around: (0, 0), facing the negative y-axis.
...
@@ -76,9 +72,7 @@ fewshot_config:
 - Yes
 - No'
-target: 'Let''s think step by step. We start at the origin (0, 0), facing the positive y-axis.
+target: 'We start at the origin (0, 0), facing the positive y-axis.
 (1) Always face forward: (0, 0), facing the positive y-axis.
...
...
lm_eval/tasks/bbh/cot_fewshot/object_counting.yaml
...
@@ -14,9 +14,7 @@ fewshot_config:
 samples:
 - input: I have a blackberry, a clarinet, a nectarine, a plum, a strawberry, a banana,
 a flute, an orange, and a violin. How many fruits do I have?
-target: 'Let''s think step by step. We first identify the fruits on the list and include their quantity in parentheses:
+target: 'We first identify the fruits on the list and include their quantity in parentheses:
 - blackberry (1)
...
@@ -34,9 +32,7 @@ fewshot_config:
 answer is 6.'
 - input: I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape,
 a nectarine, and three plums. How many fruits do I have?
-target: 'Let''s think step by step. We first identify the fruits on the list and include their quantity in parentheses:
+target: 'We first identify the fruits on the list and include their quantity in parentheses:
 - orange (1)
...
@@ -58,9 +54,7 @@ fewshot_config:
 11. So the answer is 11.'
 - input: I have a lettuce head, a head of broccoli, an onion, a stalk of celery,
 two carrots, a garlic, and a yam. How many vegetables do I have?
-target: 'Let''s think step by step. We first identify the vegetables on the list and include their quantity in parentheses:
+target: 'We first identify the vegetables on the list and include their quantity in parentheses:
 - lettuce (1)
...
...
lm_eval/tasks/bbh/cot_fewshot/penguins_in_a_table.yaml
...
@@ -32,9 +32,7 @@ fewshot_config:
 (D) 4
 (E) 5'
-target: 'Let''s think step by step. This question focuses on age. We know the following: Louis is 7 years old, Bernard
+target: 'This question focuses on age. We know the following: Louis is 7 years old, Bernard
 is 5 years old, Vincent is 9 years old, and Gwen is 8 years old.
 Now, we add James to this table: James is 12 years old.
...
@@ -59,9 +57,7 @@ fewshot_config:
 (D) Gwen
 (E) James'
-target: 'Let''s think step by step. This question focuses on age. We know the following: Louis is 7 years old, Bernard
+target: 'This question focuses on age. We know the following: Louis is 7 years old, Bernard
 is 5 years old, Vincent is 9 years old, and Gwen is 8 years old.
 According to the table, Bernard (5) is the youngest amongst them.
...
@@ -84,9 +80,7 @@ fewshot_config:
 (D) Gwen
 (E) James'
-target: 'Let''s think step by step. This question focuses on the name. We know the following: The names of the penguin
+target: 'This question focuses on the name. We know the following: The names of the penguin
 in the table are Louis, Bernard, Vincent, and Gwen.
 When we sort their names alphabetically, we get Bernard, Gwen, Louis, Vincent.
...
...
lm_eval/tasks/bbh/cot_fewshot/reasoning_about_colored_objects.yaml
...
@@ -52,9 +52,7 @@ fewshot_config:
 (Q) purple
 (R) pink'
-target: 'Let''s think step by step. According to this question, the color of the stress ball is blue. So the answer
+target: 'According to this question, the color of the stress ball is blue. So the answer
 is (E).'
 - input: 'On the table, you see a bunch of objects arranged in a row: a purple paperclip, a pink stress ball, a brown keychain, a green scrunchiephone charger, a mauve
...
@@ -98,9 +96,7 @@ fewshot_config:
 (Q) purple
 (R) pink'
-target: 'Let''s think step by step. According to this question, the objects are arranged in a row, from left to
+target: 'According to this question, the objects are arranged in a row, from left to
 right, as follows: (1) a purple paperclip, (2) a pink stress ball, (3) a brown keychain, (4) a green scrunchiephone charger, (5) a mauve fidget spinner, (6) a burgundy pen.
...
@@ -129,9 +125,7 @@ fewshot_config:
 (F) five
 (G) six'
-target: 'Let''s think step by step. According to this question, the objects are arranged in a row, from left to
+target: 'According to this question, the objects are arranged in a row, from left to
 right, as follows: (1) a teal plate, (2) a burgundy keychain, (3) a yellow scrunchiephone charger, (4) an orange mug, (5) a pink notebook, (6) a grey cup.
...
...
lm_eval/tasks/bbh/cot_fewshot/ruin_names.yaml
...
@@ -24,9 +24,7 @@ fewshot_config:
 (C) whitesnuake
 (D) mwhitesnake'
-target: 'Let''s think step by step. The original name is "whitesnake". This is the name of an old English hard rock
+target: 'The original name is "whitesnake". This is the name of an old English hard rock
 band. It is a compound word, formed by the words "white" and "snake".
 (A) "whitesnape": It is formed by the combination of "white" and "snake"; therefore,
...
@@ -57,9 +55,7 @@ fewshot_config:
 (C) one of our dinosaurs is pissing
 (D) one of our dinosaur is missing'
-target: 'Let''s think step by step. The original name is "one of our dinosaurs is missing". This is the name of
+target: 'The original name is "one of our dinosaurs is missing". This is the name of
 an old British movie.
 (A) "ofne of our dinosaurs is missing": Here "one of" is changed to "ofne",
...
@@ -91,9 +87,7 @@ fewshot_config:
 (C) courting crows
 (D) coutnting crows'
-target: 'Let''s think step by step. The original name is "counting crows". This is the name of an American rock
+target: 'The original name is "counting crows". This is the name of an American rock
 band. Historically, the band name comes from the British nursery rhyme "One for Sorrow", which is about counting of magpies.
...
...
lm_eval/tasks/bbh/cot_fewshot/salient_translation_error_detection.yaml
...
@@ -42,9 +42,7 @@ fewshot_config:
 (E) Dropped Content
 (F) Facts'
-target: 'Let''s think step by step. We solve this question by first translating the source sentence to English and
+target: 'We solve this question by first translating the source sentence to English and
 then by comparing our translation with the provided translation. According to Google Translate, the correct translation of the source sentence from German to English is "The list of monuments in Lenzen (Elbe) includes all the monuments
...
@@ -70,9 +68,7 @@ fewshot_config:
 \ am Lech.\nThe translation contains an error pertaining to\nOptions:\n(A) Modifiers\
 \ or Adjectives\n(B) Numerical Values\n(C) Negation or Antonyms\n(D) Named Entities\n\
 (E) Dropped Content\n(F) Facts"
-target: 'Let''s think step by step. We solve this question by first translating the source sentence to English and
+target: 'We solve this question by first translating the source sentence to English and
 then by comparing our translation with the provided translation. According to Google Translate, the correct translation of the source sentence from German to English is "The monuments of the Upper Bavarian district town of Landsberg
...
@@ -98,7 +94,7 @@ fewshot_config:
 \ Voivodeship of Poland.\nThe translation contains an error pertaining to\n\
 Options:\n(A) Modifiers or Adjectives\n(B) Numerical Values\n(C) Negation or\
 \ Antonyms\n(D) Named Entities\n(E) Dropped Content\n(F) Facts"
-target: "Let's think step by step.\nWe solve this question by first translating\
+target: "We solve this question by first translating\
 \ the source sentence to English and then by comparing our translation with\
 \ the provided translation. According to Google Translate, the correct translation\
 \ of the source sentence from German to English is \"\u0141eba is a small town\
...
...
lm_eval/tasks/bbh/cot_fewshot/snarks.yaml
...
@@ -24,9 +24,7 @@ fewshot_config:
 (A) Yes, because having interests and actively researching them is a huge waste
 (B) Yes, because having interests and actively researching them is a huge deal'
-target: 'Let''s think step by step. If we look at (A), it says that having interests and actively researching them
+target: 'If we look at (A), it says that having interests and actively researching them
 is a huge waste, implying that it is a useless effort. However, we know that having interests and actively researching them is typically not a waste but rather is beneficial to the individual. The presence of such a juxtaposition
...
@@ -46,9 +44,7 @@ fewshot_config:
 (B) No one is going to disagree with you on this. Ad hominem attacks really help your case'
-target: 'Let''s think step by step. If we look at (A), it says that avoiding ad hominem attacks really help your
+target: 'If we look at (A), it says that avoiding ad hominem attacks really help your
 case, implying that ad hominem attacks are adverse and injurious. Because ad hominem attacks are adressed at a person rather than an idea, it is indeed true that avoiding them is often useful and helpful; so, (A) is a neutral (valid
...
@@ -70,9 +66,7 @@ fewshot_config:
 (B) Consistency in the league''s punishments? What do you think this is supposed to be, moral?'
-target: 'Let''s think step by step. If we look at (A), it likens the consistency in the league''s punishments with
+target: 'If we look at (A), it likens the consistency in the league''s punishments with
 that in politics. Because politics or political affairs are often not considered to be consistent or dependable, this sentence appears to be satirical.
...
...