gaoqiong / lm-evaluation-harness
Commit bd028848, authored Jul 18, 2025 by Baber
Merge branch 'main' into metrics
# Conflicts:
#   tests/test_tasks.py
Parents: 6e48110e, 56def33d
Changes: 108
Showing 20 changed files with 60 additions and 166 deletions
lm_eval/tasks/bbh/cot_fewshot/geometric_shapes.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/hyperbaton.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_five_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_seven_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_three_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/movie_recommendation.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/multistep_arithmetic_two.yaml  +3 -3
lm_eval/tasks/bbh/cot_fewshot/navigate.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/object_counting.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/penguins_in_a_table.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/reasoning_about_colored_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/ruin_names.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/salient_translation_error_detection.yaml  +3 -7
lm_eval/tasks/bbh/cot_fewshot/snarks.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/sports_understanding.yaml  +3 -3
lm_eval/tasks/bbh/cot_fewshot/temporal_sequences.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/tracking_shuffled_objects_five_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/tracking_shuffled_objects_seven_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/tracking_shuffled_objects_three_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/web_of_lies.yaml  +3 -9
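The same mechanical change repeats in every file on this page: the leading "Let's think step by step." sentence is dropped from each few-shot `target`. The sketch below illustrates that transformation on plain strings. It is only an illustration: `strip_cot_prefix` and `_COT_PREFIX` are hypothetical names, not part of lm-evaluation-harness, and the commit itself may have been produced some other way.

```python
import re

# Hypothetical helper for illustration only; not part of lm-evaluation-harness.
# Matches the sentence with a straight or curly apostrophe, plus any
# whitespace (including newlines) that follows it.
_COT_PREFIX = re.compile(r"^\s*Let['\u2019]s think step by step\.\s*")


def strip_cot_prefix(target: str) -> str:
    """Return `target` without a leading "Let's think step by step." sentence."""
    return _COT_PREFIX.sub("", target, count=1)


if __name__ == "__main__":
    before = "Let's think step by step.\nWake-up time: 1pm.\n1pm-2pm: free."
    print(strip_cot_prefix(before))
```

Targets that do not start with the sentence are returned unchanged, so applying the helper twice gives the same result as applying it once.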
lm_eval/tasks/bbh/cot_fewshot/geometric_shapes.yaml
@@ -36,9 +36,7 @@ fewshot_config:
  (I) sector
  (J) triangle'
- target: 'Let''s think step by step.
- This SVG path element contains "M" and "L" commands. M takes two parameters
+ target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
  (x,y) and moves the current point to the coordinates (x,y). L takes two parameters
  (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
@@ -90,9 +88,7 @@ fewshot_config:
  (I) sector
  (J) triangle'
- target: 'Let''s think step by step.
- This SVG path element contains "M" and "L" commands. M takes two parameters
+ target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
  (x,y) and moves the current point to the coordinates (x,y). L takes two parameters
  (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
@@ -138,9 +134,7 @@ fewshot_config:
  (I) sector
  (J) triangle'
- target: 'Let''s think step by step.
- This SVG path element contains "M" and "L" commands. M takes two parameters
+ target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
  (x,y) and moves the current point to the coordinates (x,y). L takes two parameters
  (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
lm_eval/tasks/bbh/cot_fewshot/hyperbaton.yaml
@@ -18,9 +18,7 @@ fewshot_config:
  (A) rubber terrible ship
  (B) terrible rubber ship'
- target: 'Let''s think step by step.
- When there is more than one adjective before a noun, the adjectives need to
+ target: 'When there is more than one adjective before a noun, the adjectives need to
  respect the following order before a noun: "[1. opinion] [2. size] [3. age]
  [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
@@ -39,9 +37,7 @@ fewshot_config:
  (A) repulsive small Brazilian exercise ship
  (B) Brazilian repulsive exercise small ship'
- target: 'Let''s think step by step.
- When there is more than one adjective before a noun, the adjectives need to
+ target: 'When there is more than one adjective before a noun, the adjectives need to
  respect the following order before a noun: "[1. opinion] [2. size] [3. age]
  [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
@@ -63,9 +59,7 @@ fewshot_config:
  (A) blue gold wonderful square shoe
  (B) wonderful square blue gold shoe'
- target: 'Let''s think step by step.
- When there is more than one adjective before a noun, the adjectives need to
+ target: 'When there is more than one adjective before a noun, the adjectives need to
  respect the following order before a noun: "[1. opinion] [2. size] [3. age]
  [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_five_objects.yaml
@@ -24,9 +24,7 @@ fewshot_config:
  (B) Eli finished last
  (C) Eve finished last'
- target: 'Let''s think step by step.
- (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+ target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
  (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
@@ -50,9 +48,7 @@ fewshot_config:
  (B) The green book is the leftmost
  (C) The orange book is the leftmost'
- target: 'Let''s think step by step.
- (1) The green book is to the right of the white book: "(left) ? white ? green
+ target: '(1) The green book is to the right of the white book: "(left) ? white ? green
  ? (right)".
  (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
@@ -76,9 +72,7 @@ fewshot_config:
  (B) The gray book is the leftmost
  (C) The white book is the leftmost'
- target: 'Let''s think step by step.
- (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+ target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
  (right)".
  (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_seven_objects.yaml
@@ -24,9 +24,7 @@ fewshot_config:
  (B) Eli finished last
  (C) Eve finished last'
- target: 'Let''s think step by step.
- (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+ target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
  (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
@@ -50,9 +48,7 @@ fewshot_config:
  (B) The green book is the leftmost
  (C) The orange book is the leftmost'
- target: 'Let''s think step by step.
- (1) The green book is to the right of the white book: "(left) ? white ? green
+ target: '(1) The green book is to the right of the white book: "(left) ? white ? green
  ? (right)".
  (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
@@ -76,9 +72,7 @@ fewshot_config:
  (B) The gray book is the leftmost
  (C) The white book is the leftmost'
- target: 'Let''s think step by step.
- (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+ target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
  (right)".
  (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_three_objects.yaml
@@ -24,9 +24,7 @@ fewshot_config:
  (B) Eli finished last
  (C) Eve finished last'
- target: 'Let''s think step by step.
- (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+ target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
  (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
@@ -50,9 +48,7 @@ fewshot_config:
  (B) The green book is the leftmost
  (C) The orange book is the leftmost'
- target: 'Let''s think step by step.
- (1) The green book is to the right of the white book: "(left) ? white ? green
+ target: '(1) The green book is to the right of the white book: "(left) ? white ? green
  ? (right)".
  (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
@@ -76,9 +72,7 @@ fewshot_config:
  (B) The gray book is the leftmost
  (C) The white book is the leftmost'
- target: 'Let''s think step by step.
- (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+ target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
  (right)".
  (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
lm_eval/tasks/bbh/cot_fewshot/movie_recommendation.yaml
@@ -26,9 +26,7 @@ fewshot_config:
  (D) The Barkley Marathons The Race That Eats Its Young
  (E) Bug'
- target: 'Let''s think step by step.
- - Star Wars Episode IV - A New Hope (action, adventure, fantasy; 1977)
+ target: '- Star Wars Episode IV - A New Hope (action, adventure, fantasy; 1977)
  - Indiana Jones and the Last Crusade (action, adventure; 1989)
@@ -54,9 +52,7 @@ fewshot_config:
  (D) The Salton Sea
  (E) Extreme Days'
- target: 'Let''s think step by step.
- - Twister (action, adventure, thriller; 1996)
+ target: '- Twister (action, adventure, thriller; 1996)
  - The Silence of the Lambs (crime, drama, thriller; 1991)
@@ -79,9 +75,7 @@ fewshot_config:
  (C) Catwoman
  (D) Edge of Tomorrow'
- target: 'Let''s think step by step.
- - Minority Report (action, crime, mystery; 2002)
+ target: '- Minority Report (action, crime, mystery; 2002)
  - Total Recall (action, adventure, science-fiction; 2012)
lm_eval/tasks/bbh/cot_fewshot/multistep_arithmetic_two.yaml
@@ -12,7 +12,7 @@ fewshot_config:
  sampler: first_n
  samples:
  - input: ((-5 + 9 * -4 - 0) * (4 + -7 + 0 * -5)) =
- target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+ target: "Let\u2019s recall that the order of operations\
  \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
  \ and division (from left to right), (4) addition and multiplication (from left\
  \ to right). So, remember to always compute the expressions inside parentheses\
@@ -23,7 +23,7 @@ fewshot_config:
  \ + 0) = (4 + -7) = (4 - 7) = -3.\nThen, the final equation is A * B = -41 *\
  \ -3 = (-61) * (-3) = 123. So the answer is 123."
  - input: ((-9 * 7 * 7 * -9) + (4 * -9 - 8 - -4)) =
- target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+ target: "Let\u2019s recall that the order of operations\
  \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
  \ and division (from left to right), (4) addition and multiplication (from left\
  \ to right). So, remember to always compute the expressions inside parentheses\
@@ -34,7 +34,7 @@ fewshot_config:
  \ - 8) - (-4)) = (-44 - (-4)) = -40.\nThen, the final equation is A + B = 3969\
  \ + -40 = 3969 - 40 = 3929. So the answer is 3929."
  - input: ((-3 + 5 * 8 * -4) - (9 - 8 * -7 + -9)) =
- target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+ target: "Let\u2019s recall that the order of operations\
  \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
  \ and division (from left to right), (4) addition and multiplication (from left\
  \ to right). So, remember to always compute the expressions inside parentheses\
lm_eval/tasks/bbh/cot_fewshot/navigate.yaml
@@ -21,9 +21,7 @@ fewshot_config:
  - Yes
  - No'
- target: 'Let''s think step by step.
- We start at the origin (0, 0), facing the positive y-axis.
+ target: 'We start at the origin (0, 0), facing the positive y-axis.
  (1) Turn left: (0, 0), facing the negative x-axis.
@@ -49,9 +47,7 @@ fewshot_config:
  - Yes
  - No'
- target: 'Let''s think step by step.
- We start at the origin (0, 0), facing the positive y-axis.
+ target: 'We start at the origin (0, 0), facing the positive y-axis.
  (1) Turn around: (0, 0), facing the negative y-axis.
@@ -76,9 +72,7 @@ fewshot_config:
  - Yes
  - No'
- target: 'Let''s think step by step.
- We start at the origin (0, 0), facing the positive y-axis.
+ target: 'We start at the origin (0, 0), facing the positive y-axis.
  (1) Always face forward: (0, 0), facing the positive y-axis.
lm_eval/tasks/bbh/cot_fewshot/object_counting.yaml
@@ -14,9 +14,7 @@ fewshot_config:
  samples:
  - input: I have a blackberry, a clarinet, a nectarine, a plum, a strawberry, a banana,
  a flute, an orange, and a violin. How many fruits do I have?
- target: 'Let''s think step by step.
- We first identify the fruits on the list and include their quantity in parentheses:
+ target: 'We first identify the fruits on the list and include their quantity in parentheses:
  - blackberry (1)
@@ -34,9 +32,7 @@ fewshot_config:
  answer is 6.'
  - input: I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape,
  a nectarine, and three plums. How many fruits do I have?
- target: 'Let''s think step by step.
- We first identify the fruits on the list and include their quantity in parentheses:
+ target: 'We first identify the fruits on the list and include their quantity in parentheses:
  - orange (1)
@@ -58,9 +54,7 @@ fewshot_config:
  11. So the answer is 11.'
  - input: I have a lettuce head, a head of broccoli, an onion, a stalk of celery,
  two carrots, a garlic, and a yam. How many vegetables do I have?
- target: 'Let''s think step by step.
- We first identify the vegetables on the list and include their quantity in parentheses:
+ target: 'We first identify the vegetables on the list and include their quantity in parentheses:
  - lettuce (1)
lm_eval/tasks/bbh/cot_fewshot/penguins_in_a_table.yaml
@@ -32,9 +32,7 @@ fewshot_config:
  (D) 4
  (E) 5'
- target: 'Let''s think step by step.
- This question focuses on age. We know the following: Louis is 7 years old, Bernard
+ target: 'This question focuses on age. We know the following: Louis is 7 years old, Bernard
  is 5 years old, Vincent is 9 years old, and Gwen is 8 years old.
  Now, we add James to this table: James is 12 years old.
@@ -59,9 +57,7 @@ fewshot_config:
  (D) Gwen
  (E) James'
- target: 'Let''s think step by step.
- This question focuses on age. We know the following: Louis is 7 years old, Bernard
+ target: 'This question focuses on age. We know the following: Louis is 7 years old, Bernard
  is 5 years old, Vincent is 9 years old, and Gwen is 8 years old.
  According to the table, Bernard (5) is the youngest amongst them.
@@ -84,9 +80,7 @@ fewshot_config:
  (D) Gwen
  (E) James'
- target: 'Let''s think step by step.
- This question focuses on the name. We know the following: The names of the penguin
+ target: 'This question focuses on the name. We know the following: The names of the penguin
  in the table are Louis, Bernard, Vincent, and Gwen.
  When we sort their names alphabetically, we get Bernard, Gwen, Louis, Vincent.
lm_eval/tasks/bbh/cot_fewshot/reasoning_about_colored_objects.yaml
@@ -52,9 +52,7 @@ fewshot_config:
  (Q) purple
  (R) pink'
- target: 'Let''s think step by step.
- According to this question, the color of the stress ball is blue. So the answer
+ target: 'According to this question, the color of the stress ball is blue. So the answer
  is (E).'
  - input: 'On the table, you see a bunch of objects arranged in a row: a purple
  paperclip, a pink stress ball, a brown keychain, a green scrunchiephone charger, a mauve
@@ -98,9 +96,7 @@ fewshot_config:
  (Q) purple
  (R) pink'
- target: 'Let''s think step by step.
- According to this question, the objects are arranged in a row, from left to
+ target: 'According to this question, the objects are arranged in a row, from left to
  right, as follows: (1) a purple paperclip, (2) a pink stress ball, (3) a brown
  keychain, (4) a green scrunchiephone charger, (5) a mauve fidget spinner, (6) a burgundy pen.
@@ -129,9 +125,7 @@ fewshot_config:
  (F) five
  (G) six'
- target: 'Let''s think step by step.
- According to this question, the objects are arranged in a row, from left to
+ target: 'According to this question, the objects are arranged in a row, from left to
  right, as follows: (1) a teal plate, (2) a burgundy keychain, (3) a yellow
  scrunchiephone charger, (4) an orange mug, (5) a pink notebook, (6) a grey cup.
lm_eval/tasks/bbh/cot_fewshot/ruin_names.yaml
@@ -24,9 +24,7 @@ fewshot_config:
  (C) whitesnuake
  (D) mwhitesnake'
- target: 'Let''s think step by step.
- The original name is "whitesnake". This is the name of an old English hard rock
+ target: 'The original name is "whitesnake". This is the name of an old English hard rock
  band. It is a compound word, formed by the words "white" and "snake".
  (A) "whitesnape": It is formed by the combination of "white" and "snake"; therefore,
@@ -57,9 +55,7 @@ fewshot_config:
  (C) one of our dinosaurs is pissing
  (D) one of our dinosaur is missing'
- target: 'Let''s think step by step.
- The original name is "one of our dinosaurs is missing". This is the name of
+ target: 'The original name is "one of our dinosaurs is missing". This is the name of
  an old British movie.
  (A) "ofne of our dinosaurs is missing": Here "one of" is changed to "ofne",
@@ -91,9 +87,7 @@ fewshot_config:
  (C) courting crows
  (D) coutnting crows'
- target: 'Let''s think step by step.
- The original name is "counting crows". This is the name of an American rock
+ target: 'The original name is "counting crows". This is the name of an American rock
  band. Historically, the band name comes from the British nursery rhyme "One
  for Sorrow", which is about counting of magpies.
lm_eval/tasks/bbh/cot_fewshot/salient_translation_error_detection.yaml
@@ -42,9 +42,7 @@ fewshot_config:
  (E) Dropped Content
  (F) Facts'
- target: 'Let''s think step by step.
- We solve this question by first translating the source sentence to English and
+ target: 'We solve this question by first translating the source sentence to English and
  then by comparing our translation with the provided translation. According to
  Google Translate, the correct translation of the source sentence from German to
  English is "The list of monuments in Lenzen (Elbe) includes all the monuments
@@ -70,9 +68,7 @@ fewshot_config:
  \ am Lech.\nThe translation contains an error pertaining to\nOptions:\n(A) Modifiers\
  \ or Adjectives\n(B) Numerical Values\n(C) Negation or Antonyms\n(D) Named Entities\n\
  (E) Dropped Content\n(F) Facts"
- target: 'Let''s think step by step.
- We solve this question by first translating the source sentence to English and
+ target: 'We solve this question by first translating the source sentence to English and
  then by comparing our translation with the provided translation. According to
  Google Translate, the correct translation of the source sentence from German to
  English is "The monuments of the Upper Bavarian district town of Landsberg
@@ -98,7 +94,7 @@ fewshot_config:
  \ Voivodeship of Poland.\nThe translation contains an error pertaining to\n\
  Options:\n(A) Modifiers or Adjectives\n(B) Numerical Values\n(C) Negation or\
  \ Antonyms\n(D) Named Entities\n(E) Dropped Content\n(F) Facts"
- target: "Let's think step by step.\nWe solve this question by first translating\
+ target: "We solve this question by first translating\
  \ the source sentence to English and then by comparing our translation with\
  \ the provided translation. According to Google Translate, the correct translation\
  \ of the source sentence from German to English is \"\u0141eba is a small town\
lm_eval/tasks/bbh/cot_fewshot/snarks.yaml
@@ -24,9 +24,7 @@ fewshot_config:
  (A) Yes, because having interests and actively researching them is a huge waste
  (B) Yes, because having interests and actively researching them is a huge deal'
- target: 'Let''s think step by step.
- If we look at (A), it says that having interests and actively researching them
+ target: 'If we look at (A), it says that having interests and actively researching them
  is a huge waste, implying that it is a useless effort. However, we know that
  having interests and actively researching them is typically not a waste but
  rather is beneficial to the individual. The presence of such a juxtaposition
@@ -46,9 +44,7 @@ fewshot_config:
  (B) No one is going to disagree with you on this. Ad hominem attacks really help
  your case'
- target: 'Let''s think step by step.
- If we look at (A), it says that avoiding ad hominem attacks really help your
+ target: 'If we look at (A), it says that avoiding ad hominem attacks really help your
  case, implying that ad hominem attacks are adverse and injurious. Because ad
  hominem attacks are adressed at a person rather than an idea, it is indeed true
  that avoiding them is often useful and helpful; so, (A) is a neutral (valid
@@ -70,9 +66,7 @@ fewshot_config:
  (B) Consistency in the league''s punishments? What do you think this is supposed
  to be, moral?'
- target: 'Let''s think step by step.
- If we look at (A), it likens the consistency in the league''s punishments with
+ target: 'If we look at (A), it likens the consistency in the league''s punishments with
  that in politics. Because politics or political affairs are often not considered
  to be consistent or dependable, this sentence appears to be satirical.
lm_eval/tasks/bbh/cot_fewshot/sports_understanding.yaml
@@ -14,15 +14,15 @@ fewshot_config:
  samples:
  - input: Is the following sentence plausible? "Bam Adebayo scored a reverse layup
  in the Western Conference Finals."
- target: Let's think step by step. Bam Adebayo is an American basketball player.
+ target: Bam Adebayo is an American basketball player.
  Scoring a reverse layup in the Western Conference Finals is part of the NBA
  Finals. So the answer is yes.
  - input: Is the following sentence plausible? "Santi Cazorla scored a touchdown."
- target: Let's think step by step. Santi Cazorla is a soccer player. Touchdown
+ target: Santi Cazorla is a soccer player. Touchdown
  is part of American football and rugby. So the answer is no.
  - input: Is the following sentence plausible? "DeMar DeRozan was called for the
  goal tend."
- target: Let's think step by step. DeMar DeRozan is an American basketball player.
+ target: DeMar DeRozan is an American basketball player.
  Goal tending is part of basketball. So the answer is yes.
  include: _cot_fewshot_template_yaml
  task: bbh_cot_fewshot_sports_understanding
lm_eval/tasks/bbh/cot_fewshot/temporal_sequences.yaml
@@ -39,9 +39,7 @@ fewshot_config:
  (C) 5pm to 6pm
  (D) 2pm to 4pm'
- target: 'Let''s think step by step.
- Wake-up time: 1pm.
+ target: 'Wake-up time: 1pm.
  1pm-2pm: free.
@@ -88,9 +86,7 @@ fewshot_config:
  (C) 1pm to 2pm
  (D) 3pm to 6pm'
- target: 'Let''s think step by step.
- Wake-up time: 7am.
+ target: 'Wake-up time: 7am.
  7am-1pm: free.
@@ -137,9 +133,7 @@ fewshot_config:
  (C) 5am to 6am
  (D) 3pm to 4pm'
- target: 'Let''s think step by step.
- Wake-up time: 5am.
+ target: 'Wake-up time: 5am.
  5am-6am: getting a coffee at the cafe.
lm_eval/tasks/bbh/cot_fewshot/tracking_shuffled_objects_five_objects.yaml
@@ -27,9 +27,7 @@ fewshot_config:
  (B) blue ball
  (C) pink ball'
- target: 'Let''s think step by step.
- (0) At the start: Alice: yellow, Bob: blue, Claire: pink.
+ target: '(0) At the start: Alice: yellow, Bob: blue, Claire: pink.
  (1) Claire and Alice swap balls: Alice: pink, Bob: blue, Claire: yellow.
@@ -53,9 +51,7 @@ fewshot_config:
  (B) purple ball
  (C) pink ball'
- target: 'Let''s think step by step.
- (0) At the start: Alice: white, Bob: purple, Claire: pink.
+ target: '(0) At the start: Alice: white, Bob: purple, Claire: pink.
  (1) Bob and Alice swap balls: Alice: purple, Bob: white, Claire: pink.
@@ -79,9 +75,7 @@ fewshot_config:
  (B) Rodrigo
  (C) Patrick'
- target: 'Let''s think step by step.
- (0) At the start: Alice: Lola, Bob: Rodrigo, Claire: Patrick.
+ target: '(0) At the start: Alice: Lola, Bob: Rodrigo, Claire: Patrick.
  (1) Alice and Bob switch partners: Alice: Rodrigo, Bob: Lola, Claire: Patrick.
lm_eval/tasks/bbh/cot_fewshot/tracking_shuffled_objects_seven_objects.yaml
@@ -27,9 +27,7 @@ fewshot_config:
  (B) blue ball
  (C) pink ball'
- target: 'Let''s think step by step.
- (0) At the start: Alice: yellow, Bob: blue, Claire: pink.
+ target: '(0) At the start: Alice: yellow, Bob: blue, Claire: pink.
  (1) Claire and Alice swap balls: Alice: pink, Bob: blue, Claire: yellow.
@@ -53,9 +51,7 @@ fewshot_config:
  (B) purple ball
  (C) pink ball'
- target: 'Let''s think step by step.
- (0) At the start: Alice: white, Bob: purple, Claire: pink.
+ target: '(0) At the start: Alice: white, Bob: purple, Claire: pink.
  (1) Bob and Alice swap balls: Alice: purple, Bob: white, Claire: pink.
@@ -79,9 +75,7 @@ fewshot_config:
  (B) Rodrigo
  (C) Patrick'
- target: 'Let''s think step by step.
- (0) At the start: Alice: Lola, Bob: Rodrigo, Claire: Patrick.
+ target: '(0) At the start: Alice: Lola, Bob: Rodrigo, Claire: Patrick.
  (1) Alice and Bob switch partners: Alice: Rodrigo, Bob: Lola, Claire: Patrick.
lm_eval/tasks/bbh/cot_fewshot/tracking_shuffled_objects_three_objects.yaml
@@ -27,9 +27,7 @@ fewshot_config:
  (B) blue ball
  (C) pink ball'
- target: 'Let''s think step by step.
- (0) At the start: Alice: yellow, Bob: blue, Claire: pink.
+ target: '(0) At the start: Alice: yellow, Bob: blue, Claire: pink.
  (1) Claire and Alice swap balls: Alice: pink, Bob: blue, Claire: yellow.
@@ -53,9 +51,7 @@ fewshot_config:
  (B) purple ball
  (C) pink ball'
- target: 'Let''s think step by step.
- (0) At the start: Alice: white, Bob: purple, Claire: pink.
+ target: '(0) At the start: Alice: white, Bob: purple, Claire: pink.
  (1) Bob and Alice swap balls: Alice: purple, Bob: white, Claire: pink.
@@ -79,9 +75,7 @@ fewshot_config:
  (B) Rodrigo
  (C) Patrick'
- target: 'Let''s think step by step.
- (0) At the start: Alice: Lola, Bob: Rodrigo, Claire: Patrick.
+ target: '(0) At the start: Alice: Lola, Bob: Rodrigo, Claire: Patrick.
  (1) Alice and Bob switch partners: Alice: Rodrigo, Bob: Lola, Claire: Patrick.
lm_eval/tasks/bbh/cot_fewshot/web_of_lies.yaml
@@ -14,9 +14,7 @@ fewshot_config:
  - input: 'Question: Fidel tells the truth. Jerry says Fidel tells the truth. Vina
  says Jerry tells the truth. Millicent says Vina lies. Raymond says Millicent
  lies. Does Raymond tell the truth?'
- target: 'Let''s think step by step.
- (1) Fidel tells the truth. So, we know that Fidel tells the truth.
+ target: '(1) Fidel tells the truth. So, we know that Fidel tells the truth.
  (2) Jerry says Fidel tells the truth. Since we know from (1) that Fidel tells
  the truth, if Jerry says that Fidel tells the truth, then Jerry tells the truth.
@@ -35,9 +33,7 @@ fewshot_config:
  - input: 'Question: Kristian lies. Millie says Kristian lies. Maybelle says Millie
  tells the truth. Fidel says Maybelle lies. Leda says Fidel lies. Does Leda tell
  the truth?'
- target: 'Let''s think step by step.
- (1) Kristian lies. So, we know that Kristian lies.
+ target: '(1) Kristian lies. So, we know that Kristian lies.
  (2) Millie says Kristian lies. Since we know from (1) that Kristian lies, if
  Millie says Kristian lies, then Millie tells the truth.
@@ -57,9 +53,7 @@ fewshot_config:
  - input: 'Question: Kristian tells the truth. Michaela says Kristian lies. Raymond
  says Michaela tells the truth. Osvaldo says Raymond tells the truth. Jamey says
  Osvaldo tells the truth. Does Jamey tell the truth?'
- target: 'Let''s think step by step.
- (1) Kristian tells the truth. So, we know that Kristian tells the truth.
+ target: '(1) Kristian tells the truth. So, we know that Kristian tells the truth.
  (2) Michaela says Kristian lies. Since we know from (1) that Kristian tells
  the truth, if Michaela says Kristian lies, then Michaela lies.