gaoqiong / lm-evaluation-harness
Commit b58e5556
Authored Jul 27, 2025 by Baber
Merge branch 'main' into tasklist
# Conflicts: # pyproject.toml
Parents: 6e1866f5, 4f8195f1
Changes: 340
Showing 20 changed files with 59 additions and 158 deletions (+59 -158)
lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml  +2 -1
lm_eval/tasks/bbh/cot_fewshot/causal_judgement.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/date_understanding.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/disambiguation_qa.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/dyck_languages.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/formal_fallacies.yaml  +3 -3
lm_eval/tasks/bbh/cot_fewshot/geometric_shapes.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/hyperbaton.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_five_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_seven_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_three_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/movie_recommendation.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/multistep_arithmetic_two.yaml  +3 -3
lm_eval/tasks/bbh/cot_fewshot/navigate.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/object_counting.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/penguins_in_a_table.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/reasoning_about_colored_objects.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/ruin_names.yaml  +3 -9
lm_eval/tasks/bbh/cot_fewshot/salient_translation_error_detection.yaml  +3 -7
lm_eval/tasks/bbh/cot_fewshot/snarks.yaml  +3 -9
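As a quick consistency check, the per-file counts in the file list can be summed and compared with the totals in the commit header. The sketch below only restates numbers already shown on this page; the dictionary name `per_file` is illustrative and file names are shortened to their basenames.

```python
# (additions, deletions) per changed file, copied from the commit's file list.
per_file = {
    "_cot_fewshot_template_yaml": (2, 1),
    "causal_judgement.yaml": (3, 9),
    "date_understanding.yaml": (3, 9),
    "disambiguation_qa.yaml": (3, 9),
    "dyck_languages.yaml": (3, 9),
    "formal_fallacies.yaml": (3, 3),
    "geometric_shapes.yaml": (3, 9),
    "hyperbaton.yaml": (3, 9),
    "logical_deduction_five_objects.yaml": (3, 9),
    "logical_deduction_seven_objects.yaml": (3, 9),
    "logical_deduction_three_objects.yaml": (3, 9),
    "movie_recommendation.yaml": (3, 9),
    "multistep_arithmetic_two.yaml": (3, 3),
    "navigate.yaml": (3, 9),
    "object_counting.yaml": (3, 9),
    "penguins_in_a_table.yaml": (3, 9),
    "reasoning_about_colored_objects.yaml": (3, 9),
    "ruin_names.yaml": (3, 9),
    "salient_translation_error_detection.yaml": (3, 7),
    "snarks.yaml": (3, 9),
}

total_additions = sum(added for added, _ in per_file.values())
total_deletions = sum(deleted for _, deleted in per_file.values())

print(len(per_file), total_additions, total_deletions)  # 20 59 158
```

The totals agree with the header line ("20 changed files with 59 additions and 158 deletions").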
lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml
@@ -2,6 +2,7 @@ dataset_path: SaylorTwift/bbh
 output_type: generate_until
 test_split: test
 doc_to_target: "{{target}}"
+target_delimiter: ""
 metric_list:
   - metric: exact_match
     aggregation: mean
@@ -24,4 +25,4 @@ filter_list:
       - function: "take_first"
 num_fewshot: 3
 metadata:
-  version: 3.0
+  version: 4.0
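The template change above adds `target_delimiter: ""`, and the per-file hunks in this commit remove the leading "Let's think step by step." from each few-shot `target`. A minimal sketch of why the two changes plausibly belong together follows. It assumes the prompt itself already ends with that phrase (the template's `doc_to_text` is not shown in this diff, so the `context` string is hypothetical) and that a few-shot example is formed by joining prompt, delimiter, and target. The helper `join_example` is illustrative and is not the harness's API.

```python
PHRASE = "Let's think step by step."


def join_example(context: str, target: str, delimiter: str) -> str:
    """Join a prompt and its gold target with a delimiter (illustrative helper)."""
    return f"{context}{delimiter}{target}"


# Hypothetical prompt ending; the real doc_to_text is not part of this diff.
context = f"Q: <question>\nA: {PHRASE}\n"

# Before: the target repeats the phrase and is joined with a single space.
old_example = join_example(context, f"{PHRASE}\nHere in this question, ...", " ")

# After: the target starts directly with the reasoning and the delimiter is empty.
new_example = join_example(context, "Here in this question, ...", "")

print(old_example.count(PHRASE))  # 2
print(new_example.count(PHRASE))  # 1
```

Under these assumptions the old configuration yields the phrase twice per few-shot example, while the new one yields it once.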
lm_eval/tasks/bbh/cot_fewshot/causal_judgement.yaml
@@ -26,9 +26,7 @@ fewshot_config:
       - Yes
       - No'
-    target: 'Let''s think step by step.
-      Here in this question, we are told that "Frank T. had no experience with guns,
+    target: 'Here in this question, we are told that "Frank T. had no experience with guns,
       his hand slipped on the barrel of the gun, and the shot went wild." A typical
       person would assume that this passage suggests that Frank T. had no intention
       of shooting and injuring someone and that the bullet accidentally hit the neighbor''s
@@ -50,9 +48,7 @@ fewshot_config:
       - Yes
       - No'
-    target: 'Let''s think step by step.
-      Here in this question, we are told that the boss ordered them both to arrive
+    target: 'Here in this question, we are told that the boss ordered them both to arrive
       at the meeting room at the same time and that the motion detector was set up
       to be triggered if at least one person appeared in the room at the same time."
       A typical person would assume that the person probably meant to say the detector
@@ -82,9 +78,7 @@ fewshot_config:
       - Yes
       - No'
-    target: 'Let''s think step by step.
-      Here in this question, we are told that "He aims the dart at the low point region."
+    target: 'Here in this question, we are told that "He aims the dart at the low point region."
       A typical person might therefore think George did intentionally hit the low
       point region, because he wanted to lift up the spirit of his sister Lena. So
       the answer is Yes.'
lm_eval/tasks/bbh/cot_fewshot/date_understanding.yaml
@@ -26,9 +26,7 @@ fewshot_config:
       (E) 07/14/1938
       (F) 12/14/1988'
-    target: 'Let''s think step by step.
-      If today is Christmas Eve of 1937, then today''s date is December 24, 1937.
+    target: 'If today is Christmas Eve of 1937, then today''s date is December 24, 1937.
       10 days before today is December 14, 1937, that is 12/14/1937. So the answer
       is (D).'
   - input: 'Tomorrow is 11/12/2019. What is the date one year ago from today in MM/DD/YYYY?
@@ -44,9 +42,7 @@ fewshot_config:
       (D) 11/02/2018
       (E) 11/04/2018'
-    target: 'Let''s think step by step.
-      If tomorrow is 11/12/2019, then today is 11/11/2019. The date one year ago from
+    target: 'If tomorrow is 11/12/2019, then today is 11/11/2019. The date one year ago from
       today is 11/11/2018. So the answer is (B).'
   - input: 'Jane and John married on Jan 2, 1958. It is their 5-year anniversary today.
       What is the date tomorrow in MM/DD/YYYY?
@@ -64,9 +60,7 @@ fewshot_config:
       (E) 01/03/1982
       (F) 12/03/1960'
-    target: 'Let''s think step by step.
-      If Jane and John married on Jan 2, 1958, then and if it is their 5-year anniversary
+    target: 'If Jane and John married on Jan 2, 1958, then and if it is their 5-year anniversary
       today, then today''s date is Jan 2, 1963. The date tomorrow is Jan 3, 1963,
       that is 01/03/1963. So the answer is (B).'
 include: _cot_fewshot_template_yaml
lm_eval/tasks/bbh/cot_fewshot/disambiguation_qa.yaml
@@ -23,9 +23,7 @@ fewshot_config:
       (B) The counselor took the day off
       (C) Ambiguous'
-    target: 'Let''s think step by step.
-      Here we need to determine who the pronoun "they" might be referring to. There
+    target: 'Here we need to determine who the pronoun "they" might be referring to. There
       are two possible referents for "they", namely the chief and the counselor. The
       verb "told" might be able to help us determine which one is more likely (if
       either). Let X be the chief and Y the counselor. The sentence is then of the
@@ -54,9 +52,7 @@ fewshot_config:
       (B) The manager didn''t reply yet
       (C) Ambiguous'
-    target: 'Let''s think step by step.
-      Here we need to determine who the pronoun "he" might be referring to. There
+    target: 'Here we need to determine who the pronoun "he" might be referring to. There
       are two possible referents for "he", namely the manager and the secretary. The
       verbs "sent" and "reply" might be able to help us determine which one is more
       likely (if either). Let X be the manager and Y the secretary. The sentence is
@@ -84,9 +80,7 @@ fewshot_config:
       (B) It will be the director''s office
       (C) Ambiguous'
-    target: 'Let''s think step by step.
-      Here we need to determine who the pronoun "his" might be referring to. There
+    target: 'Here we need to determine who the pronoun "his" might be referring to. There
       are two possible referents for "his", namely Bailey''s and the director''s.
       The verb phrase "plan to meet" might be able to help us determine which one
       is more likely (if either). Let X be Bailey and Y the director. The sentence
lm_eval/tasks/bbh/cot_fewshot/dyck_languages.yaml
@@ -13,9 +13,7 @@ fewshot_config:
   samples:
   - input: 'Complete the rest of the sequence, making sure that the parentheses are
       closed properly. Input: [ { ['
-    target: 'Let''s think step by step.
-      We should process each input one by one and keep track of the stack configuration.
+    target: 'We should process each input one by one and keep track of the stack configuration.
       0: empty stack
@@ -32,9 +30,7 @@ fewshot_config:
       So, we need "]", "}", "]". So the answer is ] } ].'
   - input: 'Complete the rest of the sequence, making sure that the parentheses are
       closed properly. Input: < > ( ( [ [ ( { } ) [ < > ] ]'
-    target: 'Let''s think step by step.
-      We should process each input one by one and keep track of the stack configuration.
+    target: 'We should process each input one by one and keep track of the stack configuration.
       0: empty stack
@@ -76,9 +72,7 @@ fewshot_config:
   - input: 'Complete the rest of the sequence, making sure that the parentheses are
       closed properly. Input: < [ < [ { < [ ] < { } > > } ] > { { ( ) } { < [ < >
       ] > }'
-    target: 'Let''s think step by step.
-      We should process each input one by one and keep track of the stack configuration.
+    target: 'We should process each input one by one and keep track of the stack configuration.
       0: empty stack
lm_eval/tasks/bbh/cot_fewshot/formal_fallacies.yaml
@@ -25,7 +25,7 @@ fewshot_config:
       - valid
       - invalid'
-    target: "Let's think step by step.\n(1) Lesley is a close friend of Fernando:\
+    target: "(1) Lesley is a close friend of Fernando:\
       \ Lesley = friend(Fernando).\n(2) Being a close friend of Fernando or a schoolmate\
       \ of Lowell is sufficient for being a great-grandfather of Leroy: If X = friend(Fernando)\
       \ OR SCHOOLMATE(Lowell), then X = great-grandfather(Leroy).\nHypothesis: Does\
@@ -49,7 +49,7 @@ fewshot_config:
       - valid
       - invalid'
-    target: "Let's think step by step.\n(1) Whoever is not a great-grandfather of\
+    target: "(1) Whoever is not a great-grandfather of\
       \ Clyde is a stepbrother of Brian: If X = NOT (great-grandfather(Clyde)), then\
       \ X = stepbrother(Brian).\n(2): Being an ancestor of Dana is sufficient for\
       \ not being a great-grandfather of Clyde: If X = ancestor(Dana), X = NOT (great-grandfather(Clyde)).\n\
@@ -78,7 +78,7 @@ fewshot_config:
       - valid
       - invalid'
-    target: "Let's think step by step.\n(1) Every infrequent user of Paul Mitchell\
+    target: "(1) Every infrequent user of Paul Mitchell\
       \ shampoo is either a rare consumer of Nioxin shampoo or a loyal buyer of Caress\
       \ soap, or both: If X = infrequent-user(Paul Mitchell), then X = rare-consumer(Nioxin)\
       \ OR X = loyal-buyer(Caress).\n(2): No regular consumer of Lush soap is a rare\
lm_eval/tasks/bbh/cot_fewshot/geometric_shapes.yaml
@@ -36,9 +36,7 @@ fewshot_config:
       (I) sector
       (J) triangle'
-    target: 'Let''s think step by step.
-      This SVG path element contains "M" and "L" commands. M takes two parameters
+    target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
       (x,y) and moves the current point to the coordinates (x,y). L takes two parameters
       (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
@@ -90,9 +88,7 @@ fewshot_config:
       (I) sector
       (J) triangle'
-    target: 'Let''s think step by step.
-      This SVG path element contains "M" and "L" commands. M takes two parameters
+    target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
       (x,y) and moves the current point to the coordinates (x,y). L takes two parameters
       (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
@@ -138,9 +134,7 @@ fewshot_config:
       (I) sector
       (J) triangle'
-    target: 'Let''s think step by step.
-      This SVG path element contains "M" and "L" commands. M takes two parameters
+    target: 'This SVG path element contains "M" and "L" commands. M takes two parameters
       (x,y) and moves the current point to the coordinates (x,y). L takes two parameters
       (x,y) and draws a line from the previous coordinate to the new coordinate (x,y).
lm_eval/tasks/bbh/cot_fewshot/hyperbaton.yaml
@@ -18,9 +18,7 @@ fewshot_config:
       (A) rubber terrible ship
       (B) terrible rubber ship'
-    target: 'Let''s think step by step.
-      When there is more than one adjective before a noun, the adjectives need to
+    target: 'When there is more than one adjective before a noun, the adjectives need to
       respect the following order before a noun: "[1. opinion] [2. size] [3. age]
       [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
@@ -39,9 +37,7 @@ fewshot_config:
       (A) repulsive small Brazilian exercise ship
       (B) Brazilian repulsive exercise small ship'
-    target: 'Let''s think step by step.
-      When there is more than one adjective before a noun, the adjectives need to
+    target: 'When there is more than one adjective before a noun, the adjectives need to
       respect the following order before a noun: "[1. opinion] [2. size] [3. age]
       [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
@@ -63,9 +59,7 @@ fewshot_config:
       (A) blue gold wonderful square shoe
       (B) wonderful square blue gold shoe'
-    target: 'Let''s think step by step.
-      When there is more than one adjective before a noun, the adjectives need to
+    target: 'When there is more than one adjective before a noun, the adjectives need to
       respect the following order before a noun: "[1. opinion] [2. size] [3. age]
       [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_five_objects.yaml
@@ -24,9 +24,7 @@ fewshot_config:
       (B) Eli finished last
       (C) Eve finished last'
-    target: 'Let''s think step by step.
-      (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+    target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
       (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
@@ -50,9 +48,7 @@ fewshot_config:
       (B) The green book is the leftmost
       (C) The orange book is the leftmost'
-    target: 'Let''s think step by step.
-      (1) The green book is to the right of the white book: "(left) ? white ? green
+    target: '(1) The green book is to the right of the white book: "(left) ? white ? green
       ? (right)".
       (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
@@ -76,9 +72,7 @@ fewshot_config:
       (B) The gray book is the leftmost
       (C) The white book is the leftmost'
-    target: 'Let''s think step by step.
-      (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+    target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
       (right)".
       (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_seven_objects.yaml
@@ -24,9 +24,7 @@ fewshot_config:
       (B) Eli finished last
       (C) Eve finished last'
-    target: 'Let''s think step by step.
-      (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+    target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
       (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
@@ -50,9 +48,7 @@ fewshot_config:
       (B) The green book is the leftmost
       (C) The orange book is the leftmost'
-    target: 'Let''s think step by step.
-      (1) The green book is to the right of the white book: "(left) ? white ? green
+    target: '(1) The green book is to the right of the white book: "(left) ? white ? green
       ? (right)".
       (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
@@ -76,9 +72,7 @@ fewshot_config:
       (B) The gray book is the leftmost
       (C) The white book is the leftmost'
-    target: 'Let''s think step by step.
-      (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+    target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
       (right)".
       (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
lm_eval/tasks/bbh/cot_fewshot/logical_deduction_three_objects.yaml
@@ -24,9 +24,7 @@ fewshot_config:
       (B) Eli finished last
       (C) Eve finished last'
-    target: 'Let''s think step by step.
-      (1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
+    target: '(1) Eve finished above Amy: "(above) ? Eve ? Amy ? (below)".
       (2) Eli finished below Amy: "(above) ? Amy ? Eli ? (below)".
@@ -50,9 +48,7 @@ fewshot_config:
       (B) The green book is the leftmost
       (C) The orange book is the leftmost'
-    target: 'Let''s think step by step.
-      (1) The green book is to the right of the white book: "(left) ? white ? green
+    target: '(1) The green book is to the right of the white book: "(left) ? white ? green
       ? (right)".
       (2) The orange book is the rightmost: "(left) ? white ? green orange (right)".
@@ -76,9 +72,7 @@ fewshot_config:
       (B) The gray book is the leftmost
       (C) The white book is the leftmost'
-    target: 'Let''s think step by step.
-      (1) The white book is to the left of the gray book: "(left) ? white ? gray ?
+    target: '(1) The white book is to the left of the gray book: "(left) ? white ? gray ?
       (right)".
       (2) The red book is the second from the left: "(left) ? white red gray ? (right)".
lm_eval/tasks/bbh/cot_fewshot/movie_recommendation.yaml
@@ -26,9 +26,7 @@ fewshot_config:
       (D) The Barkley Marathons The Race That Eats Its Young
       (E) Bug'
-    target: 'Let''s think step by step.
-      - Star Wars Episode IV - A New Hope (action, adventure, fantasy; 1977)
+    target: '- Star Wars Episode IV - A New Hope (action, adventure, fantasy; 1977)
       - Indiana Jones and the Last Crusade (action, adventure; 1989)
@@ -54,9 +52,7 @@ fewshot_config:
       (D) The Salton Sea
       (E) Extreme Days'
-    target: 'Let''s think step by step.
-      - Twister (action, adventure, thriller; 1996)
+    target: '- Twister (action, adventure, thriller; 1996)
       - The Silence of the Lambs (crime, drama, thriller; 1991)
@@ -79,9 +75,7 @@ fewshot_config:
       (C) Catwoman
       (D) Edge of Tomorrow'
-    target: 'Let''s think step by step.
-      - Minority Report (action, crime, mystery; 2002)
+    target: '- Minority Report (action, crime, mystery; 2002)
       - Total Recall (action, adventure, science-fiction; 2012)
lm_eval/tasks/bbh/cot_fewshot/multistep_arithmetic_two.yaml
@@ -12,7 +12,7 @@ fewshot_config:
   sampler: first_n
   samples:
   - input: ((-5 + 9 * -4 - 0) * (4 + -7 + 0 * -5)) =
-    target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+    target: "Let\u2019s recall that the order of operations\
       \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
       \ and division (from left to right), (4) addition and multiplication (from left\
       \ to right). So, remember to always compute the expressions inside parentheses\
@@ -23,7 +23,7 @@ fewshot_config:
       \ + 0) = (4 + -7) = (4 - 7) = -3.\nThen, the final equation is A * B = -41 *\
       \ -3 = (-61) * (-3) = 123. So the answer is 123."
   - input: ((-9 * 7 * 7 * -9) + (4 * -9 - 8 - -4)) =
-    target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+    target: "Let\u2019s recall that the order of operations\
       \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
       \ and division (from left to right), (4) addition and multiplication (from left\
       \ to right). So, remember to always compute the expressions inside parentheses\
@@ -34,7 +34,7 @@ fewshot_config:
       \ - 8) - (-4)) = (-44 - (-4)) = -40.\nThen, the final equation is A + B = 3969\
       \ + -40 = 3969 - 40 = 3929. So the answer is 3929."
   - input: ((-3 + 5 * 8 * -4) - (9 - 8 * -7 + -9)) =
-    target: "Let's think step by step.\nLet\u2019s recall that the order of operations\
+    target: "Let\u2019s recall that the order of operations\
       \ in mathematics is as follows: (1) Parentheses, (2) exponents, (3) multiplication\
       \ and division (from left to right), (4) addition and multiplication (from left\
       \ to right). So, remember to always compute the expressions inside parentheses\
lm_eval/tasks/bbh/cot_fewshot/navigate.yaml
@@ -21,9 +21,7 @@ fewshot_config:
       - Yes
       - No'
-    target: 'Let''s think step by step.
-      We start at the origin (0, 0), facing the positive y-axis.
+    target: 'We start at the origin (0, 0), facing the positive y-axis.
       (1) Turn left: (0, 0), facing the negative x-axis.
@@ -49,9 +47,7 @@ fewshot_config:
       - Yes
       - No'
-    target: 'Let''s think step by step.
-      We start at the origin (0, 0), facing the positive y-axis.
+    target: 'We start at the origin (0, 0), facing the positive y-axis.
       (1) Turn around: (0, 0), facing the negative y-axis.
@@ -76,9 +72,7 @@ fewshot_config:
       - Yes
       - No'
-    target: 'Let''s think step by step.
-      We start at the origin (0, 0), facing the positive y-axis.
+    target: 'We start at the origin (0, 0), facing the positive y-axis.
       (1) Always face forward: (0, 0), facing the positive y-axis.
lm_eval/tasks/bbh/cot_fewshot/object_counting.yaml
View file @
b58e5556
...
@@ -14,9 +14,7 @@ fewshot_config:
...
@@ -14,9 +14,7 @@ fewshot_config:
samples
:
samples
:
-
input
:
I have a blackberry, a clarinet, a nectarine, a plum, a strawberry, a banana,
-
input
:
I have a blackberry, a clarinet, a nectarine, a plum, a strawberry, a banana,
a flute, an orange, and a violin. How many fruits do I have?
a flute, an orange, and a violin. How many fruits do I have?
target
:
'
Let'
'
s
think
step
by
step.
target
:
'
We
first
identify
the
fruits
on
the
list
and
include
their
quantity
in
parentheses:
We
first
identify
the
fruits
on
the
list
and
include
their
quantity
in
parentheses:
-
blackberry
(1)
-
blackberry
(1)
...
@@ -34,9 +32,7 @@ fewshot_config:
...
@@ -34,9 +32,7 @@ fewshot_config:
answer
is
6.'
answer
is
6.'
-
input
:
I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape,
-
input
:
I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape,
a nectarine, and three plums. How many fruits do I have?
a nectarine, and three plums. How many fruits do I have?
target
:
'
Let'
'
s
think
step
by
step.
target
:
'
We
first
identify
the
fruits
on
the
list
and
include
their
quantity
in
parentheses:
We
first
identify
the
fruits
on
the
list
and
include
their
quantity
in
parentheses:
-
orange
(1)
-
orange
(1)
@@ -58,9 +54,7 @@ fewshot_config:
       11. So the answer is 11.'
   - input: I have a lettuce head, a head of broccoli, an onion, a stalk of celery,
       two carrots, a garlic, and a yam. How many vegetables do I have?
-    target: 'Let''s think step by step.
-
-      We first identify the vegetables on the list and include their quantity in parentheses:
+    target: 'We first identify the vegetables on the list and include their quantity in parentheses:

       - lettuce (1)
lm_eval/tasks/bbh/cot_fewshot/penguins_in_a_table.yaml
@@ -32,9 +32,7 @@ fewshot_config:
       (D) 4

       (E) 5'
-    target: 'Let''s think step by step.
-
-      This question focuses on age. We know the following: Louis is 7 years old, Bernard
+    target: 'This question focuses on age. We know the following: Louis is 7 years old, Bernard
       is 5 years old, Vincent is 9 years old, and Gwen is 8 years old.

       Now, we add James to this table: James is 12 years old.
@@ -59,9 +57,7 @@ fewshot_config:
       (D) Gwen

       (E) James'
-    target: 'Let''s think step by step.
-
-      This question focuses on age. We know the following: Louis is 7 years old, Bernard
+    target: 'This question focuses on age. We know the following: Louis is 7 years old, Bernard
       is 5 years old, Vincent is 9 years old, and Gwen is 8 years old.

       According to the table, Bernard (5) is the youngest amongst them.
@@ -84,9 +80,7 @@ fewshot_config:
       (D) Gwen

       (E) James'
-    target: 'Let''s think step by step.
-
-      This question focuses on the name. We know the following: The names of the penguin
+    target: 'This question focuses on the name. We know the following: The names of the penguin
       in the table are Louis, Bernard, Vincent, and Gwen.

       When we sort their names alphabetically, we get Bernard, Gwen, Louis, Vincent.
lm_eval/tasks/bbh/cot_fewshot/reasoning_about_colored_objects.yaml
@@ -52,9 +52,7 @@ fewshot_config:
       (Q) purple

       (R) pink'
-    target: 'Let''s think step by step.
-
-      According to this question, the color of the stress ball is blue. So the answer
+    target: 'According to this question, the color of the stress ball is blue. So the answer
       is (E).'
   - input: 'On the table, you see a bunch of objects arranged in a row: a purple paperclip,
       a pink stress ball, a brown keychain, a green scrunchiephone charger, a mauve
@@ -98,9 +96,7 @@ fewshot_config:
       (Q) purple

       (R) pink'
-    target: 'Let''s think step by step.
-
-      According to this question, the objects are arranged in a row, from left to
+    target: 'According to this question, the objects are arranged in a row, from left to
       right, as follows: (1) a purple paperclip, (2) a pink stress ball, (3) a brown
       keychain, (4) a green scrunchiephone charger, (5) a mauve fidget spinner, (6)
       a burgundy pen.
@@ -129,9 +125,7 @@ fewshot_config:
       (F) five

       (G) six'
-    target: 'Let''s think step by step.
-
-      According to this question, the objects are arranged in a row, from left to
+    target: 'According to this question, the objects are arranged in a row, from left to
       right, as follows: (1) a teal plate, (2) a burgundy keychain, (3) a yellow scrunchiephone
       charger, (4) an orange mug, (5) a pink notebook, (6) a grey cup.
lm_eval/tasks/bbh/cot_fewshot/ruin_names.yaml
@@ -24,9 +24,7 @@ fewshot_config:
       (C) whitesnuake

       (D) mwhitesnake'
-    target: 'Let''s think step by step.
-
-      The original name is "whitesnake". This is the name of an old English hard rock
+    target: 'The original name is "whitesnake". This is the name of an old English hard rock
       band. It is a compound word, formed by the words "white" and "snake".

       (A) "whitesnape": It is formed by the combination of "white" and "snake"; therefore,
@@ -57,9 +55,7 @@ fewshot_config:
       (C) one of our dinosaurs is pissing

       (D) one of our dinosaur is missing'
-    target: 'Let''s think step by step.
-
-      The original name is "one of our dinosaurs is missing". This is the name of
+    target: 'The original name is "one of our dinosaurs is missing". This is the name of
       an old British movie.

       (A) "ofne of our dinosaurs is missing": Here "one of" is changed to "ofne",
@@ -91,9 +87,7 @@ fewshot_config:
       (C) courting crows

       (D) coutnting crows'
-    target: 'Let''s think step by step.
-
-      The original name is "counting crows". This is the name of an American rock
+    target: 'The original name is "counting crows". This is the name of an American rock
       band. Historically, the band name comes from the British nursery rhyme "One
       for Sorrow", which is about counting of magpies.
lm_eval/tasks/bbh/cot_fewshot/salient_translation_error_detection.yaml
@@ -42,9 +42,7 @@ fewshot_config:
       (E) Dropped Content

       (F) Facts'
-    target: 'Let''s think step by step.
-
-      We solve this question by first translating the source sentence to English and
+    target: 'We solve this question by first translating the source sentence to English and
       then by comparing our translation with the provided translation. According to
       Google Translate, the correct translation of the source sentence from German
       to English is "The list of monuments in Lenzen (Elbe) includes all the monuments
@@ -70,9 +68,7 @@ fewshot_config:
       \ am Lech.\nThe translation contains an error pertaining to\nOptions:\n(A) Modifiers\
       \ or Adjectives\n(B) Numerical Values\n(C) Negation or Antonyms\n(D) Named Entities\n\
       (E) Dropped Content\n(F) Facts"
-    target: 'Let''s think step by step.
-
-      We solve this question by first translating the source sentence to English and
+    target: 'We solve this question by first translating the source sentence to English and
       then by comparing our translation with the provided translation. According to
       Google Translate, the correct translation of the source sentence from German
       to English is "The monuments of the Upper Bavarian district town of Landsberg
@@ -98,7 +94,7 @@ fewshot_config:
       \ Voivodeship of Poland.\nThe translation contains an error pertaining to\n\
       Options:\n(A) Modifiers or Adjectives\n(B) Numerical Values\n(C) Negation or\
       \ Antonyms\n(D) Named Entities\n(E) Dropped Content\n(F) Facts"
-    target: "Let's think step by step.\nWe solve this question by first translating\
+    target: "We solve this question by first translating\
       \ the source sentence to English and then by comparing our translation with\
       \ the provided translation. According to Google Translate, the correct translation\
       \ of the source sentence from German to English is \"\u0141eba is a small town\
lm_eval/tasks/bbh/cot_fewshot/snarks.yaml
@@ -24,9 +24,7 @@ fewshot_config:
       (A) Yes, because having interests and actively researching them is a huge waste

       (B) Yes, because having interests and actively researching them is a huge deal'
-    target: 'Let''s think step by step.
-
-      If we look at (A), it says that having interests and actively researching them
+    target: 'If we look at (A), it says that having interests and actively researching them
       is a huge waste, implying that it is a useless effort. However, we know that
       having interests and actively researching them is typically not a waste but
       rather is beneficial to the individual. The presence of such a juxtaposition
@@ -46,9 +44,7 @@ fewshot_config:
       (B) No one is going to disagree with you on this. Ad hominem attacks really
       help your case'
-    target: 'Let''s think step by step.
-
-      If we look at (A), it says that avoiding ad hominem attacks really help your
+    target: 'If we look at (A), it says that avoiding ad hominem attacks really help your
       case, implying that ad hominem attacks are adverse and injurious. Because ad
       hominem attacks are adressed at a person rather than an idea, it is indeed true
       that avoiding them is often useful and helpful; so, (A) is a neutral (valid
@@ -70,9 +66,7 @@ fewshot_config:
       (B) Consistency in the league''s punishments? What do you think this is supposed
       to be, moral?'
-    target: 'Let''s think step by step.
-
-      If we look at (A), it likens the consistency in the league''s punishments with
+    target: 'If we look at (A), it likens the consistency in the league''s punishments with
       that in politics. Because politics or political affairs are often not considered
       to be consistent or dependable, this sentence appears to be satirical.