gaoqiong / lm-evaluation-harness · Commit 470059f6
Authored Nov 24, 2023 by lintangsutawika
merge conflict
Parents: b8d7d6c3, 9d030712
Showing 20 changed files with 117 additions and 0 deletions (+117 −0)
lm_eval/tasks/bbh/flan_cot_fewshot/web_of_lies.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_fewshot/word_sorting.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/_flan_cot_zeroshot_template_yaml (+22 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/boolean_expressions.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/causal_judgement.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/date_understanding.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/disambiguation_qa.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/dyck_languages.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/formal_fallacies.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/geometric_shapes.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/hyperbaton.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_five_objects.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_seven_objects.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_three_objects.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/movie_recommendation.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/multistep_arithmetic_two.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/navigate.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/object_counting.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/penguins_in_a_table.yaml (+5 −0)
lm_eval/tasks/bbh/flan_cot_zeroshot/reasoning_about_colored_objects.yaml (+5 −0)
Too many changes to show: to preserve performance, only 1000 of 1000+ files are displayed.
lm_eval/tasks/bbh/flan_cot_fewshot/web_of_lies.yaml (new file, mode 100644)
"dataset_name": "web_of_lies"
"description": "Evaluate a random boolean function expressed as a word problem.\n\n"
"doc_to_text": "Question: Fidel tells the truth. Jerry says Fidel tells the truth. Vina says Jerry tells the truth. Millicent says Vina lies. Raymond says Millicent lies. Does Raymond tell the truth?\nA: Let's think step by step.\n(1) Fidel tells the truth. So, we know that Fidel tells the truth.\n(2) Jerry says Fidel tells the truth. Since we know from (1) that Fidel tells the truth, if Jerry says that Fidel tells the truth, then Jerry tells the truth.\n(3) Vina says Jerry tells the truth. Since we know from (2) that Jerry tells the truth, if Vina says Jerry tells the truth, then Vine tells the truth.\n(4) Millicent says Vina lies. Since we know from (3) that Vina tells the truth, if Millicent says Vina lies, then Millicent lies.\n(5) Raymond says Millicent lies. Since we know from (4) that Millicent lies, if Raymond says Millicent lies, then Raymond tells the truth.\nNow, the question asks: Does Raymond tell the truth? We know from (5) that Raymond tells the truth. So the answer is Yes. Question: Kristian lies. Millie says Kristian lies. Maybelle says Millie tells the truth. Fidel says Maybelle lies. Leda says Fidel lies. Does Leda tell the truth?\nA: Let's think step by step.\n(1) Kristian lies. So, we know that Kristian lies.\n(2) Millie says Kristian lies. Since we know from (1) that Kristian lies, if Millie says Kristian lies, then Millie tells the truth.\n(3) Maybelle says Millie tells the truth. Since we know from (2) that Millie tells the truth, if Maybelle says Millie tells the truth, then Maybelle tells the truth.\n(4) Fidel says Maybelle lies. Since we know from (3) that Maybelle tells the truth, if Fidel says Maybelle lies, then Fidel lies.\n(5) Leda says Fidel lies. Since we know from (4) that Fidel lies, if Leda says Fidel lies, then Leda tells the truth.\nNow, the question asks: Does Leda tell the truth? We know from (5) that Leda tells the truth. So the answer is Yes. Question: Kristian tells the truth. Michaela says Kristian lies. Raymond says Michaela tells the truth. Osvaldo says Raymond tells the truth. Jamey says Osvaldo tells the truth. Does Jamey tell the truth?\nA: Let's think step by step.\n(1) Kristian tells the truth. So, we know that Kristian tells the truth.\n(2) Michaela says Kristian lies. Since we know from (1) that Kristian tells the truth, if Michaela says Kristian lies, then Michaela lies.\n(3) Raymond says Michaela tells the truth. Since we know from (2) that Michaela lies, if Raymond says Michaela tells the truth, then Raymond lies.\n(4) Osvaldo says Raymond tells the truth. Since we know from (3) that Raymond lies, if Osvaldo says Raymond tells the truth, then Osvaldo lies.\n(5) Jamey says Osvaldo tells the truth. Since we know from (4) that Osvaldo lies, if Jamey says Osvaldo tells the truth, then Jamey lies.\nNow, the question asks: Does Jamey tell the truth? We know from (5) that Jamey lies. So the answer is No.Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_fewshot_template_yaml"
"task": "bbh_flan_cot_fewshot_web_of_lies"
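Each few-shot walkthrough above propagates truthfulness down a chain: every later speaker comments on the one before, and a speaker is truthful exactly when their claim matches reality. A minimal illustrative sketch of that logic (the helper `tells_truth` is not part of the harness; it only mirrors the reasoning in the prompts):

```python
def tells_truth(first_is_truthful, claims):
    """Propagate truthfulness down a chain of speakers.

    claims[i] is True if speaker i+1 says the previous speaker tells
    the truth, False if they say the previous speaker lies.
    """
    truthful = first_is_truthful
    for says_truthful in claims:
        # A speaker is truthful exactly when their claim matches reality.
        truthful = (truthful == says_truthful)
    return truthful

# First example: Fidel tells the truth; Jerry says truth, Vina says truth,
# Millicent says lies, Raymond says lies.
print(tells_truth(True, [True, True, False, False]))  # True  -> "Yes"

# Third example: Kristian tells the truth; Michaela says lies, then
# Raymond, Osvaldo, and Jamey each vouch for the previous speaker.
print(tells_truth(True, [False, True, True, True]))   # False -> "No"
```

The same helper reproduces the second example too: Leda comes out truthful, matching the prompt's "Yes".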
lm_eval/tasks/bbh/flan_cot_fewshot/word_sorting.yaml (new file, mode 100644)
"dataset_name": "word_sorting"
"description": "Sort a list of words.\n\n"
"doc_to_text": "Sort the following words alphabetically: List: oven costume counterpart\nA: Let's think step by step.\nThe first letter: \"oven\": \"o\" (15). \"costume\": \"c\" (3). \"counterpart\": \"c\" (3). We now have: (3) [\"costume\" ? \"counterpart\"] < (15) \"oven\". Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their second letters.\nThe second letter: \"costume\": \"o\" (15). \"counterpart\": \"o\" (15). We now have: (15) [\"costume\" ? \"counterpart\"]. Now let's sort this subpart [\"costume\" ? \"counterpart\"] by looking at their third letters.\nThe third letter: \"costume\": \"s\" (19). \"counterpart\": \"u\" (21). We now have: (19) \"costume\" < (21) \"counterpart\". Hence, we have [\"costume\" < \"counterpart\"] < \"oven\". So the answer is costume counterpart oven. Sort the following words alphabetically: List: hypochlorite ponderosa phone credulity\nA: Let's think step by step.\nThe first letter: \"hypochlorite\": \"h\" (8). \"ponderosa\": \"p\" (16). \"phone\": \"p\" (16). \"credulity\": \"c\" (3). We now have: (3) \"credulity\" < (8) \"hypochlorite\" < (16) [\"ponderosa\" ? \"phone\"]. Now let's sort this subpart [\"ponderosa\" ? \"phone\"] by looking at their second letters.\nThe second letter: \"ponderosa\": \"o\" (15). \"phone\": \"h\" (8). We now have: (8) \"phone\" < (15) \"ponderosa\". Hence, we have \"credulity\" < \"hypochlorite\" < [\"phone\" < \"ponderosa\"]. So the answer is credulity hypochlorite phone ponderosa. Sort the following words alphabetically: List: newt arson parthia seismography mugho aspect census\nA: Let's think step by step.\nThe first letter: \"newt\": \"n\" (14). \"arson\": \"a\" (1). \"parthia\": \"p\" (16). \"seismography\": \"s\" (19). \"mugho\": \"m\" (13). \"aspect\": \"a\" (1). \"census\": \"c\" (3). We now have: (1) [\"arson\" ? \"aspect\"] < (3) \"census\" < (13) \"mugho\" < (14) \"newt\" < (16) \"parthia\" < (19) \"seismography\". Now let's sort this subpart [\"arson\" ? \"aspect\"] by looking at their second letters.\nThe second letter: \"arson\": \"r\" (18). \"aspect\": \"s\" (19). We now have: (18) \"arson\" < (19) \"aspect\". Hence, we have [\"arson\" < \"aspect\"] < \"census\" < \"mugho\" < \"newt\" < \"parthia\" < \"seismography\". So the answer is arson aspect census mugho newt parthia seismography.Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_fewshot_template_yaml"
"task": "bbh_flan_cot_fewshot_word_sorting"
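The walkthroughs in this prompt spell out ordinary lexicographic comparison, annotating each letter with its 1-based position in the alphabet (e.g. "o" is 15). A short illustrative sketch, not part of the harness, showing that a plain built-in sort reproduces each example's answer:

```python
def alphabet_position(letter):
    # 1-based position used in the prompt's annotations: "o" -> 15, "c" -> 3
    return ord(letter.lower()) - ord("a") + 1

print(alphabet_position("o"), alphabet_position("c"))  # 15 3

# Letter-by-letter comparison as described in the walkthrough is just
# lexicographic order, so sorted() gives the same result.
print(" ".join(sorted(["oven", "costume", "counterpart"])))
# costume counterpart oven
print(" ".join(sorted(["hypochlorite", "ponderosa", "phone", "credulity"])))
# credulity hypochlorite phone ponderosa
```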
lm_eval/tasks/bbh/flan_cot_zeroshot/_flan_cot_zeroshot_template_yaml (new file, mode 100644)
group: bbh_flan_cot_zeroshot
dataset_path: lukaemon/bbh
output_type: generate_until
test_split: test
doc_to_target: "{{target}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    # ignore_case: true
    # ignore_punctuation: true
generation_kwargs:
  until:
    - "</s>"
  do_sample: false
  temperature: 0.0
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
      - function: "take_first"
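The "get-answer" filter extracts whatever follows one of the "The answer is" / "the answer is" / "The answer:" / "The final answer:" cues; the unescaped `(?=.)` lookahead forces the greedy `(.*)` to back off by one character, which in practice strips the trailing period. A quick illustrative check of the pattern with plain `re`, outside the harness:

```python
import re

# regex_pattern copied from the template above
pattern = (
    r"((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)"
    r"|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
)

completion = "We know from (5) that Raymond tells the truth. So the answer is Yes."
match = re.search(pattern, completion)  # "take_first" keeps the first match
print(match.group(0))  # Yes
```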
lm_eval/tasks/bbh/flan_cot_zeroshot/boolean_expressions.yaml (new file, mode 100644)
"dataset_name": "boolean_expressions"
"description": "Evaluate the result of a random Boolean expression.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_boolean_expressions"
lm_eval/tasks/bbh/flan_cot_zeroshot/causal_judgement.yaml (new file, mode 100644)
"dataset_name": "causal_judgement"
"description": "Answer questions about causal attribution.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_causal_judgement"
lm_eval/tasks/bbh/flan_cot_zeroshot/date_understanding.yaml (new file, mode 100644)
"dataset_name": "date_understanding"
"description": "Infer the date from context.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_date_understanding"
lm_eval/tasks/bbh/flan_cot_zeroshot/disambiguation_qa.yaml (new file, mode 100644)
"dataset_name": "disambiguation_qa"
"description": "Clarify the meaning of sentences with ambiguous pronouns.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_disambiguation_qa"
lm_eval/tasks/bbh/flan_cot_zeroshot/dyck_languages.yaml (new file, mode 100644)
"dataset_name": "dyck_languages"
"description": "Correctly close a Dyck-n word.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_dyck_languages"
lm_eval/tasks/bbh/flan_cot_zeroshot/formal_fallacies.yaml (new file, mode 100644)
"dataset_name": "formal_fallacies"
"description": "Distinguish deductively valid arguments from formal fallacies.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_formal_fallacies"
lm_eval/tasks/bbh/flan_cot_zeroshot/geometric_shapes.yaml (new file, mode 100644)
"dataset_name": "geometric_shapes"
"description": "Name geometric shapes from their SVG paths.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_geometric_shapes"
lm_eval/tasks/bbh/flan_cot_zeroshot/hyperbaton.yaml (new file, mode 100644)
"dataset_name": "hyperbaton"
"description": "Order adjectives correctly in English sentences.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_hyperbaton"
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_five_objects.yaml (new file, mode 100644)
"dataset_name": "logical_deduction_five_objects"
"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_logical_deduction_five_objects"
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_seven_objects.yaml (new file, mode 100644)
"dataset_name": "logical_deduction_seven_objects"
"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_logical_deduction_seven_objects"
lm_eval/tasks/bbh/flan_cot_zeroshot/logical_deduction_three_objects.yaml (new file, mode 100644)
"dataset_name": "logical_deduction_three_objects"
"description": "A logical deduction task which requires deducing the order of a sequence of objects.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_logical_deduction_three_objects"
lm_eval/tasks/bbh/flan_cot_zeroshot/movie_recommendation.yaml (new file, mode 100644)
"dataset_name": "movie_recommendation"
"description": "Recommend movies similar to the given list of movies.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_movie_recommendation"
lm_eval/tasks/bbh/flan_cot_zeroshot/multistep_arithmetic_two.yaml (new file, mode 100644)
"dataset_name": "multistep_arithmetic_two"
"description": "Solve multi-step arithmetic problems.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_multistep_arithmetic_two"
lm_eval/tasks/bbh/flan_cot_zeroshot/navigate.yaml (new file, mode 100644)
"dataset_name": "navigate"
"description": "Given a series of navigation instructions, determine whether one would end up back at the starting point.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_navigate"
lm_eval/tasks/bbh/flan_cot_zeroshot/object_counting.yaml (new file, mode 100644)
"dataset_name": "object_counting"
"description": "Questions that involve enumerating objects and asking the model to count them.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_object_counting"
lm_eval/tasks/bbh/flan_cot_zeroshot/penguins_in_a_table.yaml (new file, mode 100644)
"dataset_name": "penguins_in_a_table"
"description": "Answer questions about a table of penguins and their attributes.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_penguins_in_a_table"
lm_eval/tasks/bbh/flan_cot_zeroshot/reasoning_about_colored_objects.yaml (new file, mode 100644)
"dataset_name": "reasoning_about_colored_objects"
"description": "Answer extremely simple questions about the colors of objects on a surface.\n\n"
"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
"include": "_flan_cot_zeroshot_template_yaml"
"task": "bbh_flan_cot_zeroshot_reasoning_about_colored_objects"