gaoqiong / lm-evaluation-harness — Commits

Commit e1ae8a2f, authored Nov 26, 2023 by Herbie Bradley
Merge remote-tracking branch 'origin/big-refactor' into calibration
Parents: 50e99bd7, 30936bc7

Changes: 1000 files listed; this excerpt shows 20 changed files with 193 additions and 20 deletions (+193 -20).
Changed files:

  lm_eval/tasks/code_x_glue/code-text/php.yaml                 +19   -0
  lm_eval/tasks/code_x_glue/code-text/python.yaml              +19   -0
  lm_eval/tasks/code_x_glue/code-text/ruby.yaml                +19   -0
  lm_eval/tasks/code_x_glue/code-text/utils.py                 +14   -0
  lm_eval/tasks/coqa/default.yaml                               +1   -1
  lm_eval/tasks/crows_pairs/README.md                           +1   -2
  lm_eval/tasks/csatqa/_default_csatqa_yaml                    +15   -0
  lm_eval/tasks/csatqa/_generate_configs.py                    +51   -0
  lm_eval/tasks/csatqa/csatqa_gr.yaml                           +3   -0
  lm_eval/tasks/csatqa/csatqa_li.yaml                           +3   -0
  lm_eval/tasks/csatqa/csatqa_rch.yaml                          +3   -0
  lm_eval/tasks/csatqa/csatqa_rcs.yaml                          +3   -0
  lm_eval/tasks/csatqa/csatqa_rcss.yaml                         +3   -0
  lm_eval/tasks/csatqa/csatqa_wr.yaml                           +3   -0
  lm_eval/tasks/csatqa/utils.py                                +20   -0
  lm_eval/tasks/drop/default.yaml                               +1   -1
  lm_eval/tasks/gsm8k/gsm8k-cot.yaml                            +5   -4
  lm_eval/tasks/gsm8k/gsm8k.yaml                                +9  -10
  lm_eval/tasks/hendrycks_ethics/utilitarianism_original_yaml   +0   -1
  lm_eval/tasks/logiqa2/logieval.yaml                           +1   -1

Too many changes to show. To preserve performance, only 1000 of 1000+ files are displayed.
lm_eval/tasks/code_x_glue/code-text/php.yaml (new file, mode 100644)

group:
  - codexglue_code2text
task: code2text_php
dataset_path: CM/codexglue_code2text_php
training_split: train
validation_split: validation
test_split: test
output_type: generate_until
generation_kwargs:
  num_beams: 10
  max_length: 128
  until:
    - "</s>"
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
metric_list:
  - metric: !function bleu.smoothed_bleu_4
    aggregation: mean
    higher_is_better: True
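The `until: ["</s>"]` entry above lists a stop sequence: generated text is cut at the first occurrence of the string. A minimal sketch of that truncation behavior, assuming a hypothetical helper (this is not harness API, just the idea the config encodes):

```python
def truncate_at_stop_sequences(text: str, until: list[str]) -> str:
    """Cut generated text at the earliest occurrence of any stop string."""
    for stop in until:
        # keep only the portion that precedes the stop string
        text = text.split(stop)[0]
    return text


# a model continuation that ran past the stop token
print(truncate_at_stop_sequences("Returns the sum of two numbers.</s>garbage", ["</s>"]))
# -> Returns the sum of two numbers.
```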
lm_eval/tasks/code_x_glue/code-text/python.yaml (new file, mode 100644)

group:
  - codexglue_code2text
task: code2text_python
dataset_path: CM/codexglue_code2text_python
training_split: train
validation_split: validation
test_split: test
output_type: generate_until
generation_kwargs:
  num_beams: 10
  max_length: 128
  until:
    - "</s>"
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
metric_list:
  - metric: !function bleu.smoothed_bleu_4
    aggregation: mean
    higher_is_better: True
lm_eval/tasks/code_x_glue/code-text/ruby.yaml (new file, mode 100644)

group:
  - codexglue_code2text
task: code2text_ruby
dataset_path: CM/codexglue_code2text_ruby
training_split: train
validation_split: validation
test_split: test
output_type: generate_until
generation_kwargs:
  num_beams: 10
  max_length: 128
  until:
    - "</s>"
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
metric_list:
  - metric: !function bleu.smoothed_bleu_4
    aggregation: mean
    higher_is_better: True
lm_eval/tasks/code_x_glue/code-text/utils.py (new file, mode 100644)

def doc_to_text(doc):
    inputs = " ".join(doc["code_tokens"]).replace("\n", " ")
    inputs = " ".join(inputs.strip().split())
    return inputs


def doc_to_target(doc):
    targets = " ".join(doc["docstring_tokens"]).replace("\n", "")
    targets = " ".join(targets.strip().split())
    return targets
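The two helpers above join the pre-tokenized fields back into a single string and collapse all whitespace. A quick check on a toy record (the field values are illustrative, not taken from the CodeXGLUE dataset):

```python
def doc_to_text(doc):
    # join code tokens, turn newline tokens into spaces, collapse runs of whitespace
    inputs = " ".join(doc["code_tokens"]).replace("\n", " ")
    inputs = " ".join(inputs.strip().split())
    return inputs


# illustrative CodeXGLUE-style record
doc = {"code_tokens": ["def", "add", "(", "a", ",", "b", ")", ":", "\n", "return", "a", "+", "b"]}
print(doc_to_text(doc))
# -> def add ( a , b ) : return a + b
```

Note the `"\n"` token becomes a space and the subsequent `split()`/`join` collapses the extra spaces, so the prompt is always a single line.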
lm_eval/tasks/coqa/default.yaml

 task: coqa
 dataset_path: EleutherAI/coqa
-output_type: greedy_until
+output_type: generate_until
 training_split: train
 validation_split: validation
 doc_to_text: !function utils.doc_to_text
 ...
lm_eval/tasks/crows_pairs/README.md

@@ -93,10 +93,9 @@ All tasks evaluate the percentage of more-stereotypical sentences that are rated
 * [x] Is the task an existing benchmark in the literature?
 * [x] Have you referenced the original paper that introduced the task?
 * [x] If yes, does the original paper provide a reference implementation?
 * [x] The original paper does not for causal language models, so
   this is a novel formulation of the task for autoregressive LMs.

 If other tasks on this dataset are already supported:
 * [x] Is the "Main" variant of this task clearly denoted?
 * [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
 * [x] Have you noted which, if any, published evaluation setups are matched by this variant?
+* [x] This matches the evaluations performed in the [Pythia paper](https://arxiv.org/abs/2304.01373)
lm_eval/tasks/csatqa/_default_csatqa_yaml (new file, mode 100644)

group: csatqa
dataset_path: EleutherAI/csatqa
test_split: test
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "{{question}}"
doc_to_choice: "{{choices}}"
doc_to_target: "{{gold}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
lm_eval/tasks/csatqa/_generate_configs.py (new file, mode 100644)

"""
Take in a YAML, and output all other splits with this YAML
"""
import os
import yaml
import argparse

from tqdm import tqdm

from lm_eval.logger import eval_logger

SUBSETS = ["WR", "GR", "RCS", "RCSS", "RCH", "LI"]


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_yaml_path", required=True)
    parser.add_argument("--save_prefix_path", default="csatqa")
    parser.add_argument("--task_prefix", default="")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()

    # get filename of base_yaml so we can `"include": ` it in our other YAMLs.
    base_yaml_name = os.path.split(args.base_yaml_path)[-1]
    with open(args.base_yaml_path) as f:
        base_yaml = yaml.full_load(f)

    for name in tqdm(SUBSETS):
        yaml_dict = {
            "include": base_yaml_name,
            "task": f"csatqa_{args.task_prefix}_{name}"
            if args.task_prefix != ""
            else f"csatqa_{name.lower()}",
            "dataset_name": name,
        }

        file_save_path = args.save_prefix_path + f"_{name.lower()}.yaml"
        eval_logger.info(f"Saving yaml for subset {name} to {file_save_path}")
        with open(file_save_path, "w") as yaml_file:
            yaml.dump(
                yaml_dict,
                yaml_file,
                width=float("inf"),
                allow_unicode=True,
                default_style='"',
            )
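The generator's task-naming rule (prefix + subset name when `--task_prefix` is set, lower-cased subset otherwise) can be checked in isolation. This sketch inlines just that conditional; the wrapper function is mine, not part of the script:

```python
def task_name(task_prefix: str, name: str) -> str:
    # mirrors the conditional expression used to build yaml_dict["task"]
    return (
        f"csatqa_{task_prefix}_{name}"
        if task_prefix != ""
        else f"csatqa_{name.lower()}"
    )


print(task_name("", "GR"))     # -> csatqa_gr
print(task_name("kor", "GR"))  # -> csatqa_kor_GR
```

Note that the prefixed branch keeps the subset upper-case while the unprefixed branch lower-cases it, exactly as in the script, which is why the generated files below use lower-case names like `csatqa_gr`.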
lm_eval/tasks/csatqa/csatqa_gr.yaml (new file, mode 100644)

"dataset_name": "GR"
"include": "_default_csatqa_yaml"
"task": "csatqa_gr"
lm_eval/tasks/csatqa/csatqa_li.yaml (new file, mode 100644)

"dataset_name": "LI"
"include": "_default_csatqa_yaml"
"task": "csatqa_li"
lm_eval/tasks/csatqa/csatqa_rch.yaml (new file, mode 100644)

"dataset_name": "RCH"
"include": "_default_csatqa_yaml"
"task": "csatqa_rch"
lm_eval/tasks/csatqa/csatqa_rcs.yaml (new file, mode 100644)

"dataset_name": "RCS"
"include": "_default_csatqa_yaml"
"task": "csatqa_rcs"
lm_eval/tasks/csatqa/csatqa_rcss.yaml (new file, mode 100644)

"dataset_name": "RCSS"
"include": "_default_csatqa_yaml"
"task": "csatqa_rcss"
lm_eval/tasks/csatqa/csatqa_wr.yaml (new file, mode 100644)

"dataset_name": "WR"
"include": "_default_csatqa_yaml"
"task": "csatqa_wr"
lm_eval/tasks/csatqa/utils.py (new file, mode 100644)

import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        instruction = f"""다음을 읽고 정답으로 알맞은 것을 고르시요.
### Context: {doc["context"]}
### Question: {doc["question"]}
### Options:
(1) {doc['option#1']}\n(2) {doc["option#2"]}\n(3) {doc["option#3"]}\n(4) {doc['option#4']}\n(5) {doc['option#5']}
### Answer: 주어진 문제의 정답은"""
        out_doc = {
            "question": instruction,
            "choices": ["(1)", "(2)", "(3)", "(4)", "(5)"],
            "gold": int(doc["gold"]) - 1,
        }
        return out_doc

    return dataset.map(_process_doc)
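The important detail in `_process_doc` is the index shift: the dataset stores the answer as a 1-based option number, while the `multiple_choice` output type expects a 0-based index into `choices`. A check of just that shift, on an illustrative record:

```python
# replicate the gold-index handling from _process_doc
choices = ["(1)", "(2)", "(3)", "(4)", "(5)"]


def gold_index(doc):
    # dataset answers are 1-based; multiple_choice scoring is 0-based
    return int(doc["gold"]) - 1


doc = {"gold": "3"}  # illustrative record; real docs also carry context/question/options
print(choices[gold_index(doc)])
# -> (3)
```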
lm_eval/tasks/drop/default.yaml

 task: drop
 dataset_path: EleutherAI/drop
-output_type: greedy_until
+output_type: generate_until
 training_split: train
 validation_split: validation
 process_docs: !function utils.process_docs
 ...
lm_eval/tasks/gsm8k/gsm8k-cot.yaml

@@ -3,7 +3,7 @@ group:
 task: gsm8k_cot
 dataset_path: gsm8k
 dataset_name: main
-output_type: greedy_until
+output_type: generate_until
 test_split: test
 doc_to_text: "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\n\nA: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\n\
 Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\n\nA: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\n\
 ...
@@ -14,8 +14,7 @@ Q: There were nine computers in the server room. Five more computers were instal
 Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\n\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\n\
 Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\n\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\n\
 Q: {{question}}\n\nA:"
-doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
+doc_to_target: "{{answer.split('### ')[-1].rstrip()}}"
+gold_alias: "{{answer.split('### ')[-1].rstrip()}}" # this post-processes the reference that we'll score against
 metric_list:
 - metric: exact_match
   aggregation: mean
 ...
@@ -25,6 +24,8 @@ metric_list:
 regexes_to_ignore:
 - ","
 - "\\$"
+- "(?s).*#### "
+- "\n\n"
 generation_kwargs:
   until:
   - "Q:"
 ...
@@ -37,5 +38,5 @@ filter_list:
 - name: "get-answer"
   filter:
   - function: "regex"
-    regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
+    regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
 - function: "take_first"
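The last hunk appends a `.` to the extraction regex. Because `.` matches any character and the character class already contains `\.`, the trailing `.` forces the greedy capture group to back off one character, so a sentence-ending period is no longer swept into the captured number. A quick check:

```python
import re

# extraction patterns before and after this commit's change
old = r"The answer is (\-?[0-9\.\,]+)"
new = r"The answer is (\-?[0-9\.\,]+)."

text = "So there must have been 21 - 15 = 6. The answer is 6."
print(re.search(old, text).group(1))  # -> 6.   (period captured by the greedy class)
print(re.search(new, text).group(1))  # -> 6    (period consumed by the final '.')
```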
lm_eval/tasks/gsm8k/gsm8k.yaml

 group:
 - math_word_problems
-task: gsm8k_yaml
+task: gsm8k
 dataset_path: gsm8k
 dataset_name: main
-output_type: greedy_until
+output_type: generate_until
 training_split: train
 fewshot_split: train
 test_split: test
 doc_to_text: "Question: {{question}}\nAnswer:"
 doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
+gold_alias: "{{answer.split('### ')[-1].rstrip()}}" # this post-processes the reference that we'll score against
 metric_list:
 - metric: exact_match
   aggregation: mean
 ...
@@ -19,7 +18,7 @@ metric_list:
 regexes_to_ignore:
 - ","
 - "\\$"
-- ".*### "
+- "(?s).*#### "
 generation_kwargs:
   until:
   - "\n\n"
 ...
@@ -28,9 +27,9 @@ generation_kwargs:
 temperature: 0.0
 repeats: 1
 num_fewshot: 5
-# filter_list:
-#   - name: "get-answer"
-#     filter:
-#       - function: "regex"
-#         regex_pattern: "### (\\-?[0-9\\.\\,]+)"
-#       - function: "take_first"
+filter_list:
+- name: "get-answer"
+  filter:
+  - function: "regex"
+    regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
+  - function: "take_first"
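The `regexes_to_ignore` change above swaps `.*### ` for `(?s).*#### `. The inline `(?s)` flag makes `.` match newlines as well, so the pattern erases the entire multi-line rationale up to the final `#### ` answer marker rather than only the marker's own line. Demonstrated on an illustrative GSM8K-style reference:

```python
import re

reference = "She has 5 apples.\nShe buys 3 more.\n#### 8"

# without (?s): '.' stops at newlines, so earlier reasoning lines survive
print(re.sub(r".*#### ", "", reference))
# -> She has 5 apples.
#    She buys 3 more.
#    8

# with (?s): '.' spans newlines, stripping everything before the marker
print(re.sub(r"(?s).*#### ", "", reference))
# -> 8
```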
lm_eval/tasks/hendrycks_ethics/utilitarianism_original_yaml

@@ -9,7 +9,6 @@
 # template_aliases: #"{% set answer_choices = range(1, 11)|list %}"
 # doc_to_text: 'Activity: "{{activity}}"\nRating:'
 # doc_to_target: "{{answer_choices[label]}}"
-# gold_alias: "{{label}}" # this will be cast to an int.
 # metric_list:
 # - metric: acc
 # TODO: we want this to be implemented as a winograd_schema task type, actually
lm_eval/tasks/logiqa2/logieval.yaml

 task: logieval
 dataset_path: baber/logiqa2
 dataset_name: logieval
-output_type: greedy_until
+output_type: generate_until
 training_split: train
 test_split: test
 # Instructions + {content}
 ...