gaoqiong / lm-evaluation-harness

Commit 79545adb, authored Jun 11, 2023 by Benjamin Fattori

Merge remote-tracking branch 'upstream/big-refactor' into seq2seq-refactor

Parents: eb7b9095, 761f0087
Changes: 64 files in total; this page shows 20 changed files with 159 additions and 66 deletions (+159 -66).
lm_eval/tasks/arc.py  +1 -1
lm_eval/tasks/gsm8k.py  +3 -3
lm_eval/tasks/gsm8k/README.md  +32 -0
lm_eval/tasks/gsm8k/cot-gsm8k.yaml  +0 -48
lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml  +32 -0
lm_eval/tasks/gsm8k/gsm8k-cot.yaml  +42 -0
lm_eval/tasks/gsm8k/gsm8k.yaml  +35 -0
lm_eval/tasks/lambada.py  +1 -1
lm_eval/tasks/lambada/README.md  +2 -2
lm_eval/tasks/pile.py  +1 -1
lm_eval/tasks/pile/README.md  +1 -1
lm_eval/tasks/pile/pile_arxiv.yaml  +1 -1
lm_eval/tasks/pile/pile_bookcorpus2.yaml  +1 -1
lm_eval/tasks/pile/pile_books3.yaml  +1 -1
lm_eval/tasks/pile/pile_dm-mathematics.yaml  +1 -1
lm_eval/tasks/pile/pile_europarl.yaml  +1 -1
lm_eval/tasks/pile/pile_freelaw.yaml  +1 -1
lm_eval/tasks/pile/pile_github.yaml  +1 -1
lm_eval/tasks/pile/pile_gutenberg.yaml  +1 -1
lm_eval/tasks/pile/pile_hackernews.yaml  +1 -1
lm_eval/tasks/arc.py

@@ -16,7 +16,7 @@ from lm_eval import utils
 from lm_eval.prompts import get_prompt
 from lm_eval.api.task import MultipleChoiceTask
-from lm_eval.api.register import register_task, register_group
+from lm_eval.api.registry import register_task, register_group

 _CITATION = """
 @article{Clark2018ThinkYH,
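The lm_eval.api.register → lm_eval.api.registry rename above recurs in several files below. For orientation, a minimal sketch of the decorator-registry pattern that import suggests; names and bodies here are illustrative assumptions, not the harness's actual implementation:

```python
# Sketch of a decorator-based task registry (illustrative only; the real
# lm_eval.api.registry module may differ).
TASK_REGISTRY: dict[str, type] = {}

def register_task(name: str):
    """Register a task class under `name` so the harness can look it up."""
    def decorate(cls: type) -> type:
        TASK_REGISTRY[name] = cls
        return cls
    return decorate

@register_task("arc_easy")  # hypothetical task name
class ARCEasy:
    pass

assert TASK_REGISTRY["arc_easy"] is ARCEasy
```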
lm_eval/tasks/gsm8k.py

@@ -24,7 +24,7 @@ from lm_eval.api.instance import Instance
 from lm_eval.prompts import get_prompt
-from lm_eval.api.register import register_task, register_group
+from lm_eval.api.registry import register_task, register_group

 _CITATION = """
 @misc{cobbe2021training,

@@ -92,7 +92,7 @@ class GradeSchoolMath8K(Task):
         return Instance(
             request_type=self.OUTPUT_TYPE,
             doc=doc,
-            arguments=(ctx, ["\n"]),
+            arguments=(ctx, ["\n\n"]),
             idx=0,
             **kwargs
         )

@@ -113,7 +113,7 @@ class GradeSchoolMath8K(Task):
         assert gold != INVALID_ANS, "No ground truth answer found in the document."
         # return self._extract_answer(completion) == gold
         # print(completion)
-        return completion == gold
+        return self._extract_answer(completion) == gold

     def process_results(self, doc, results):
         """Take a single document and the LM results and evaluates, returning a
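The last hunk stops comparing the raw completion against the gold string and extracts the final number first. The diff does not show `_extract_answer` itself; a plausible sketch, assuming the standard GSM8K `#### ` answer marker (it matches the commented-out regex in the YAML configs below):

```python
import re

# GSM8K marks gold answers with "#### <number>"; this pattern mirrors the
# commented regex in cot-gsm8k.yaml. The helper body is an assumption.
ANS_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
INVALID_ANS = "[invalid]"

def _extract_answer(completion: str) -> str:
    """Pull the final numeric answer out of a GSM8K-style completion."""
    match = ANS_RE.search(completion)
    if match is None:
        return INVALID_ANS
    # Drop thousands separators so "1,234" and "1234" compare equal.
    return match.group(1).strip().replace(",", "")

assert _extract_answer("6 + 2 = 8\n#### 8") == "8"
```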
lm_eval/tasks/gsm8k/README.md (new file, mode 100644)
# GSM8k
## Paper
Training Verifiers to Solve Math Word Problems
https://arxiv.org/abs/2110.14168
State-of-the-art language models can match human performance on many tasks, but
they still struggle to robustly perform multi-step mathematical reasoning. To
diagnose the failures of current models and support research, we introduce GSM8K,
a dataset of 8.5K high quality linguistically diverse grade school math word problems.
We find that even the largest transformer models fail to achieve high test performance,
despite the conceptual simplicity of this problem distribution.
NOTE: See the official implementation of the task:
https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py
for how to make use of the dataset's calculator annotations in your language
model's sample/generation function.
Homepage: https://github.com/openai/grade-school-math
## Citation
```
@misc{cobbe2021training,
title={Training Verifiers to Solve Math Word Problems},
author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
year={2021},
eprint={2110.14168},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
lm_eval/tasks/gsm8k/cot-gsm8k.yaml (deleted, was mode 100644)
# "Training Verifiers to Solve Math Word Problems"
# https://arxiv.org/abs/2110.14168
# State-of-the-art language models can match human performance on many tasks, but
# they still struggle to robustly perform multi-step mathematical reasoning. To
# diagnose the failures of current models and support research, we introduce GSM8K,
# a dataset of 8.5K high quality linguistically diverse grade school math word problems.
# We find that even the largest transformer models fail to achieve high test performance,
# despite the conceptual simplicity of this problem distribution.
# NOTE: See the official implementation of the task:
# https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py
# for how to make use of the dataset's calculator annotations in your language
# model's sample/generation function.
# Homepage: https://github.com/openai/grade-school-math
# _CITATION = """
# @misc{cobbe2021training,
# title={Training Verifiers to Solve Math Word Problems},
# author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
# year={2021},
# eprint={2110.14168},
# archivePrefix={arXiv},
# primaryClass={cs.LG}
# }
# """
task: gsm8k_yaml
dataset_path: gsm8k
dataset_name: main
training_split: train
test_split: test
use_prompt: "qa-basic:question-newline-answer"
doc_to_target: "{{answer.split('### ')[-1]}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
delimiter: "\n"
repeats: 4
# filter_list:
#   - name: "get-answer"
#     filter:
#       - function: "regex"
#         regex_pattern: "#### (\-?[0-9\.\,]+)"
lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml (new file, mode 100644)

include: gsm8k-cot.yaml
group:
  - chain_of_thought
  - self_consistency
task: gsm8k_cot_self_consistency
generation_kwargs:
  until:
    - "Q:"
    - "\n\n"
  do_sample: true
  temperature: 0.2
repeats: 8
filter_list:
  - name: "score-first" # pick only the first response, and report metrics on that
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "take_first"
  - name: "maj@64"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
  - name: "maj@8" # get maj@8 by selecting the first 8 responses. Using a better estimator would be optimal.
    filter:
      - function: "take_first_k"
        k: 8
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
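The three filter pipelines above implement self-consistency scoring: extract the final number from each sampled completion, then report either the first sample's answer or a majority vote. A rough sketch of what maj@k computes, under the assumption that `majority_vote` picks the most frequent extracted answer:

```python
import re
from collections import Counter

# Same extraction pattern as the config's regex filter.
ANSWER_RE = re.compile(r"The answer is (\-?[0-9\.\,]*[0-9]+)")

def maj_at_k(samples: list[str], k: int) -> str | None:
    """Majority vote over answers extracted from the first k samples."""
    answers = [m.group(1) for s in samples[:k]
               if (m := ANSWER_RE.search(s)) is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

samples = ["The answer is 42.", "The answer is 41.", "The answer is 42."]
assert maj_at_k(samples, k=8) == "42"
```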
lm_eval/tasks/gsm8k/gsm8k-cot.yaml (new file, mode 100644)

group:
  - chain_of_thought
task: gsm8k_cot
dataset_path: gsm8k
dataset_name: main
output_type: greedy_until
test_split: test
doc_to_text: "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\n\nA: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\n\
  Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\n\nA: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\n\
  Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\n\nA: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.\n\n\
  Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\n\nA: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.\n\n\
  Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\n\nA: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.\n\n\
  Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\n\nA: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.\n\n\
  Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\n\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\n\
  Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\n\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\n\
  Q: {{question}}\n\nA:"
doc_to_target: "{{answer}}" # " {{answer.split('### ')[-1].rstrip()}}"
gold_alias: "{{answer.split('### ')[-1].rstrip()}}" # this post-processes the reference that we'll score against
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
delimiter: "\n\n"
generation_kwargs:
  until:
    - "Q:"
    - "\n\n"
  do_sample: false
  temperature: 0.0
repeats: 1
num_fewshot: 0
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
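The exact_match options above (ignore_case: true plus regexes_to_ignore for commas and dollar signs) imply a normalization step before strings are compared. A sketch of that behaviour, as an assumption about what the metric options do rather than the harness's exact code:

```python
import re

def normalize(text: str, regexes_to_ignore=(",", r"\$"),
              ignore_case: bool = True) -> str:
    """Strip ignored patterns, then case-fold, before exact_match comparison."""
    for pattern in regexes_to_ignore:
        text = re.sub(pattern, "", text)
    return text.lower() if ignore_case else text

# "$1,234" and "1234" compare equal once normalized.
assert normalize("$1,234") == normalize("1234")
```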
lm_eval/tasks/gsm8k/gsm8k.yaml (new file, mode 100644)

task: gsm8k_yaml
dataset_path: gsm8k
dataset_name: main
output_type: greedy_until
training_split: train
fewshot_split: train
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}" # " {{answer.split('### ')[-1].rstrip()}}"
gold_alias: "{{answer.split('### ')[-1].rstrip()}}" # this post-processes the reference that we'll score against
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
      - ".*### "
delimiter: "\n\n"
generation_kwargs:
  until:
    - "\n\n"
    - "Question:"
  do_sample: false
  temperature: 0.0
repeats: 2
num_fewshot: 5
# filter_list:
#   - name: "get-answer"
#     filter:
#       - function: "regex"
#         regex_pattern: "### (\\-?[0-9\\.\\,]+)"
#       - function: "take_first"
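gold_alias post-processes the reference with a Jinja expression before scoring. A small demonstration of what that template yields on a GSM8K-style target (requires jinja2; the sample answer string is made up):

```python
from jinja2 import Template

# GSM8K targets end with "#### <number>"; splitting on "### " and taking
# the last piece leaves just the number. The answer text is illustrative.
answer = "She sold 48 / 2 = 24 clips in May.\n#### 72"
template = Template("{{ answer.split('### ')[-1].rstrip() }}")
assert template.render(answer=answer) == "72"
```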
lm_eval/tasks/lambada.py

@@ -16,7 +16,7 @@ from lm_eval.api.task import Task
 from lm_eval.api.instance import Instance
 from lm_eval.api.metrics import mean, perplexity
-from lm_eval.api.register import register_task, register_group
+from lm_eval.api.registry import register_task, register_group

 _CITATION = """
 @misc{
lm_eval/tasks/lambada/README.md

 # LAMBADA

 ### Paper
 The LAMBADA dataset: Word prediction requiring a broad discourse context
 https://arxiv.org/pdf/1606.06031.pdf

 LAMBADA is a dataset to evaluate the capabilities of computational models for text
 ...

@@ -23,4 +23,4 @@ Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
     publisher={Zenodo},
     year={2016},
     month={Aug}
-}
\ No newline at end of file
+}
lm_eval/tasks/pile.py

@@ -12,7 +12,7 @@ Homepage: https://pile.eleuther.ai/
 from lm_eval.api.task import PerplexityTask
-from lm_eval.api.register import register_task, register_group
+from lm_eval.api.registry import register_task, register_group

 _CITATION = """
 @article{pile,
lm_eval/tasks/pile/README.md

@@ -20,4 +20,4 @@ Homepage: https://pile.eleuther.ai/
     journal={arXiv preprint arXiv:2101.00027},
     year={2020}
 }
-```
\ No newline at end of file
+```
lm_eval/tasks/pile/pile_arxiv.yaml
lm_eval/tasks/pile/pile_bookcorpus2.yaml
lm_eval/tasks/pile/pile_books3.yaml
lm_eval/tasks/pile/pile_dm-mathematics.yaml
lm_eval/tasks/pile/pile_europarl.yaml
lm_eval/tasks/pile/pile_freelaw.yaml
lm_eval/tasks/pile/pile_github.yaml
lm_eval/tasks/pile/pile_gutenberg.yaml
lm_eval/tasks/pile/pile_hackernews.yaml

Each of these nine task configs receives the same one-line change, adding a trailing newline at end of file:

@@ -19,4 +19,4 @@ metric_list:
     higher_is_better: false
 - metric: bits_per_byte
   aggregation: bits_per_byte
-  higher_is_better: false
\ No newline at end of file
+  higher_is_better: false
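These configs all score perplexity-style metrics where lower is better. As a reference point, bits per byte is conventionally the corpus negative log-likelihood converted from nats to bits, divided by the byte count; a sketch of that standard definition (not taken from the harness source):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a corpus NLL (in nats) to bits, per UTF-8 byte of text."""
    return total_nll_nats / (math.log(2) * num_bytes)

# e.g. 1.0 nat of total loss over 2 bytes -> ~0.72 bits per byte
print(bits_per_byte(1.0, 2))
```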