gaoqiong / lm-evaluation-harness
Commit a2af2101 (Unverified)
Authored Jul 12, 2024 by Yen-Ting Lin; committed by GitHub, Jul 12, 2024
Merge branch 'EleutherAI:main' into main
Parents: 82cb25c1, d5f39bf8
Changes: 1000
Showing 20 changed files with 118 additions and 66 deletions
lm_eval/tasks/glue/sst2/default.yaml (+1 -1)
lm_eval/tasks/glue/wnli/default.yaml (+1 -1)
lm_eval/tasks/gpqa/README.md (+6 -2)
lm_eval/tasks/gpqa/cot_n_shot/_gpqa_cot_n_shot_yaml (+1 -1)
lm_eval/tasks/gpqa/cot_zeroshot/_gpqa_cot_zeroshot_yaml (+1 -1)
lm_eval/tasks/gpqa/generative/_gpqa_generative_n_shot_yaml (+1 -1)
lm_eval/tasks/gpqa/n_shot/_gpqa_n_shot_yaml (+1 -1)
lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml (+1 -1)
lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml (+1 -1)
lm_eval/tasks/gsm8k/gsm8k-cot-zeroshot.yaml (+1 -1)
lm_eval/tasks/gsm8k/gsm8k-cot.yaml (+78 -46)
lm_eval/tasks/gsm8k/gsm8k.yaml (+1 -1)
lm_eval/tasks/haerae/_default_haerae_yaml (+0 -1)
lm_eval/tasks/haerae/_haerae.yaml (+16 -0)
lm_eval/tasks/headqa/headqa_en.yaml (+1 -2)
lm_eval/tasks/hellaswag/hellaswag.yaml (+3 -1)
lm_eval/tasks/hendrycks_ethics/commonsense.yaml (+1 -1)
lm_eval/tasks/hendrycks_ethics/justice.yaml (+1 -1)
lm_eval/tasks/hendrycks_ethics/utilitarianism.yaml (+1 -1)
lm_eval/tasks/hendrycks_ethics/virtue.yaml (+1 -1)
Too many changes to show. To preserve performance, only 1000 of 1000+ files are displayed.
lm_eval/tasks/glue/sst2/default.yaml
-group: glue
+tag: glue
 task: sst2
 dataset_path: glue
 dataset_name: sst2
 ...
lm_eval/tasks/glue/wnli/default.yaml
-group: glue
+tag: glue
 task: wnli
 dataset_path: glue
 dataset_name: wnli
 ...
lm_eval/tasks/gpqa/README.md
 ...
@@ -25,11 +25,15 @@ Homepage: `https://github.com/idavidrein/gpqa/tree/main`
 This dataset is gated, so you will have to accept the terms of use at https://huggingface.co/datasets/Idavidrein/gpqa and login via `huggingface-cli login` using your HF Hub token before running this task.
 
-### Groups and Tasks
+### Groups, Tags, and Tasks
 
 #### Groups
 
-* `gpqa`
+None
+
+#### Tags
+
+* `gpqa`: runs all GPQA variants.
 
 #### Tasks
 ...
lm_eval/tasks/gpqa/cot_n_shot/_gpqa_cot_n_shot_yaml
 dataset_path: Idavidrein/gpqa
-group: gpqa
+tag: gpqa
 output_type: generate_until
 process_docs: !function utils.process_docs
 training_split: train
 ...
lm_eval/tasks/gpqa/cot_zeroshot/_gpqa_cot_zeroshot_yaml
 dataset_path: Idavidrein/gpqa
-group: gpqa
+tag: gpqa
 output_type: generate_until
 process_docs: !function utils.process_docs
 training_split: train
 ...
lm_eval/tasks/gpqa/generative/_gpqa_generative_n_shot_yaml
 dataset_path: Idavidrein/gpqa
-group: gpqa
+tag: gpqa
 output_type: generate_until
 process_docs: !function utils.process_docs
 training_split: train
 ...
lm_eval/tasks/gpqa/n_shot/_gpqa_n_shot_yaml
 dataset_path: Idavidrein/gpqa
-group: gpqa
+tag: gpqa
 output_type: multiple_choice
 process_docs: !function utils.process_docs
 training_split: train
 ...
lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml
 dataset_path: Idavidrein/gpqa
-group: gpqa
+tag: gpqa
 output_type: multiple_choice
 process_docs: !function utils.process_docs
 training_split: train
 ...
lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml
 include: gsm8k-cot.yaml
-group:
+tag:
   - chain_of_thought
   - self_consistency
 task: gsm8k_cot_self_consistency
 ...
lm_eval/tasks/gsm8k/gsm8k-cot-zeroshot.yaml
-group:
+tag:
   - math_word_problems
 task: gsm8k_cot_zeroshot
 dataset_path: gsm8k
 ...
lm_eval/tasks/gsm8k/gsm8k-cot.yaml
-group:
-  - chain_of_thought
-task: gsm8k_cot
-dataset_path: gsm8k
 dataset_name: main
-output_type: generate_until
-test_split: test
-doc_to_text: "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nA: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\n\
-Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nA: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\n\
-Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nA: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.\n\n\
-Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nA: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.\n\n\
-Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\nA: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.\n\n\
-Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\nA: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.\n\n\
-Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\n\
-Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\n\
-Q: {{question}}\nA:"
-doc_to_target: "{{answer.split('####')[-1].strip()}}"
-metric_list:
-  - metric: exact_match
-    aggregation: mean
-    higher_is_better: true
-    ignore_case: true
-    ignore_punctuation: false
-    regexes_to_ignore:
-      - ","
-      - "\\$"
-      - "(?s).*#### "
-      - "\\.$"
+dataset_path: gsm8k
+doc_to_target: '{{answer.split(''####'')[-1].strip() if answer is defined else target}}'
+doc_to_text: 'Q: {{question}}
+
+  A:'
+fewshot_config:
+  sampler: first_n
+  samples:
+  - question: There are 15 trees in the grove. Grove workers will plant trees in the
+      grove today. After they are done, there will be 21 trees. How many trees did
+      the grove workers plant today?
+    target: There are 15 trees originally. Then there were 21 trees after some more
+      were planted. So there must have been 21 - 15 = 6. The answer is 6.
+  - question: If there are 3 cars in the parking lot and 2 more cars arrive, how many
+      cars are in the parking lot?
+    target: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer
+      is 5.
+  - question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many
+      pieces do they have left in total?
+    target: Originally, Leah had 32 chocolates. Her sister had 42. So in total they
+      had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.
+  - question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has
+      12 lollipops. How many lollipops did Jason give to Denny?
+    target: Jason started with 20 lollipops. Then he had 12 after giving some to Denny.
+      So he gave Denny 20 - 12 = 8. The answer is 8.
+  - question: Shawn has five toys. For Christmas, he got two toys each from his mom and
+      dad. How many toys does he have now?
+    target: Shawn started with 5 toys. If he got 2 toys each from his mom and dad,
+      then that is 4 more toys. 5 + 4 = 9. The answer is 9.
+  - question: There were nine computers in the server room. Five more computers were
+      installed each day, from monday to thursday. How many computers are now in the
+      server room?
+    target: There were originally 9 computers. For each of 4 days, 5 more computers
+      were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is
+      29.
+  - question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday,
+      he lost 2 more. How many golf balls did he have at the end of wednesday?
+    target: Michael started with 58 golf balls. After losing 23 on tuesday, he had
+      58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer
+      is 33.
+  - question: Olivia has $23. She bought five bagels for $3 each. How much money does
+      she have left?
+    target: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 =
+      15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.
+filter_list:
+- filter:
+  - function: regex
+    regex_pattern: The answer is (\-?[0-9\.\,]+).
+  - function: take_first
+  name: strict-match
+- filter:
+  - function: regex
+    group_select: -1
+    regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+)
+  - function: take_first
+  name: flexible-extract
 generation_kwargs:
-  until:
-    - "Q:"
-    - "</s>"
-    - "<|im_end|>"
   do_sample: false
-repeats: 1
-num_fewshot: 0
-filter_list:
-  - name: "strict-match"
-    filter:
-      - function: "regex"
-        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
-      - function: "take_first"
-  - name: "flexible-extract"
-    filter:
-      - function: "regex"
-        group_select: -1
-        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
-      - function: "take_first"
+  until:
+  - 'Q:'
+  - </s>
+  - <|im_end|>
+tag:
+- chain_of_thought
 metadata:
   version: 3.0
-  num_fewshot: 8
+metric_list:
+- aggregation: mean
+  higher_is_better: true
+  ignore_case: true
+  ignore_punctuation: false
+  metric: exact_match
+  regexes_to_ignore:
+  - ','
+  - \$
+  - '(?s).*#### '
+  - \.$
+num_fewshot: 8
+output_type: generate_until
+repeats: 1
+task: gsm8k_cot
+test_split: test
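The `filter_list` patterns and the `regexes_to_ignore` entries in `gsm8k-cot.yaml` are easier to follow on a concrete completion. The following is a minimal sketch using only Python's standard `re` module, not the harness's own implementation. The helper names (`strict_match`, `flexible_extract`, `normalize`) and the `[invalid]` fallback string are assumptions made for illustration, and `group_select: -1` is read here as "use the last match".

```python
import re

# Patterns copied from the filter_list entries in gsm8k-cot.yaml.
STRICT = re.compile(r"The answer is (\-?[0-9\.\,]+).")
FLEXIBLE = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")

# Patterns copied from regexes_to_ignore in metric_list.
IGNORE = [",", r"\$", r"(?s).*#### ", r"\.$"]


def strict_match(text, fallback="[invalid]"):
    """Return the number that follows 'The answer is', or the fallback."""
    m = STRICT.search(text)
    return m.group(1) if m else fallback


def flexible_extract(text, fallback="[invalid]"):
    """Return the last number-like token in the text (assumed group_select: -1)."""
    matches = FLEXIBLE.findall(text)  # list of (group1, group2) tuples
    if not matches:
        return fallback
    # Exactly one of the two alternation groups is non-empty per match.
    return next(g for g in matches[-1] if g)


def normalize(text):
    """Strip the ignored patterns, then lowercase (ignore_case: true)."""
    for pattern in IGNORE:
        text = re.sub(pattern, "", text)
    return text.lower()


completion = (
    "Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 "
    "dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8."
)
print(strict_match(completion))
print(flexible_extract(completion))
print(normalize(flexible_extract(completion)))
```

On this completion the flexible pattern keeps the trailing period of the final sentence, which is why the `\.$` entry in `regexes_to_ignore` matters before the exact-match comparison.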
lm_eval/tasks/gsm8k/gsm8k.yaml
-group:
+tag:
   - math_word_problems
 task: gsm8k
 dataset_path: gsm8k
 ...
lm_eval/tasks/haerae/_default_haerae_yaml
-group: haerae
 dataset_path: HAERAE-HUB/HAE_RAE_BENCH
 test_split: test
 fewshot_split: test
 ...
lm_eval/tasks/haerae/_haerae.yaml
0 → 100644
+group: haerae
+task:
+  - haerae_gk
+  - haerae_hi
+  - haerae_lw
+  - haerae_rw
+  - haerae_sn
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: true
+  - metric: acc_norm
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
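The new `_haerae.yaml` group asks for a `mean` aggregation with `weight_by_size: true`. As a rough illustration of what a size-weighted mean over subtasks looks like, here is a small Python sketch. It assumes that `weight_by_size` weights each subtask's score by its number of examples; the function name and the scores and sizes used below are hypothetical and are not taken from the harness or from any real run.

```python
def size_weighted_mean(scores, sizes):
    """Average per-subtask scores, weighting each by its example count."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one subtask must be non-empty")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical numbers: two subtasks with 100 and 300 examples.
print(size_weighted_mean([0.5, 1.0], [100, 300]))
# An unweighted mean of the same scores would be 0.75 instead.
```

With weighting, larger subtasks contribute more to the group score, so the result matches the accuracy computed over all examples pooled together.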
lm_eval/tasks/headqa/headqa_en.yaml
-group:
-  - headqa
+tag: headqa
 task: headqa_en
 dataset_path: EleutherAI/headqa
 dataset_name: en
 ...
lm_eval/tasks/hellaswag/hellaswag.yaml
-group:
+tag:
   - multiple_choice
 task: hellaswag
 dataset_path: hellaswag
 ...
@@ -20,3 +20,5 @@ metric_list:
     higher_is_better: true
 metadata:
   version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
lm_eval/tasks/hendrycks_ethics/commonsense.yaml
-group:
+tag:
   - hendrycks_ethics
 task: ethics_cm
 dataset_path: EleutherAI/hendrycks_ethics
 ...
lm_eval/tasks/hendrycks_ethics/justice.yaml
 include: deontology.yaml
-group:
+tag:
   - hendrycks_ethics
 task: ethics_justice
 dataset_name: justice
 ...
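Several of the configs in this commit (`justice.yaml`, `utilitarianism.yaml`, `virtue.yaml`, `gsm8k-cot-self-consistency.yaml`) start with an `include:` key and then override a few fields. The sketch below shows one plausible way such inheritance can be resolved: load the base config, then let the including file's keys win. It works on in-memory dictionaries rather than real files, the function name `resolve_include` is hypothetical, and the base config's contents are placeholders, since `deontology.yaml` is not shown in this diff.

```python
def resolve_include(config, registry):
    """Merge a config with the config it includes; the child's keys win."""
    child = dict(config)
    base_name = child.pop("include", None)
    if base_name is None:
        return child
    # Resolve the base first so chained includes also work.
    merged = resolve_include(registry[base_name], registry)
    merged.update(child)
    return merged


# Placeholder base config: these values are NOT the real deontology.yaml.
registry = {
    "deontology.yaml": {
        "task": "base_task",
        "dataset_name": "base_name",
        "shared_key": "inherited",
    },
}

# Keys visible in the justice.yaml diff.
justice = {
    "include": "deontology.yaml",
    "tag": ["hendrycks_ethics"],
    "task": "ethics_justice",
    "dataset_name": "justice",
}

print(resolve_include(justice, registry))
```

Keys that the child does not set (here the placeholder `shared_key`) come through from the base unchanged, which is why the including files can stay short.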
lm_eval/tasks/hendrycks_ethics/utilitarianism.yaml
 include: commonsense.yaml
-group:
+tag:
   - hendrycks_ethics
 task: ethics_utilitarianism
 dataset_name: utilitarianism
 ...
lm_eval/tasks/hendrycks_ethics/virtue.yaml
 include: commonsense.yaml
-group:
+tag:
   - hendrycks_ethics
 task: ethics_virtue
 dataset_name: virtue
 ...