gaoqiong / lm-evaluation-harness
Commit b2e1bfc6, authored Apr 22, 2025 by artemorloff
Merge remote-tracking branch 'origin' into feature/eval_from_config
Parents: b5d16d61, e4a7b69f
Changes: 48
Showing 20 changed files with 63 additions and 42 deletions
lm_eval/tasks/longbench/multi_news.yaml  +3 -2
lm_eval/tasks/longbench/multi_news_e.yaml  +3 -2
lm_eval/tasks/longbench/multifieldqa_en.yaml  +3 -2
lm_eval/tasks/longbench/multifieldqa_en_e.yaml  +3 -2
lm_eval/tasks/longbench/multifieldqa_zh.yaml  +3 -2
lm_eval/tasks/longbench/musique.yaml  +3 -2
lm_eval/tasks/longbench/narrativeqa.yaml  +4 -3
lm_eval/tasks/longbench/passage_count.yaml  +3 -2
lm_eval/tasks/longbench/passage_count_e.yaml  +3 -2
lm_eval/tasks/longbench/passage_retrieval_en.yaml  +3 -2
lm_eval/tasks/longbench/passage_retrieval_en_e.yaml  +3 -2
lm_eval/tasks/longbench/passage_retrieval_zh.yaml  +3 -2
lm_eval/tasks/longbench/qasper.yaml  +3 -2
lm_eval/tasks/longbench/qasper_e.yaml  +3 -2
lm_eval/tasks/longbench/qmsum.yaml  +3 -2
lm_eval/tasks/longbench/repobench-p.yaml  +3 -2
lm_eval/tasks/longbench/repobench-p_e.yaml  +3 -2
lm_eval/tasks/longbench/samsum.yaml  +3 -2
lm_eval/tasks/longbench/samsum_e.yaml  +3 -2
lm_eval/tasks/longbench/trec.yaml  +5 -3
lm_eval/tasks/longbench/multi_news.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multi_news
 doc_to_text: 'You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.rouge_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
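Note on the doc_to_target change: the following is a minimal sketch, not the harness's own code, of why '{{answers}}' and '{{answers[0]}}' can yield different targets. It assumes that `answers` is a list of reference strings and that a template engine renders a whole list through its string form. The sample record and both helper functions are hypothetical, and any post-processing the harness applies to the rendered string is not modeled.

```python
# Hypothetical record shaped like a dataset row whose `answers` field is a list.
doc = {"answers": ["Paris", "City of Paris"]}


def render_whole_list(record):
    # Stand-in for '{{answers}}': the list is turned into its string form.
    return str(record["answers"])


def render_first_answer(record):
    # Stand-in for '{{answers[0]}}': only the first reference string is used.
    return record["answers"][0]


print(render_whole_list(doc))    # bracketed, quoted list text
print(render_first_answer(doc))  # a single plain answer string
```

Under these assumptions the first form embeds brackets and quotes in the target text, while the second yields one clean reference string.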
lm_eval/tasks/longbench/multi_news_e.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multi_news_e
 doc_to_text: 'You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{{context}}\n\nNow, write a one-page summary of all the news.\n\nSummary:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.rouge_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/multifieldqa_en.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_en
 doc_to_text: 'Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.qa_f1_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/multifieldqa_en_e.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_en_e
 doc_to_text: 'Read the following text and answer briefly.\n\n{{context}}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.qa_f1_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/multifieldqa_zh.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: multifieldqa_zh
 doc_to_text: '阅读以下文字并用中文简短回答:\n\n{{context}}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{{input}}\n回答:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.qa_f1_zh_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/musique.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: musique
 doc_to_text: 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{{context}}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {{input}}\nAnswer:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.qa_f1_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/narrativeqa.yaml
@@ -5,15 +5,16 @@ task: longbench_narrativeqa
 dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: narrativeqa
-doc_to_text: 'You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {{context}}\n\nNow, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
-doc_to_target: '{{answers}}'
+doc_to_text: 'You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {{context}}\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.qa_f1_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/passage_count.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: passage_count
 doc_to_text: 'There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{{context}}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: '
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.count_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/passage_count_e.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: passage_count_e
 doc_to_text: 'There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{{context}}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: '
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.count_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/passage_retrieval_en.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: passage_retrieval_en
 doc_to_text: 'Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n{{context}}\n\nThe following is an abstract.\n\n{{input}}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like "Paragraph 1", "Paragraph 2", etc.\n\nThe answer is: '
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.retrieval_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/passage_retrieval_en_e.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: passage_retrieval_en_e
 doc_to_text: 'Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n{{context}}\n\nThe following is an abstract.\n\n{{input}}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like "Paragraph 1", "Paragraph 2", etc.\n\nThe answer is: '
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.retrieval_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/passage_retrieval_zh.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: passage_retrieval_zh
 doc_to_text: '以下是若干段落文字,以及其中一个段落的摘要。请确定给定的摘要出自哪一段。\n\n{{context}}\n\n下面是一个摘要\n\n{{input}}\n\n请输入摘要所属段落的编号。答案格式必须是"段落1","段落2"等格式\n\n答案是:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 32
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.retrieval_zh_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/qasper.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: qasper
 doc_to_text: 'You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {{context}}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.qa_f1_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/qasper_e.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: qasper_e
 doc_to_text: 'You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {{context}}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.qa_f1_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/qmsum.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: qmsum
 doc_to_text: 'You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n{{context}}\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {{input}}\nAnswer:'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 512
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.rouge_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/repobench-p.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: repobench-p
 doc_to_text: 'Please complete the code given below. \n{{context}}{{input}}Next line of code:\n'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.code_sim_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/repobench-p_e.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: repobench-p_e
 doc_to_text: 'Please complete the code given below. \n{{context}}{{input}}Next line of code:\n'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
   do_sample: True
+  until: []
 metric_list:
   - metric: !function metrics.code_sim_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/samsum.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: samsum
 doc_to_text: 'Summarize the dialogue into a few short sentences. The following are some examples.\n\n{{context}}\n\n{{input}}'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
   do_sample: True
+  until: ['\n']
 metric_list:
   - metric: !function metrics.rouge_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/samsum_e.yaml
@@ -6,14 +6,15 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: samsum_e
 doc_to_text: 'Summarize the dialogue into a few short sentences. The following are some examples.\n\n{{context}}\n\n{{input}}'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
 generation_kwargs:
   max_gen_toks: 128
   temperature: 1
   do_sample: True
+  until: ['\n']
 metric_list:
   - metric: !function metrics.rouge_score
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
lm_eval/tasks/longbench/trec.yaml
@@ -6,14 +6,16 @@ dataset_path: THUDM/LongBench
 test_split: test
 dataset_name: trec
 doc_to_text: 'Please determine the type of the question below. Here are some examples of questions.\n\n{{context}}\n{{input}}'
-doc_to_target: '{{answers}}'
+doc_to_target: '{{answers[0]}}'
+process_results: !function metrics.classification_score
 generation_kwargs:
   max_gen_toks: 64
   temperature: 1
   do_sample: True
+  until: ['\n']
 metric_list:
-  - metric: !function metrics.classification_score
+  - metric: "classification_score"
     aggregation: mean
     higher_is_better: True
 metadata:
-  version: 1.0
+  version: 2.0
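Note on the until entries: the files above set `until` either to an empty list or to ['\n']. As a rough illustration of what a list of stop strings generally does to generated text, here is a small sketch. It is not the harness's implementation; `truncate_at_stop` is a hypothetical helper, and it assumes the common convention that output is cut at the earliest occurrence of any stop string.

```python
def truncate_at_stop(text, until):
    """Cut `text` at the earliest occurrence of any stop string in `until`."""
    cut = len(text)
    for stop in until:
        if not stop:
            continue  # ignore empty stop strings
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


generated = "Line one.\nLine two."
print(truncate_at_stop(generated, []))      # no stop strings: text is unchanged
print(truncate_at_stop(generated, ["\n"]))  # cut at the first newline
```

Under that assumption, an empty list leaves multi-line outputs such as summaries intact, while ['\n'] keeps only the first line, which suits single-line answers like those in samsum and trec.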