gaoqiong / lm-evaluation-harness · Commits

Commit 76e517d1, authored Dec 18, 2024 by Baber
add longbench
Parent: a8601618

Changes: 35. This page shows 15 changed files, with 285 additions and 0 deletions.
lm_eval/tasks/longbench/passage_count_e.yaml (+19, -0)
lm_eval/tasks/longbench/passage_retrieval_en.yaml (+19, -0)
lm_eval/tasks/longbench/passage_retrieval_en_e.yaml (+19, -0)
lm_eval/tasks/longbench/passage_retrieval_zh.yaml (+19, -0)
lm_eval/tasks/longbench/qasper.yaml (+19, -0)
lm_eval/tasks/longbench/qasper_e.yaml (+19, -0)
lm_eval/tasks/longbench/qmsum.yaml (+19, -0)
lm_eval/tasks/longbench/repobench-p.yaml (+19, -0)
lm_eval/tasks/longbench/repobench-p_e.yaml (+19, -0)
lm_eval/tasks/longbench/samsum.yaml (+19, -0)
lm_eval/tasks/longbench/samsum_e.yaml (+19, -0)
lm_eval/tasks/longbench/trec.yaml (+19, -0)
lm_eval/tasks/longbench/trec_e.yaml (+19, -0)
lm_eval/tasks/longbench/triviaqa.yaml (+19, -0)
lm_eval/tasks/longbench/triviaqa_e.yaml (+19, -0)
lm_eval/tasks/longbench/passage_count_e.yaml (new file, mode 100644)

tag:
  - longbench_e
task: longbench_passage_count_e
dataset_path: THUDM/LongBench
test_split: test
dataset_name: passage_count_e
doc_to_text: 'There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{{context}}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: '
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 32
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.count_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
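The `!function metrics.count_score` reference resolves into `lm_eval/tasks/longbench/metrics.py`, which is among this commit's 35 changed files but is not shown on this page. As a rough guide to what such a scorer can look like, here is a minimal sketch modeled on LongBench's official scoring; the function body below is an assumption, not the code in this diff:

```python
import re

def count_score(prediction: str, ground_truth: str) -> float:
    """Hypothetical sketch: of all integers the model emitted,
    what fraction equals the gold paragraph count?"""
    numbers = re.findall(r"\d+", prediction)
    if not numbers:
        return 0.0
    right = sum(1 for n in numbers if n == str(ground_truth))
    return right / len(numbers)
```

Scoring the fraction of emitted numbers that match (rather than exact match on the whole string) is lenient toward models that restate the question before answering.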
lm_eval/tasks/longbench/passage_retrieval_en.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_passage_retrieval_en
dataset_path: THUDM/LongBench
test_split: test
dataset_name: passage_retrieval_en
doc_to_text: 'Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n{{context}}\n\nThe following is an abstract.\n\n{{input}}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like "Paragraph 1", "Paragraph 2", etc.\n\nThe answer is: '
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 32
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.retrieval_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
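`metrics.retrieval_score` is likewise defined in the `metrics.py` file not shown on this page. Since the gold target for this task has the form "Paragraph N", a plausible sketch (an assumption modeled on LongBench's official scoring, not the code in this diff) is:

```python
import re

def retrieval_score(prediction: str, ground_truth: str) -> float:
    """Hypothetical sketch: pull the gold paragraph id from a
    "Paragraph N" target, then score the fraction of integers in the
    model output that equal it."""
    match = re.search(r"Paragraph (\d+)", ground_truth)
    if match is None:
        return 0.0
    gold_id = match.group(1)
    numbers = re.findall(r"\d+", prediction)
    if not numbers:
        return 0.0
    return sum(1 for n in numbers if n == gold_id) / len(numbers)
```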
lm_eval/tasks/longbench/passage_retrieval_en_e.yaml (new file, mode 100644)

tag:
  - longbench_e
task: longbench_passage_retrieval_en_e
dataset_path: THUDM/LongBench
test_split: test
dataset_name: passage_retrieval_en_e
doc_to_text: 'Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n{{context}}\n\nThe following is an abstract.\n\n{{input}}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like "Paragraph 1", "Paragraph 2", etc.\n\nThe answer is: '
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 32
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.retrieval_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
lm_eval/tasks/longbench/passage_retrieval_zh.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_passage_retrieval_zh
dataset_path: THUDM/LongBench
test_split: test
dataset_name: passage_retrieval_zh
doc_to_text: '以下是若干段落文字,以及其中一个段落的摘要。请确定给定的摘要出自哪一段。\n\n{{context}}\n\n下面是一个摘要\n\n{{input}}\n\n请输入摘要所属段落的编号。答案格式必须是"段落1","段落2"等格式\n\n答案是:'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 32
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.retrieval_zh_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
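For the Chinese variant the gold target has the form "段落N" ("Paragraph N"), so `metrics.retrieval_zh_score` presumably differs from the English scorer only in the pattern it matches. A hypothetical sketch, under the same assumptions as above:

```python
import re

def retrieval_zh_score(prediction: str, ground_truth: str) -> float:
    """Hypothetical sketch: like retrieval_score, but the gold target
    is written "段落N" ("Paragraph N")."""
    match = re.search(r"段落(\d+)", ground_truth)
    if match is None:
        return 0.0
    gold_id = match.group(1)
    numbers = re.findall(r"\d+", prediction)
    if not numbers:
        return 0.0
    return sum(1 for n in numbers if n == gold_id) / len(numbers)
```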
lm_eval/tasks/longbench/qasper.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_qasper
dataset_path: THUDM/LongBench
test_split: test
dataset_name: qasper
doc_to_text: 'You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {{context}}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 128
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.qa_f1_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
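`metrics.qa_f1_score` is a standard extractive-QA metric: token-level F1 between normalized prediction and gold answer, in the style of SQuAD scoring. A self-contained sketch of that style (the normalization details in the actual metrics.py may differ):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze spaces."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def qa_f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```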
lm_eval/tasks/longbench/qasper_e.yaml (new file, mode 100644)

tag:
  - longbench_e
task: longbench_qasper_e
dataset_path: THUDM/LongBench
test_split: test
dataset_name: qasper_e
doc_to_text: 'You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {{context}}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {{input}}\n\nAnswer:'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 128
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.qa_f1_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
lm_eval/tasks/longbench/qmsum.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_qmsum
dataset_path: THUDM/LongBench
test_split: test
dataset_name: qmsum
doc_to_text: 'You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n{{context}}\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {{input}}\nAnswer:'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 512
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.rouge_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
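`metrics.rouge_score` presumably computes a ROUGE score between generated and reference summaries; implementations typically delegate to a ROUGE library. For orientation, here is a stdlib-only approximation of ROUGE-L F1 via longest common subsequence (an illustration, not the code in this diff):

```python
def rouge_l_f1(prediction: str, ground_truth: str) -> float:
    """ROUGE-L F1 on whitespace tokens, computed from the longest
    common subsequence of the two token sequences."""
    a, b = prediction.split(), ground_truth.split()
    if not a or not b:
        return 0.0
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)
```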
lm_eval/tasks/longbench/repobench-p.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_repobench-p
dataset_path: THUDM/LongBench
test_split: test
dataset_name: repobench-p
doc_to_text: 'Please complete the code given below. \n{{context}}{{input}}Next line of code:\n'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 64
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.code_sim_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
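`metrics.code_sim_score` scores a next-line code completion, which calls for fuzzy string similarity rather than exact match. A minimal sketch using stdlib `difflib` (the real scorer may use a dedicated fuzzy-matching library, and its comment-skipping rules are assumptions here):

```python
import difflib

def code_sim_score(prediction: str, ground_truth: str) -> float:
    """Hypothetical sketch: compare the first code-looking line of the
    generation against the gold next line of code."""
    candidate = ""
    for line in prediction.splitlines():
        stripped = line.strip()
        # skip blank lines and obvious comment lines
        if stripped and not stripped.startswith(("#", "//")):
            candidate = stripped
            break
    return difflib.SequenceMatcher(None, candidate, ground_truth.strip()).ratio()
```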
lm_eval/tasks/longbench/repobench-p_e.yaml (new file, mode 100644)

tag:
  - longbench_e
task: longbench_repobench-p_e
dataset_path: THUDM/LongBench
test_split: test
dataset_name: repobench-p_e
doc_to_text: 'Please complete the code given below. \n{{context}}{{input}}Next line of code:\n'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 64
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.code_sim_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
lm_eval/tasks/longbench/samsum.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_samsum
dataset_path: THUDM/LongBench
test_split: test
dataset_name: samsum
doc_to_text: 'Summarize the dialogue into a few short sentences. The following are some examples.\n\n{{context}}\n\n{{input}}'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 128
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.rouge_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
lm_eval/tasks/longbench/samsum_e.yaml (new file, mode 100644)

tag:
  - longbench_e
task: longbench_samsum_e
dataset_path: THUDM/LongBench
test_split: test
dataset_name: samsum_e
doc_to_text: 'Summarize the dialogue into a few short sentences. The following are some examples.\n\n{{context}}\n\n{{input}}'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 128
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.rouge_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
lm_eval/tasks/longbench/trec.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_trec
dataset_path: THUDM/LongBench
test_split: test
dataset_name: trec
doc_to_text: 'Please determine the type of the question below. Here are some examples of questions.\n\n{{context}}\n{{input}}'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 64
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.classification_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
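`metrics.classification_score` checks a free-form generation against a gold class label. The simplest form of such a scorer is a case-insensitive containment check, sketched below; note this is only an illustration, and LongBench's official scorer is more careful (it disambiguates against the full label set so that overlapping labels don't both match):

```python
def classification_score(prediction: str, ground_truth: str) -> float:
    """Hypothetical sketch: credit the model when the gold class label
    appears anywhere in its output, case-insensitively."""
    return float(ground_truth.strip().lower() in prediction.strip().lower())
```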
lm_eval/tasks/longbench/trec_e.yaml (new file, mode 100644)

tag:
  - longbench_e
task: longbench_trec_e
dataset_path: THUDM/LongBench
test_split: test
dataset_name: trec_e
doc_to_text: 'Please determine the type of the question below. Here are some examples of questions.\n\n{{context}}\n{{input}}'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 64
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.classification_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
lm_eval/tasks/longbench/triviaqa.yaml (new file, mode 100644)

tag:
  - longbench
task: longbench_triviaqa
dataset_path: THUDM/LongBench
test_split: test
dataset_name: triviaqa
doc_to_text: 'Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{{context}}\n\n{{input}}'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 32
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.qa_f1_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0
lm_eval/tasks/longbench/triviaqa_e.yaml (new file, mode 100644)

tag:
  - longbench_e
task: longbench_triviaqa_e
dataset_path: THUDM/LongBench
test_split: test
dataset_name: triviaqa_e
doc_to_text: 'Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{{context}}\n\n{{input}}'
doc_to_target: '{{answers}}'
generation_kwargs:
  max_gen_toks: 32
  temperature: 1
  do_sample: False
metric_list:
  - metric: !function metrics.qa_f1_score
    aggregation: mean
    higher_is_better: True
metadata:
  version: 1.0