gaoqiong / lm-evaluation-harness

Commit 9f518392, authored Nov 28, 2023 by lintangsutawika

resolved merge conflict

Parents: 37ccb191, bf26d979

Showing 20 changed files with 74 additions and 34 deletions (+74 −34)
lm_eval/tasks/bbh/zeroshot/navigate.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/object_counting.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/penguins_in_a_table.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/reasoning_about_colored_objects.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/ruin_names.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/salient_translation_error_detection.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/snarks.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/sports_understanding.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/temporal_sequences.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/tracking_shuffled_objects_five_objects.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/tracking_shuffled_objects_seven_objects.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/tracking_shuffled_objects_three_objects.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/web_of_lies.yaml  +2 −2
lm_eval/tasks/bbh/zeroshot/word_sorting.yaml  +2 −2
lm_eval/tasks/bigbench/generate_until_template_yaml  +2 −2
lm_eval/tasks/bigbench/multiple_choice_template_yaml  +2 −2
lm_eval/tasks/minerva_math/utils.py  +1 −1
lm_eval/tasks/realtoxicityprompts/metric.py  +1 −1
lm_eval/tasks/scrolls/README.md  +31 −0
lm_eval/tasks/scrolls/scrolls.yaml  +9 −0
lm_eval/tasks/bbh/flan_zeroshot/navigate.yaml → lm_eval/tasks/bbh/zeroshot/navigate.yaml

 "dataset_name": "navigate"
 "description": "Given a series of navigation instructions, determine whether one would end up back at the starting point.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_navigate"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_navigate"
lm_eval/tasks/bbh/flan_zeroshot/object_counting.yaml → lm_eval/tasks/bbh/zeroshot/object_counting.yaml

 "dataset_name": "object_counting"
 "description": "Questions that involve enumerating objects and asking the model to count them.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_object_counting"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_object_counting"
lm_eval/tasks/bbh/flan_zeroshot/penguins_in_a_table.yaml → lm_eval/tasks/bbh/zeroshot/penguins_in_a_table.yaml

 "dataset_name": "penguins_in_a_table"
 "description": "Answer questions about a table of penguins and their attributes.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_penguins_in_a_table"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_penguins_in_a_table"
lm_eval/tasks/bbh/flan_zeroshot/reasoning_about_colored_objects.yaml → lm_eval/tasks/bbh/zeroshot/reasoning_about_colored_objects.yaml

 "dataset_name": "reasoning_about_colored_objects"
 "description": "Answer extremely simple questions about the colors of objects on a surface.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_reasoning_about_colored_objects"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_reasoning_about_colored_objects"
lm_eval/tasks/bbh/flan_zeroshot/ruin_names.yaml → lm_eval/tasks/bbh/zeroshot/ruin_names.yaml

 "dataset_name": "ruin_names"
 "description": "Select the humorous edit that 'ruins' the input movie or musical artist name.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_ruin_names"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_ruin_names"
lm_eval/tasks/bbh/flan_zeroshot/salient_translation_error_detection.yaml → lm_eval/tasks/bbh/zeroshot/salient_translation_error_detection.yaml

 "dataset_name": "salient_translation_error_detection"
 "description": "Detect the type of error in an English translation of a German source sentence.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_salient_translation_error_detection"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_salient_translation_error_detection"
lm_eval/tasks/bbh/flan_zeroshot/snarks.yaml → lm_eval/tasks/bbh/zeroshot/snarks.yaml

 "dataset_name": "snarks"
 "description": "Determine which of two sentences is sarcastic.\n\nAccording to Cambridge University Dictionary, sarcasm is \"the use of remarks that clearly mean the opposite of what they say, made in order to hurt someone's feelings or to criticize something in a humorous way.\" Sarcastic sentences often contain satirical or ironic utterances, hyperboles, ambivalent or witty remarks.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_snarks"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_snarks"
lm_eval/tasks/bbh/flan_zeroshot/sports_understanding.yaml → lm_eval/tasks/bbh/zeroshot/sports_understanding.yaml

 "dataset_name": "sports_understanding"
 "description": "Determine whether an artificially constructed sentence relating to sports is plausible or not.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_sports_understanding"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_sports_understanding"
lm_eval/tasks/bbh/flan_zeroshot/temporal_sequences.yaml → lm_eval/tasks/bbh/zeroshot/temporal_sequences.yaml

 "dataset_name": "temporal_sequences"
 "description": "Task description: Answer questions about which times certain events could have occurred.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_temporal_sequences"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_temporal_sequences"
lm_eval/tasks/bbh/flan_zeroshot/tracking_shuffled_objects_five_objects.yaml → lm_eval/tasks/bbh/zeroshot/tracking_shuffled_objects_five_objects.yaml

 "dataset_name": "tracking_shuffled_objects_five_objects"
 "description": "A task requiring determining the final positions of a set of objects given their initial positions and a description of a sequence of swaps.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_tracking_shuffled_objects_five_objects"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_tracking_shuffled_objects_five_objects"
lm_eval/tasks/bbh/flan_zeroshot/tracking_shuffled_objects_seven_objects.yaml → lm_eval/tasks/bbh/zeroshot/tracking_shuffled_objects_seven_objects.yaml

 "dataset_name": "tracking_shuffled_objects_seven_objects"
 "description": "A task requiring determining the final positions of a set of objects given their initial positions and a description of a sequence of swaps.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_tracking_shuffled_objects_seven_objects"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_tracking_shuffled_objects_seven_objects"
lm_eval/tasks/bbh/flan_zeroshot/tracking_shuffled_objects_three_objects.yaml → lm_eval/tasks/bbh/zeroshot/tracking_shuffled_objects_three_objects.yaml

 "dataset_name": "tracking_shuffled_objects_three_objects"
 "description": "A task requiring determining the final positions of a set of objects given their initial positions and a description of a sequence of swaps.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_tracking_shuffled_objects_three_objects"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_tracking_shuffled_objects_three_objects"
lm_eval/tasks/bbh/flan_zeroshot/web_of_lies.yaml → lm_eval/tasks/bbh/zeroshot/web_of_lies.yaml

 "dataset_name": "web_of_lies"
 "description": "Evaluate a random boolean function expressed as a word problem.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_web_of_lies"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_web_of_lies"
lm_eval/tasks/bbh/flan_zeroshot/word_sorting.yaml → lm_eval/tasks/bbh/zeroshot/word_sorting.yaml

 "dataset_name": "word_sorting"
 "description": "Sort a list of words.\n\n"
 "doc_to_text": "Q: {{input}}\nA:"
-"include": "_flan_zeroshot_template_yaml"
-"task": "bbh_flan_zeroshot_word_sorting"
+"include": "_zeroshot_template_yaml"
+"task": "bbh_zeroshot_word_sorting"
lm_eval/tasks/bigbench/generate_until_template_yaml

-group: bigbench
-dataset_path: bigbench # will switch to `hails/bigbench` when all tasks are pushed
+group: bigbench_generate_until
+dataset_path: hails/bigbench
 output_type: generate_until
 dataset_kwargs:
   # num_shots: 0 # TODO: num of shots for `bigbench` HF dataset should be controlled through this, not through the typical methods
...
lm_eval/tasks/bigbench/multiple_choice_template_yaml

-group: bigbench
-dataset_path: bigbench # will switch to `hails/bigbench` when all tasks are pushed
+group: bigbench_multiple_choice
+dataset_path: hails/bigbench
 dataset_kwargs:
   # num_shots: 0 # TODO: num of shots for `bigbench` HF dataset should be controlled through this, not through the typical methods
   # subtask_name: null
...
lm_eval/tasks/minerva_math/utils.py

 import datasets
 import re
 import signal
-from lm_eval.logger import eval_logger
+from lm_eval.utils import eval_logger
 from typing import Optional, List, Dict

 try:
...
lm_eval/tasks/realtoxicityprompts/metric.py

@@ -3,7 +3,7 @@ import json
 import requests
 import numpy as np
-from lm_eval.logger import eval_logger
+from lm_eval.utils import eval_logger

 def toxicity_perspective_api(references, predictions, **kwargs):
...
lm_eval/tasks/scrolls/README.md (new file, mode 100644)

"""
SCROLLS: Standardized CompaRison Over Long Language Sequences
https://arxiv.org/abs/2201.03533

SCROLLS is a suite of datasets that require synthesizing information over long texts.
The benchmark includes seven natural language tasks across multiple domains,
including summarization, question answering, and natural language inference.

Homepage: https://www.scrolls-benchmark.com/

Since SCROLLS tasks are generally longer than the maximum sequence length of many models,
it is possible to create "subset" tasks that contain only those samples whose tokenized length
is less than some pre-defined limit. For example, to create a subset of "Qasper" that would
be suitable for a model using the GPTNeoX tokenizer and a 4K maximum sequence length:

```
class QasperGPTNeoX4K(Qasper):
    PRUNE_TOKENIZERS = ["EleutherAI/pythia-410m-deduped"]
    PRUNE_MAX_TOKENS = 4096
    PRUNE_NUM_PROC = _num_cpu_cores()  # optional, to speed up pruning of large datasets like NarrativeQA
```

`PRUNE_TOKENIZERS` can contain more than one tokenizer; this will include only samples that are
less than `PRUNE_MAX_TOKENS` for ALL of the tokenizers. This can be useful for comparing models
that use different tokenizers but the same maximum sequence length.

Once the subset task class has been defined in this file, it can be used by adding the class
to `lm_eval/tasks/__init__.py`.

NOTE: GovReport may need `max_gen_toks` set larger for causal models.
"""
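The multi-tokenizer pruning that the README describes can be sketched in a few lines. This is a simplified illustration, not the harness's implementation: `toy_tokenize` is a hypothetical stand-in for a real tokenizer's `encode()` (the harness uses Hugging Face tokenizers such as `EleutherAI/pythia-410m-deduped`), so real token counts will differ.

```python
def toy_tokenize(text):
    # Hypothetical stand-in for tokenizer.encode(); splits on whitespace
    # so the example is self-contained.
    return text.split()

def prune(samples, tokenize_fns, max_tokens):
    # Keep only samples whose tokenized length is within max_tokens
    # for ALL of the given tokenizers, mirroring PRUNE_TOKENIZERS /
    # PRUNE_MAX_TOKENS semantics.
    return [
        s for s in samples
        if all(len(fn(s)) <= max_tokens for fn in tokenize_fns)
    ]

docs = ["short doc", "a much longer document " * 3]
kept = prune(docs, [toy_tokenize], max_tokens=5)
# kept contains only "short doc"; the 12-"token" document is pruned.
```

Because the filter requires every tokenizer to stay under the limit, adding a second tokenizer to the list can only shrink the subset, which is what makes cross-tokenizer comparisons fair.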
lm_eval/tasks/scrolls/scrolls.yaml (new file, mode 100644)

group: scrolls
task:
  - scrolls_qasper
  - scrolls_quality
  - scrolls_narrativeqa
  - scrolls_contractnli
  - scrolls_govreport
  - scrolls_summscreenfd
  - scrolls_qmsum