gaoqiong / lm-evaluation-harness
Commit 9822b06e (Unverified)
Authored Mar 01, 2024 by Lintang Sutawika
Committed by GitHub, Mar 01, 2024
Merge branch 'main' into weight_by_size
Parents: 51f27158, b177c82c
Changes: 656
Showing 20 changed files with 142 additions and 20 deletions
lm_eval/tasks/ammlu/ammlu_professional_accounting.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_professional_law.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_professional_medicine.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_professional_psychology.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_public_relations.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_security_studies.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_sociology.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_us_foreign_policy.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_virology.yaml +4 -0
lm_eval/tasks/ammlu/ammlu_world_religions.yaml +4 -0
lm_eval/tasks/arc/README.md +1 -1
lm_eval/tasks/bbh/_generate_configs.py +3 -3
lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml +1 -0
lm_eval/tasks/bbh/cot_zeroshot/_cot_zeroshot_template_yaml +10 -10
lm_eval/tasks/bbh/cot_zeroshot/boolean_expressions.yaml +14 -1
lm_eval/tasks/bbh/cot_zeroshot/causal_judgement.yaml +14 -1
lm_eval/tasks/bbh/cot_zeroshot/date_understanding.yaml +16 -1
lm_eval/tasks/bbh/cot_zeroshot/disambiguation_qa.yaml +16 -1
lm_eval/tasks/bbh/cot_zeroshot/dyck_languages.yaml +13 -1
lm_eval/tasks/bbh/cot_zeroshot/formal_fallacies.yaml +14 -1
lm_eval/tasks/ammlu/ammlu_professional_accounting.yaml
0 → 100644
+"dataset_name": "professional_accounting"
+"description": "فم بعملية التقييم في مجال علوم أخرى\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_professional_accounting"
lm_eval/tasks/ammlu/ammlu_professional_law.yaml
0 → 100644
+"dataset_name": "professional_law"
+"description": "فم بعملية التقييم في مجال العلوم الانسانية\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_professional_law"
lm_eval/tasks/ammlu/ammlu_professional_medicine.yaml
0 → 100644
+"dataset_name": "professional_medicine"
+"description": "فم بعملية التقييم في مجال علوم أخرى\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_professional_medicine"
lm_eval/tasks/ammlu/ammlu_professional_psychology.yaml
0 → 100644
+"dataset_name": "professional_psychology"
+"description": "فم بعملية التقييم في مجال العلوم الإجتماعية\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_professional_psychology"
lm_eval/tasks/ammlu/ammlu_public_relations.yaml
0 → 100644
+"dataset_name": "public_relations"
+"description": "فم بعملية التقييم في مجال العلوم الإجتماعية\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_public_relations"
lm_eval/tasks/ammlu/ammlu_security_studies.yaml
0 → 100644
+"dataset_name": "security_studies"
+"description": "فم بعملية التقييم في مجال العلوم الإجتماعية\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_security_studies"
lm_eval/tasks/ammlu/ammlu_sociology.yaml
0 → 100644
+"dataset_name": "sociology"
+"description": "فم بعملية التقييم في مجال العلوم الإجتماعية\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_sociology"
lm_eval/tasks/ammlu/ammlu_us_foreign_policy.yaml
0 → 100644
+"dataset_name": "us_foreign_policy"
+"description": "فم بعملية التقييم في مجال العلوم الإجتماعية\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_us_foreign_policy"
lm_eval/tasks/ammlu/ammlu_virology.yaml
0 → 100644
+"dataset_name": "virology"
+"description": "فم بعملية التقييم في مجال علوم أخرى\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_virology"
lm_eval/tasks/ammlu/ammlu_world_religions.yaml
0 → 100644
+"dataset_name": "world_religions"
+"description": "فم بعملية التقييم في مجال العلوم الانسانية\n\n"
+"include": "_default_template_yaml"
+"task": "ammlu_world_religions"
lm_eval/tasks/arc/README.md
...
@@ -38,7 +38,7 @@ Homepage: https://allenai.org/data/arc
 #### Tasks
 * `arc_easy`
-* `arc_challange`
+* `arc_challenge`
 ### Checklist
...
lm_eval/tasks/bbh/_generate_configs.py
"""
Take in a YAML, and output all other splits with this YAML
"""
import argparse
import os
import re
import yaml
import requests
import argparse
import datasets
import requests
import yaml
from tqdm import tqdm
from lm_eval import utils
...
lm_eval/tasks/bbh/cot_fewshot/_cot_fewshot_template_yaml
...
@@ -28,3 +28,4 @@ filter_list:
 num_fewshot: 0
 metadata:
   version: 2.0
+  num_fewshot: 3 # controls what is printed in n-shot
lm_eval/tasks/bbh/cot_zeroshot/_cot_zeroshot_template_yaml
...
@@ -7,21 +7,21 @@ metric_list:
   - metric: exact_match
     aggregation: mean
     higher_is_better: true
-    # ignore_case: true
+    ignore_case: true
     # ignore_punctuation: true
+    regexes_to_ignore:
+      - "\\.$"
+      - ","
+      - "\\\\"
+      - "\n"
+      - '"'
 generation_kwargs:
   until:
     - "</s>"
-    - "Q"
-    - "\n\n"
+    - "Q:"
+    - "<|im_end|>"
   do_sample: false
   temperature: 0.0
-filter_list:
-  - name: "get-answer"
-    filter:
-      - function: "regex"
-        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
-      - function: "take_first"
 num_fewshot: 0
 metadata:
-  version: 1.0
+  version: 2.0
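The `ignore_case` and `regexes_to_ignore` settings in this template change how `exact_match` compares a prediction with its target. The sketch below approximates that behavior in plain Python; it is not the harness's own code. `normalize` and `exact_match` are hypothetical helper names, and the order of operations (regex stripping first, then lowercasing) is an assumption.

```python
import re

# The five patterns from the template, written as Python raw strings.
# YAML "\\.$" is the regex \.$ and YAML "\\\\" is the regex \\ (one literal backslash).
REGEXES_TO_IGNORE = [r"\.$", ",", r"\\", "\n", '"']


def normalize(text: str, ignore_case: bool = True) -> str:
    """Strip every ignored pattern, then optionally lowercase (assumed order)."""
    for pattern in REGEXES_TO_IGNORE:
        text = re.sub(pattern, "", text)
    return text.lower() if ignore_case else text


def exact_match(prediction: str, reference: str) -> bool:
    """Compare two strings after applying the same normalization to both."""
    return normalize(prediction) == normalize(reference)


print(exact_match('"True".', "true"))
print(exact_match("1,000", "1000"))
```

Because the patterns are applied in list order, a trailing period is only removed if it is the last character at that point; a period that sits inside closing quotes survives.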
lm_eval/tasks/bbh/cot_zeroshot/boolean_expressions.yaml
 "dataset_name": "boolean_expressions"
 "description": "Evaluate the result of a random Boolean expression.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_boolean_expressions"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "\\b(True|False)\\b"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
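The `flexible-extract` filter pairs a loose pattern with `group_select: -1`. A minimal sketch of what that combination appears to do follows, assuming `group_select` indexes into the list of all matches so that `-1` picks the last one. `flexible_extract` and the `"[invalid]"` fallback value are illustrative assumptions, not confirmed harness internals.

```python
import re

# Same pattern as the YAML value "\\b(True|False)\\b".
FLEXIBLE_PATTERN = re.compile(r"\b(True|False)\b")


def flexible_extract(response: str, group_select: int = -1,
                     fallback: str = "[invalid]") -> str:
    """Return the match chosen by group_select, or the fallback if none."""
    matches = FLEXIBLE_PATTERN.findall(response)
    if not matches:
        return fallback
    return matches[group_select]


# A chain-of-thought answer mentions several booleans; the last one wins.
print(flexible_extract("not True is False, and False or False is False."))
```

Taking the last match suits step-by-step reasoning, where intermediate values come first and the conclusion comes last.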
lm_eval/tasks/bbh/cot_zeroshot/causal_judgement.yaml
 "dataset_name": "causal_judgement"
 "description": "Answer questions about causal attribution.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_causal_judgement"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "\\b(Yes|No|yes|no)\\b"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/date_understanding.yaml
 "dataset_name": "date_understanding"
 "description": "Infer the date from context.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_date_understanding"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/disambiguation_qa.yaml
 "dataset_name": "disambiguation_qa"
 "description": "Clarify the meaning of sentences with ambiguous pronouns.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_disambiguation_qa"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: !function utils.MultiChoiceRegexFilter
+        group_select: -1
+        ignore_case: true
+        ignore_punctuation: true
+        regex_pattern: "(\\([A-Z]\\))"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/dyck_languages.yaml
 "dataset_name": "dyck_languages"
 "description": "Correctly close a Dyck-n word.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_dyck_languages"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "(?<= )([\"\\[\\(<{}>\\)\\]]+)|([\"\\[\\(<{}>\\)\\]]+)"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
lm_eval/tasks/bbh/cot_zeroshot/formal_fallacies.yaml
 "dataset_name": "formal_fallacies"
 "description": "Distinguish deductively valid arguments from formal fallacies.\n\n"
-"doc_to_text": "Q: {{input}}\nA: Let's think step by step.\n"
+"doc_to_text": "Q: {{input}}\nA: Let's think step by step."
 "include": "_cot_zeroshot_template_yaml"
 "task": "bbh_cot_zeroshot_formal_fallacies"
+filter_list:
+  - name: "flexible-extract"
+    filter:
+      - function: "regex"
+        group_select: -1
+        regex_pattern: "\\b(valid|invalid)\\b"
+      - function: "take_first"
+  - name: "strict-match"
+    filter:
+      - function: "regex"
+        regex_pattern: "((?<=The answer is )(.*)(?=.)|(?<=the answer is )(.*)(?=.)|(?<=The answer: )(.*)(?=.)|(?<=The final answer: )(.*)(?=.))"
+      - function: "take_first"
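The `strict-match` pattern shared by these task files relies on lookbehinds plus a trailing `(?=.)`, which is easy to misread. The sketch below exercises the same regex with Python's `re` module. `strict_match` is a hypothetical helper, the `"[invalid]"` fallback is an assumption, and using `search` with `group(1)` only approximates whatever the harness's `regex` filter does internally.

```python
import re

# The strict-match pattern from the YAML, split across lines for readability.
STRICT_PATTERN = re.compile(
    r"((?<=The answer is )(.*)(?=.)"
    r"|(?<=the answer is )(.*)(?=.)"
    r"|(?<=The answer: )(.*)(?=.)"
    r"|(?<=The final answer: )(.*)(?=.))"
)


def strict_match(response: str, fallback: str = "[invalid]") -> str:
    """Return the text after an answer phrase, minus the line's last character."""
    match = STRICT_PATTERN.search(response)
    return match.group(1) if match else fallback


print(strict_match("So the answer is valid."))   # trailing period is dropped
print(strict_match("The answer is (A)"))         # no period, so ")" is dropped
print(strict_match("I think it is invalid"))     # no answer phrase: fallback
```

The `(?=.)` lookahead forces at least one character to remain after the capture, so the pattern always trims the final character of the line. That works when the answer ends with a period and truncates the answer when it does not.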