gaoqiong / lm-evaluation-harness
Commit 314f7176 (Unverified)
Authored Jul 23, 2025 by Baber Abbasi; committed by GitHub, Jul 23, 2025

remove trust-remote-code in configs; fix escape sequences (#3180)
* remove trust-remote-code
* add W605 rule

parent 8c6fde08
Changes: 98
Showing 20 changed files with 6 additions and 40 deletions
lm_eval/tasks/mediqa_qa2019/mediqa_qa2019_perplexity.yaml +0 -2
lm_eval/tasks/minerva_math/minerva_math_algebra.yaml +1 -3
lm_eval/tasks/mlqa/mlqa_common_yaml +0 -2
lm_eval/tasks/mmlu/continuation/_continuation_template_yaml +0 -2
lm_eval/tasks/mmlu/default/_default_template_yaml +0 -2
lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu_flan_cot_fewshot_template_yaml +0 -2
lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml +0 -2
lm_eval/tasks/mmlu/flan_cot_zeroshot/utils.py +2 -2
lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml +0 -2
lm_eval/tasks/mmlu/flan_n_shot/generative/utils.py +2 -2
lm_eval/tasks/mmlu/flan_n_shot/loglikelihood/_mmlu_flan_loglikelihood_template_yaml +0 -2
lm_eval/tasks/mmlu/generative/_default_template_yaml +0 -2
lm_eval/tasks/mmlu_prox/mmlu_prox_config_generator.py +1 -1
lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_nlp_survey.yaml +0 -2
lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_philpapers2020.yaml +0 -2
lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_political_typology_quiz.yaml +0 -2
lm_eval/tasks/mutual/mutual.yaml +0 -2
lm_eval/tasks/noreval/tatoeba/_tatoeba_yaml +0 -2
lm_eval/tasks/piqa/piqa.yaml +0 -2
lm_eval/tasks/portuguese_bench/flores_pt/_flores_common_yaml +0 -2

lm_eval/tasks/mediqa_qa2019/mediqa_qa2019_perplexity.yaml
@@ -23,5 +23,3 @@ metric_list:
     higher_is_better: false
 metadata:
   version: 1.0
-dataset_kwargs:
-  trust_remote_code: true
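
Every YAML hunk in this commit drops the same two-line `dataset_kwargs` / `trust_remote_code: true` entry. As a rough illustration of that transformation on an already-parsed config, here is a minimal sketch. It is not the tooling used for the commit, which this page does not show; `strip_trust_remote_code` and the sample dict are hypothetical names introduced for the example.

```python
from copy import deepcopy


def strip_trust_remote_code(config: dict) -> dict:
    """Return a copy of config without dataset_kwargs.trust_remote_code.

    Hypothetical helper: mirrors the edit made by hand in the YAML files.
    """
    cleaned = deepcopy(config)
    kwargs = cleaned.get("dataset_kwargs")
    if isinstance(kwargs, dict):
        kwargs.pop("trust_remote_code", None)
        if not kwargs:
            # The mapping held nothing else, so drop it entirely,
            # as the hunks in this commit do.
            del cleaned["dataset_kwargs"]
    return cleaned


# Assumed stand-in for the tail of a parsed task config.
before = {
    "metadata": {"version": 1.0},
    "dataset_kwargs": {"trust_remote_code": True},
}
after = strip_trust_remote_code(before)
print(after)
```

The helper copies its input, so the original mapping is left untouched and any other keys under `dataset_kwargs` would be preserved.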

lm_eval/tasks/minerva_math/minerva_math_algebra.yaml
@@ -25,8 +25,6 @@ metric_list:
 num_fewshot: 4
 metadata:
   version: 2.0
-dataset_kwargs:
-  trust_remote_code: true
 fewshot_config:
   sampler: first_n
   samples: !function utils.list_fewshot_samples

lm_eval/tasks/mlqa/mlqa_common_yaml
 dataset_path: facebook/mlqa
-dataset_kwargs:
-  trust_remote_code: true
 test_split: test
 validation_split: validation
 output_type: generate_until
 ...

lm_eval/tasks/mmlu/continuation/_continuation_template_yaml
@@ -9,5 +9,3 @@ doc_to_choice: "{{choices}}"
 doc_to_target: "{{answer}}"
 metadata:
   version: 1.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mmlu/default/_default_template_yaml
@@ -13,5 +13,3 @@ metric_list:
     higher_is_better: true
 metadata:
   version: 1.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mmlu/flan_cot_fewshot/_mmlu_flan_cot_fewshot_template_yaml
@@ -26,5 +26,3 @@ metric_list:
     ignore_punctuation: true
 metadata:
   version: 2.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mmlu/flan_cot_zeroshot/_mmlu_flan_cot_zeroshot_template_yaml
@@ -34,5 +34,3 @@ metric_list:
     ignore_punctuation: true
 metadata:
   version: 3.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mmlu/flan_cot_zeroshot/utils.py
@@ -17,7 +17,7 @@ class MultiChoiceRegexFilter(RegexFilter):
         ignore_punctuation=False,
         regexes_to_ignore=None,
     ) -> None:
-        """
+        r"""
         regex_pattern: The basic regex pattern to use. If fails to match, we will use the customized match procedure
                         - step 1 : We parse the choices between ([A-Z])s then try to find these choices in the response.
                         - step 2 : We parse the choice with regex :[\s]*([A-?]), where ? varies by number of choices.
 ...
@@ -90,7 +90,7 @@ class MultiChoiceRegexFilter(RegexFilter):
             fallback_regex = re.compile("|".join(fallback_regexes))
             without_paren_fallback_regex = "|".join(without_paren_fallback_regexes)
             without_paren_fallback_regex = re.compile(
-                f":[\s]*({without_paren_fallback_regex})"
+                rf":[\s]*({without_paren_fallback_regex})"
             )

             filtered = []
 ...
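
The two `utils.py` hunks above only add an `r` prefix, which matches the "fix escape sequences" and W605 notes in the commit message. The following sketch, under the assumption that the goal is to silence Python's invalid-escape warning for `\s` without changing the pattern, compiles both spellings and checks which one warns. `escape_warnings` and the sample strings are hypothetical names for illustration only.

```python
import re
import warnings


def escape_warnings(source: str) -> list:
    """Compile a snippet and return the categories of warnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [w.category for w in caught]


# Source text of a normal string literal containing "\s".
plain_source = 'pattern = ":[\\s]*(A|B)"'
# The same literal with the r prefix, as in the new side of the diff.
raw_source = 'pattern = r":[\\s]*(A|B)"'

plain_categories = escape_warnings(plain_source)
raw_categories = escape_warnings(raw_source)

# The raw f-string form from the diff, with an assumed set of letters.
letters = "A|B|C|D"
without_paren = re.compile(rf":[\s]*({letters})")
match = without_paren.search("The answer is:  C")

print(plain_categories, raw_categories, match.group(1))
```

The warning category depends on the interpreter version (a `DeprecationWarning` on older releases, a `SyntaxWarning` on newer ones), so the sketch records categories rather than assuming one.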

lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml
@@ -30,5 +30,3 @@ metric_list:
     higher_is_better: true
 metadata:
   version: 3.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mmlu/flan_n_shot/generative/utils.py
@@ -17,7 +17,7 @@ class MultiChoiceRegexFilter(RegexFilter):
         ignore_punctuation=False,
         regexes_to_ignore=None,
     ) -> None:
-        """
+        r"""
         regex_pattern: The basic regex pattern to use. If fails to match, we will use the customized match procedure
                         - step 1 : We parse the choices between ([A-Z])s then try to find these choices in the response.
                         - step 2 : We parse the choice with regex :[\s]*([A-?]), where ? varies by number of choices.
 ...
@@ -90,7 +90,7 @@ class MultiChoiceRegexFilter(RegexFilter):
             fallback_regex = re.compile("|".join(fallback_regexes))
             without_paren_fallback_regex = "|".join(without_paren_fallback_regexes)
             without_paren_fallback_regex = re.compile(
-                f":[\s]*({without_paren_fallback_regex})"
+                rf":[\s]*({without_paren_fallback_regex})"
             )

             filtered = []
 ...
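
The docstring in the hunk above describes a two-step fallback: look for the choices themselves in the response, then look for a bare letter after a colon. A minimal sketch of that idea follows. It is an illustration only, not the body of `MultiChoiceRegexFilter`, most of which is outside the diff context. The function name `fallback_choice` and the assumption that choices arrive as a letter-to-text mapping are hypothetical.

```python
import re
from typing import Dict, Optional


def fallback_choice(response: str, choices: Dict[str, str]) -> Optional[str]:
    """Return the letter of the first choice found in response, else None."""
    # Step 1: search for the text of any choice in the response.
    by_text = re.compile("|".join(re.escape(text) for text in choices.values()))
    found = by_text.search(response)
    if found:
        text_to_letter = {text: letter for letter, text in choices.items()}
        return text_to_letter[found.group(0)]

    # Step 2: search for a bare letter after a colon, using the
    # raw f-string pattern shape shown in the diff.
    letters = "|".join(re.escape(letter) for letter in choices)
    by_letter = re.compile(rf":[\s]*({letters})")
    found = by_letter.search(response)
    return found.group(1) if found else None


# Assumed sample data for the example.
choices = {"A": "Paris", "B": "Rome"}
print(fallback_choice("I think it is Rome.", choices))
print(fallback_choice("Answer: A", choices))
print(fallback_choice("no idea", choices))
```

Building the alternation from the available letters is what lets the `[A-?]` range in the docstring vary with the number of choices.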

lm_eval/tasks/mmlu/flan_n_shot/loglikelihood/_mmlu_flan_loglikelihood_template_yaml
@@ -13,5 +13,3 @@ metric_list:
     higher_is_better: true
 metadata:
   version: 2.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mmlu/generative/_default_template_yaml
@@ -30,5 +30,3 @@ filter_list:
       - function: take_first
 metadata:
   version: 3.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mmlu_prox/mmlu_prox_config_generator.py
@@ -44,7 +44,7 @@ if __name__ == "__main__":
                 line = line.format(lang=lang_abbr)
             if "{ans_regex}" in line:
                 ans_regex = lang_lib_list[-1].replace(
-                    "({})", "\(?([ABCDEFGHIJ])\)?"
+                    "({})", r"\(?([ABCDEFGHIJ])\)?"
                 )
                 if lang_abbr == "en":
                     ans_regex = ans_regex.lstrip("the").strip()
 ...
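
The hunk above swaps a `({})` placeholder for a regex fragment that accepts an answer letter with or without parentheses. As a small sketch of what the resulting pattern matches, the example below applies the same `replace` call to a made-up template. The `template` string is an assumption introduced here; the real contents of `lang_lib_list[-1]` are not shown in the diff.

```python
import re

# Hypothetical template; the real one comes from lang_lib_list[-1].
template = "answer is ({})"

# Same replacement as the new side of the diff, using a raw string so
# that "\(" and "\)" reach the regex engine unchanged.
ans_regex = template.replace("({})", r"\(?([ABCDEFGHIJ])\)?")
pattern = re.compile(ans_regex)

with_paren = pattern.search("The answer is (C).")
without_paren = pattern.search("the answer is H")

print(ans_regex)
print(with_paren.group(1), without_paren.group(1))
```

Because `\(` is not a recognized Python string escape, the non-raw spelling produced the same runtime value; the `r` prefix only removes the invalid-escape warning.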

lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_nlp_survey.yaml
@@ -12,5 +12,3 @@ metric_list:
   - metric: acc
 metadata:
   version: 0.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_philpapers2020.yaml
@@ -12,5 +12,3 @@ metric_list:
   - metric: acc
 metadata:
   version: 0.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/model_written_evals/sycophancy/sycophancy_on_political_typology_quiz.yaml
@@ -12,5 +12,3 @@ metric_list:
   - metric: acc
 metadata:
   version: 0.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/mutual/mutual.yaml
@@ -23,5 +23,3 @@ metric_list:
     higher_is_better: true
 metadata:
   version: 2.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/noreval/tatoeba/_tatoeba_yaml
@@ -2,8 +2,6 @@ dataset_path: Helsinki-NLP/tatoeba_mt
 training_split: validation
 test_split: test
 output_type: generate_until
-dataset_kwargs:
-  trust_remote_code: true
 metric_list:
   - metric: bleu
     higher_is_better: true
 ...

lm_eval/tasks/piqa/piqa.yaml
@@ -19,5 +19,3 @@ metric_list:
     higher_is_better: true
 metadata:
   version: 1.0
-dataset_kwargs:
-  trust_remote_code: true

lm_eval/tasks/portuguese_bench/flores_pt/_flores_common_yaml
@@ -23,5 +23,3 @@ metric_list:
     higher_is_better: true
 metadata:
   version: 1.0
-dataset_kwargs:
-  trust_remote_code: true