gaoqiong / lm-evaluation-harness
Commit b2c090cc (Unverified)
Authored Jan 22, 2025 by Minho Ryu; committed by GitHub Jan 21, 2025
aggregate by group (total and categories) (#2643)
parent ed9c6fc8
Changes 204
Showing 20 changed files with 68 additions and 14 deletions (+68 -14)
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_nondestructive_testing.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_patent.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_political_science_and_sociology.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_psychology.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_public_safety.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_railway_and_automotive_engineering.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_real_estate.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_refrigerating_machinery.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_social_welfare.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_taxation.yaml +2 -1
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_telecommunications_and_wireless_technology.yaml +2 -1
lm_eval/tasks/kmmlu/direct/_direct_kmmlu_yaml +0 -3
lm_eval/tasks/kmmlu/direct/_kmmlu_direct.yaml +11 -0
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_applied_science.yaml +8 -0
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_humss.yaml +8 -0
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_other.yaml +8 -0
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_stem.yaml +8 -0
lm_eval/tasks/kmmlu/direct/kmmlu_direct_accounting.yaml +1 -0
lm_eval/tasks/kmmlu/direct/kmmlu_direct_agricultural_sciences.yaml +1 -0
lm_eval/tasks/kmmlu/direct/kmmlu_direct_aviation_engineering_and_maintenance.yaml +1 -0
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_nondestructive_testing.yaml
@@ -91,4 +91,5 @@ fewshot_config:
시험체의 두께 t를 계산하면 다음과 같습니다. t = v / (2f) = 4800 / (2 * 2 * 10^6) = 0.0012m = 1.2mm
따라서, 정답은 (A) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_nondestructive_testing
+task: kmmlu_cot_hard_nondestructive_testing
+tag: kmmlu_cot_hard_applied_science_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_patent.yaml
@@ -110,4 +110,5 @@ fewshot_config:
발명에 대해서는 먼저 출원한 자만이 특허를 받을 수 있다고 규정하고 있으므로, 乙은 특허를 받을 수 없습니다. , (D)는 옳은 설명입니다.
따라서, 정답은 (A) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_patent
+task: kmmlu_cot_hard_patent
+tag: kmmlu_cot_hard_other_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_political_science_and_sociology.yaml
@@ -88,4 +88,5 @@ fewshot_config:
범위에서도 활용되는 전략입니다. 도시의 이미지를 국제적으로 홍보하고, 외국인 투자자나 관광객을 유치하는 것이 도시마케팅의 일부이기 때문입니다. 도시마케팅의
공간적 범위가 국내로만 한정되어 있다는 것은 잘못된 설명입니다. 따라서, 정답은 (D) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_political_science_and_sociology
+task: kmmlu_cot_hard_political_science_and_sociology
+tag: kmmlu_cot_hard_humss_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_psychology.yaml
@@ -95,4 +95,5 @@ fewshot_config:
이러한 분석을 통해 고급 상담자의 특징은 (C) 내담자에게 의도적으로 주의를 기울이고 중요한 정보를 수집하고 인식할 수 있다는 것으로 보입니다.
이는 상담자의 기본적인 역량을 넘어서서 고급 상담자가 갖추어야 할 능력으로 보입니다. 따라서, 정답은 (C) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_psychology
+task: kmmlu_cot_hard_psychology
+tag: kmmlu_cot_hard_humss_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_public_safety.yaml
@@ -90,4 +90,5 @@ fewshot_config:
산업안전ᆞ보건과 관련된 그 밖의 사항 , 선택지 중에서 산업안전보건위원회의 심의ᆞ의결을 거치지 않아도 되는 사항은 (B) 안전ᆞ보건과
관련된 안전장치 구입 시의 적격품 여부 확인에 관한 사항입니다. 따라서, 정답은 (B) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_public_safety
+task: kmmlu_cot_hard_public_safety
+tag: kmmlu_cot_hard_other_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_railway_and_automotive_engineering.yaml
@@ -85,4 +85,5 @@ fewshot_config:
위한 것입니다. (D) 기관의 과냉 및 소음방지를 위해 일정 회전수 이상 시 슬립 발생: 유체 커플링식 냉각 팬은 기관의 과냉 및 소음 방지를 위해 일정 회전수 이상 시 슬립이 발생합니다. 이는 유체 커플링의 특성 때문입니다. 따라서, 정답은 (A) 입니다.'
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_railway_and_automotive_engineering
+task: kmmlu_cot_hard_railway_and_automotive_engineering
+tag: kmmlu_cot_hard_applied_science_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_real_estate.yaml
@@ -88,4 +88,5 @@ fewshot_config:
따르면 개업공인중개사는 등록한 관할구역 외의 지역에 있는 중개대상물을 중개할 수 있습니다. 이 내용은 잘못된 내용입니다. 따라서, 정답은
(D) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_real_estate
+task: kmmlu_cot_hard_real_estate
+tag: kmmlu_cot_hard_other_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_refrigerating_machinery.yaml
@@ -88,4 +88,5 @@ fewshot_config:
= 200.15K입니다. 그러므로, W = 1kJ * (300.15K - 200.15K) / 200.15K = 0.5kJ입니다. 따라서, 정답은 (D) 입니다.'
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_refrigerating_machinery
+task: kmmlu_cot_hard_refrigerating_machinery
+tag: kmmlu_cot_hard_other_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_social_welfare.yaml
@@ -90,4 +90,5 @@ fewshot_config:
이는 사회복지정책의 본질적인 목표와 원칙을 반영하지 못하고 있습니다. 사회복지정책은 능력이 아닌 필요에 따라 지원을 하는 것이 원칙이며,
이를 통해 사회적 불평등을 해소하고 모든 사람이 기본적인 생활을 유지할 수 있도록 지원하는 것이 목표입니다. 따라서, 정답은 (B) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_social_welfare
+task: kmmlu_cot_hard_social_welfare
+tag: kmmlu_cot_hard_humss_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_taxation.yaml
@@ -104,4 +104,5 @@ fewshot_config:
국가의 안전보장 목적의 수행상 긴요하다고 인정하여 수입하는 물품을 의미합니다. 이 또한 국가의 안전보장을 위해 필요한 물품이므로 면세 대상에
해당할 것으로 보입니다. 따라서, 정답은 (A) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_taxation
+task: kmmlu_cot_hard_taxation
+tag: kmmlu_cot_hard_humss_tasks
lm_eval/tasks/kmmlu/cot_hard/kmmlu_cot_hard_telecommunications_and_wireless_technology.yaml
@@ -83,4 +83,5 @@ fewshot_config:
증가하면, 전자기파의 세기는 1/r^2배 감소합니다. , 거리가 2배가 되면, 전자기파의 세기는 1/4배가 됩니다. 그리고 전력 밀도는
전기장과 자기장의 제곱에 비례하므로, 거리가 2배가 되면 전력 밀도는 1/4배가 됩니다. 따라서, 정답은 (D) 입니다.
include: _cot_kmmlu_yaml
-task: kmmlu_hard_cot_telecommunications_and_wireless_technology
+task: kmmlu_cot_hard_telecommunications_and_wireless_technology
+tag: kmmlu_cot_hard_applied_science_tasks
lm_eval/tasks/kmmlu/direct/_direct_kmmlu_yaml
-tag:
-  - kmmlu
-  - kmmlu_direct
dataset_path: HAERAE-HUB/KMMLU
output_type: generate_until
test_split: test
...
...
lm_eval/tasks/kmmlu/direct/_kmmlu_direct.yaml
0 → 100644
+group: kmmlu_direct
+task:
+  - kmmlu_direct_stem
+  - kmmlu_direct_other
+  - kmmlu_direct_applied_science
+  - kmmlu_direct_humss
+aggregate_metric_list:
+  - metric: exact_match
+    weight_by_size: True
+metadata:
+  version: 2.0
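The `aggregate_metric_list` entry above asks for one `exact_match` score per group, with `weight_by_size: True`. As a rough illustration of what size weighting means, here is a minimal sketch of a size-weighted mean next to a plain mean. It is not the harness's own code: the `aggregate` helper and the scores and subtask sizes are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def aggregate(scores, sizes, weight_by_size=True):
    """Combine per-subtask scores into a single group score.

    With weight_by_size=True each subtask counts in proportion to its
    number of examples; otherwise every subtask counts equally.
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    if not weight_by_size:
        return sum(scores) / len(scores)
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical exact_match scores and example counts for two subtasks.
scores = [0.5, 0.25]
sizes = [100, 300]

weighted = aggregate(scores, sizes)                          # (50 + 75) / 400
unweighted = aggregate(scores, sizes, weight_by_size=False)  # (0.5 + 0.25) / 2
```

With these made-up numbers the larger subtask pulls the weighted score toward its own value, which is why the two results differ.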
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_applied_science.yaml
0 → 100644
+group: kmmlu_direct_applied_science
+task:
+  - kmmlu_direct_applied_science_tasks
+aggregate_metric_list:
+  - metric: exact_match
+    weight_by_size: True
+metadata:
+  version: 2.0
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_humss.yaml
0 → 100644
+group: kmmlu_direct_humss
+task:
+  - kmmlu_direct_humss_tasks
+aggregate_metric_list:
+  - metric: exact_match
+    weight_by_size: True
+metadata:
+  version: 2.0
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_other.yaml
0 → 100644
+group: kmmlu_direct_other
+task:
+  - kmmlu_direct_other_tasks
+aggregate_metric_list:
+  - metric: exact_match
+    weight_by_size: True
+metadata:
+  version: 2.0
lm_eval/tasks/kmmlu/direct/_kmmlu_direct_stem.yaml
0 → 100644
+group: kmmlu_direct_stem
+task:
+  - kmmlu_direct_stem_tasks
+aggregate_metric_list:
+  - metric: exact_match
+    weight_by_size: True
+metadata:
+  version: 2.0
lm_eval/tasks/kmmlu/direct/kmmlu_direct_accounting.yaml
dataset_name: Accounting
include: _direct_kmmlu_yaml
task: kmmlu_direct_accounting
+tag: kmmlu_direct_humss_tasks
lm_eval/tasks/kmmlu/direct/kmmlu_direct_agricultural_sciences.yaml
dataset_name: Agricultural-Sciences
include: _direct_kmmlu_yaml
task: kmmlu_direct_agricultural_sciences
+tag: kmmlu_direct_other_tasks
lm_eval/tasks/kmmlu/direct/kmmlu_direct_aviation_engineering_and_maintenance.yaml
dataset_name: Aviation-Engineering-and-Maintenance
include: _direct_kmmlu_yaml
task: kmmlu_direct_aviation_engineering_and_maintenance
+tag: kmmlu_direct_applied_science_tasks
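Each subject file now carries a category `tag`, and each category group file lists that tag under `task`. The sketch below shows one way a tag name could be expanded into its member tasks. It is plain Python over a hand-written dict built from the three task/tag pairs in this part of the diff; it is not lm-eval's loader, and the helper names `tasks_by_tag` and `expand` are hypothetical.

```python
from collections import defaultdict

# Task -> tag pairs copied from the three subject files shown in this diff.
TASK_TAGS = {
    "kmmlu_direct_accounting": "kmmlu_direct_humss_tasks",
    "kmmlu_direct_agricultural_sciences": "kmmlu_direct_other_tasks",
    "kmmlu_direct_aviation_engineering_and_maintenance": "kmmlu_direct_applied_science_tasks",
}


def tasks_by_tag(task_tags):
    """Invert a task -> tag mapping into tag -> sorted list of tasks."""
    members = defaultdict(list)
    for task, tag in task_tags.items():
        members[tag].append(task)
    return {tag: sorted(names) for tag, names in members.items()}


def expand(entries, members):
    """Replace each tag in a group's task list with its member tasks.

    Entries that are not known tags are kept unchanged.
    """
    expanded = []
    for entry in entries:
        expanded.extend(members.get(entry, [entry]))
    return expanded


members = tasks_by_tag(TASK_TAGS)
humss_tasks = expand(["kmmlu_direct_humss_tasks"], members)
```

Under this reading, adding a subject to a category only needs the right `tag` line in the subject's own file; the group file itself stays unchanged.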