Added KMMLU evaluation method and changed ReadMe (#1447)

* update kmmlu default formatting
* Update _default_kmmlu_yaml
* Delete lm_eval/tasks/kmmlu/utils.py
* new tasks implemented
* add direct tasks
* update direct evaluate
* update direct eval
* add cot sample
* update cot
* add cot
* Update _cot_kmmlu_yaml
* add kmmlu90
* Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml
* Create kmmlu90.yaml
* Update _cot_kmmlu_yaml
* add direct
* Update _cot_kmmlu_yaml
* Update and rename kmmlu90.yaml to kmmlu90_cot.yaml
* Update kmmlu90_direct.yaml
* add kmmlu hard
* Update _cot_kmmlu_yaml
* Update _cot_kmmlu_yaml
* update cot
* update cot
* erase typo
* Update _cot_kmmlu_yaml
* update cot
* Rename dataset to match k-mmlu-hard
* removed kmmlu90
* fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README
* applied pre-commit before pull requests
* rename datasets and add notes
* Remove DS_Store cache
* Update lm_eval/tasks/kmmlu/README.md

  Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
* Change citations and reflect reviews on version
* Added kmmlu_hard and fixed other errors
* fixing minor errors
* remove duplicated
* Rename files
* try ".index"
* minor fix
* minor fix again
* fix revert.
* minor fix. thank for hailey

---------

Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

Commit c26a6ac7, authored Feb 21, 2024 by Hanwool Albert Lee; committed by GitHub, Feb 21, 2024
20 changed files
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_law.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_law.yaml
+dataset_name: law
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_law
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_machine_design_and_manufacturing.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_machine_design_and_manufacturing.yaml
+dataset_name: machine_design_and_manufacturing
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_machine_design_and_manufacturing
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_management.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_management.yaml
+dataset_name: management
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_management
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_maritime_engineering.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_maritime_engineering.yaml
+dataset_name: maritime_engineering
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_maritime_engineering
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_marketing.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_marketing.yaml
+dataset_name: marketing
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_marketing
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_materials_engineering.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_materials_engineering.yaml
+dataset_name: materials_engineering
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_materials_engineering
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_math.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_math.yaml
+dataset_name: math
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_math
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_mechanical_engineering.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_mechanical_engineering.yaml
+dataset_name: mechanical_engineering
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_mechanical_engineering
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_nondestructive_testing.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_nondestructive_testing.yaml
+dataset_name: nondestructive_testing
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_nondestructive_testing
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_patent.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_patent.yaml
+dataset_name: patent
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_patent
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_political_science_and_sociology.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_political_science_and_sociology.yaml
+dataset_name: political_science_and_sociology
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_political_science_and_sociology
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_psychology.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_psychology.yaml
+dataset_name: psychology
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_psychology
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_public_safety.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_public_safety.yaml
+dataset_name: public_safety
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_public_safety
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_railway_and_automotive_engineering.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_railway_and_automotive_engineering.yaml
+dataset_name: railway_and_automotive_engineering
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_railway_and_automotive_engineering
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_real_estate.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_real_estate.yaml
+dataset_name: real_estate
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_real_estate
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_refrigerating_machinery.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_refrigerating_machinery.yaml
+dataset_name: refrigerating_machinery
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_refrigerating_machinery
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_social_welfare.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_social_welfare.yaml
+dataset_name: social_welfare
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_social_welfare
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_taxation.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_taxation.yaml
+dataset_name: taxation
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_taxation
--- a/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_telecommunications_and_wireless_technology.yaml
+++ b/lm_eval/tasks/kmmlu/direct_hard/kmmlu_direct_hard_telecommunications_and_wireless_technology.yaml
+dataset_name: telecommunications_and_wireless_technology
+include: _direct_hard_kmmlu_yaml
+task: kmmlu_hard_direct_telecommunications_and_wireless_technology
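
The per-subject files above all follow the same three-line pattern, differing only in the subject name. As a rough illustration (not the script the authors used, if any), the sketch below reproduces that pattern in Python. The helper names `task_filename` and `task_yaml` are hypothetical; the file and task naming schemes are copied from the diff, including the fact that file names use `direct_hard` while task names use `hard_direct`.

```python
# Hypothetical helpers that mirror the per-subject configs shown in the diff.
# Only the naming patterns come from the diff; the function names are made up.

TEMPLATE = "_direct_hard_kmmlu_yaml"  # shared config each subject file includes


def task_filename(subject: str) -> str:
    """File name pattern seen in the diff: kmmlu_direct_hard_<subject>.yaml."""
    return f"kmmlu_direct_hard_{subject}.yaml"


def task_yaml(subject: str) -> str:
    """Body of one subject file: dataset_name, include, task."""
    return (
        f"dataset_name: {subject}\n"
        f"include: {TEMPLATE}\n"
        f"task: kmmlu_hard_direct_{subject}\n"
    )


if __name__ == "__main__":
    # Three of the subjects that appear in the diff above.
    for subject in ["law", "math", "patent"]:
        print(f"# {task_filename(subject)}")
        print(task_yaml(subject))
```

Keeping the shared settings in the included template means each subject file only has to state what is unique to it.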
--- a/lm_eval/tasks/kmmlu/_default_kmmlu_yaml
+++ b/lm_eval/tasks/kmmlu/_default_kmmlu_yaml
-group: kmmlu
-dataset_path: HAERAE-HUB/K-MMLU-Preview
+group:
+    - kmmlu
+    - kmmlu_hard
+dataset_path: HAERAE-HUB/KMMLU-HARD
 output_type: multiple_choice
-training_split: train
-validation_split: dev
 test_split: test
 fewshot_split: dev
-output_type: multiple_choice
 doc_to_text: "{{question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n정답："
 doc_to_choice: ["A", "B", "C", "D"]
-doc_to_target: "{{['A', 'B', 'C', 'D'][answer-1]}}"
+doc_to_target: "{{answer-1}}"
 metric_list:
  - metric: acc
    aggregation: mean
@@ -17,4 +16,4 @@ metric_list:
    aggregation: mean
    higher_is_better: true
 metadata:
-  version: 1.1
+  version: 2.0
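
The `doc_to_target` change above swaps a letter lookup for the bare zero-based index. The sketch below is a plain-Python approximation of the two templates, not the harness's own Jinja rendering, and it assumes `answer` is a 1-based integer from 1 to 4 (which the old template's `[answer-1]` indexing implies). The sample document and the function names `render_prompt`, `target_old`, and `target_new` are invented for illustration.

```python
# Plain-Python approximation of the templates in _default_kmmlu_yaml.
# Assumption: each document has fields question, A, B, C, D and a 1-based answer.

CHOICES = ["A", "B", "C", "D"]  # mirrors doc_to_choice


def render_prompt(doc: dict) -> str:
    """Approximates doc_to_text: stripped question, four options, answer cue."""
    return (
        f"{doc['question'].strip()}\n"
        f"A. {doc['A']}\nB. {doc['B']}\nC. {doc['C']}\nD. {doc['D']}\n정답："
    )


def target_old(doc: dict) -> str:
    """Old doc_to_target: the answer letter, looked up by answer - 1."""
    return CHOICES[doc["answer"] - 1]


def target_new(doc: dict) -> int:
    """New doc_to_target: the zero-based index into doc_to_choice."""
    return doc["answer"] - 1


# Made-up sample document, not taken from the dataset.
doc = {"question": " 2 + 2 = ? ", "A": "3", "B": "4", "C": "5", "D": "6", "answer": 2}

if __name__ == "__main__":
    print(render_prompt(doc))
    print(target_old(doc), target_new(doc), CHOICES[target_new(doc)])
```

Under that assumption both forms point at the same choice; the new one simply leaves the index-to-letter mapping to `doc_to_choice`.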