Commit c26a6ac7, authored by Hanwool Albert Lee, committed by GitHub

Added KMMLU evaluation method and changed ReadMe (#1447)



* update kmmlu default formatting

* Update _default_kmmlu_yaml

* Delete lm_eval/tasks/kmmlu/utils.py

* new tasks implemented

* add direct tasks

* update direct evaluate

* update direct eval

* add cot sample

* update cot

* add cot

* Update _cot_kmmlu_yaml

* add kmmlu90

* Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml

* Create kmmlu90.yaml

* Update _cot_kmmlu_yaml

* add direct

* Update _cot_kmmlu_yaml

* Update and rename kmmlu90.yaml to kmmlu90_cot.yaml

* Update kmmlu90_direct.yaml

* add kmmlu hard

* Update _cot_kmmlu_yaml

* Update _cot_kmmlu_yaml

* update cot

* update cot

* erase typo

* Update _cot_kmmlu_yaml

* update cot

* Rename dataset to match k-mmlu-hard

* removed kmmlu90

* fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README

* applied pre-commit before pull requests

* rename datasets and add notes

* Remove DS_Store cache

* Update lm_eval/tasks/kmmlu/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Change citations and reflect reviews on version

* Added kmmlu_hard and fixed other errors

* fixing minor errors

* remove duplicated

* Rename files

* try ".index"

* minor fix

* minor fix again

* fix revert.

* minor fix. thanks to Hailey

---------
Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 5ab295c8
dataset_name: law
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_law
dataset_name: machine_design_and_manufacturing
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_machine_design_and_manufacturing
dataset_name: management
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_management
dataset_name: maritime_engineering
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_maritime_engineering
dataset_name: marketing
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_marketing
dataset_name: materials_engineering
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_materials_engineering
dataset_name: math
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_math
dataset_name: mechanical_engineering
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_mechanical_engineering
dataset_name: nondestructive_testing
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_nondestructive_testing
dataset_name: patent
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_patent
dataset_name: political_science_and_sociology
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_political_science_and_sociology
dataset_name: psychology
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_psychology
dataset_name: public_safety
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_public_safety
dataset_name: railway_and_automotive_engineering
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_railway_and_automotive_engineering
dataset_name: real_estate
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_real_estate
dataset_name: refrigerating_machinery
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_refrigerating_machinery
dataset_name: social_welfare
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_social_welfare
dataset_name: taxation
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_taxation
dataset_name: telecommunications_and_wireless_technology
include: _direct_hard_kmmlu_yaml
task: kmmlu_hard_direct_telecommunications_and_wireless_technology
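
The per-subject configs above all follow the same three-line pattern: a `dataset_name`, an `include` of the shared `_direct_hard_kmmlu_yaml` file, and a `task` name built from the `kmmlu_hard_direct_` prefix plus the subject. As a minimal sketch of that pattern (the helper `subject_config` is hypothetical and not part of this commit, and the commit does not say how the files were actually produced), the text of each config could be generated like this:

```python
def subject_config(subject: str) -> str:
    """Return the three-line YAML body used by each per-subject task file.

    Hypothetical helper: it only mirrors the pattern visible in the diff.
    """
    return (
        f"dataset_name: {subject}\n"
        "include: _direct_hard_kmmlu_yaml\n"
        f"task: kmmlu_hard_direct_{subject}\n"
    )


# A few subject names taken from the diff above.
subjects = ["law", "math", "patent"]

# Map each task name to the YAML text for that subject.
configs = {f"kmmlu_hard_direct_{s}": subject_config(s) for s in subjects}

print(configs["kmmlu_hard_direct_law"])
```

Keeping the shared settings in one included file means a change such as a new prompt format only has to be made once.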
group: kmmlu
dataset_path: HAERAE-HUB/K-MMLU-Preview
group:
- kmmlu
- kmmlu_hard
dataset_path: HAERAE-HUB/KMMLU-HARD
output_type: multiple_choice
training_split: train
validation_split: dev
test_split: test
fewshot_split: dev
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n정답:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{['A', 'B', 'C', 'D'][answer-1]}}"
doc_to_target: "{{answer-1}}"
metric_list:
- metric: acc
aggregation: mean
@@ -17,4 +16,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
version: 1.1
version: 2.0
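
To make the shared config easier to read, here is a rough plain-Python sketch of what the `doc_to_text` and `{{answer-1}}` templates compute. It assumes, as the `answer-1` in both `doc_to_target` variants suggests, that `answer` is a 1-based integer. The sample document is invented for illustration and is not taken from the dataset, and the harness itself renders these templates with Jinja rather than with functions like these.

```python
CHOICES = ["A", "B", "C", "D"]  # mirrors doc_to_choice


def doc_to_text(doc: dict) -> str:
    """Build the prompt: question, four lettered options, then the answer cue."""
    return (
        f"{doc['question'].strip()}\n"
        f"A. {doc['A']}\nB. {doc['B']}\nC. {doc['C']}\nD. {doc['D']}\n"
        "정답:"  # Korean for "Answer:"
    )


def doc_to_target(doc: dict) -> int:
    """Convert the (assumed) 1-based answer into a 0-based index into CHOICES."""
    return doc["answer"] - 1


# Invented sample document, for illustration only.
doc = {"question": " 1 + 1 = ? ", "A": "1", "B": "2", "C": "3", "D": "4", "answer": 2}

print(doc_to_text(doc))
print(CHOICES[doc_to_target(doc)])
```

The other `doc_to_target` variant shown in the diff performs the same lookup inside the template and yields the letter directly instead of the index.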