Commit b29ef521 authored by haileyschoelkopf

update cmmlu readme

parent 09ce0c02
@@ -23,151 +23,26 @@ Homepage: https://github.com/haonan-li/CMMLU
}
```
### Groups and Tasks
| Tasks |Version|Filter| Metric |Value | |Stderr|
|--------------------------------------------|-------|------|--------------------|-----:|---|-----:|
|cmmlu |N/A |none |acc |0.2480| | |
| | |none |acc(sample agg) |0.2494| | |
| | |none |acc_norm |0.2480| | |
| | |none |acc_norm(sample agg)|0.2494| | |
|-cmmlu_modern_chinese |Yaml |none |acc |0.2500|± |0.0404|
| | |none |acc_norm |0.2500|± |0.0404|
|-cmmlu_world_history |Yaml |none |acc |0.2484|± |0.0342|
| | |none |acc_norm |0.2484|± |0.0342|
|-cmmlu_college_education |Yaml |none |acc |0.2523|± |0.0422|
| | |none |acc_norm |0.2523|± |0.0422|
|-cmmlu_international_law |Yaml |none |acc |0.2486|± |0.0319|
| | |none |acc_norm |0.2486|± |0.0319|
|-cmmlu_philosophy |Yaml |none |acc |0.1905|± |0.0385|
| | |none |acc_norm |0.1905|± |0.0385|
|-cmmlu_professional_psychology |Yaml |none |acc |0.2457|± |0.0283|
| | |none |acc_norm |0.2457|± |0.0283|
|-cmmlu_college_engineering_hydrology |Yaml |none |acc |0.2830|± |0.0440|
| | |none |acc_norm |0.2830|± |0.0440|
|-cmmlu_electrical_engineering |Yaml |none |acc |0.2442|± |0.0329|
| | |none |acc_norm |0.2442|± |0.0329|
|-cmmlu_ancient_chinese |Yaml |none |acc |0.2378|± |0.0333|
| | |none |acc_norm |0.2378|± |0.0333|
|-cmmlu_chinese_food_culture |Yaml |none |acc |0.2353|± |0.0365|
| | |none |acc_norm |0.2353|± |0.0365|
|-cmmlu_chinese_literature |Yaml |none |acc |0.2598|± |0.0308|
| | |none |acc_norm |0.2598|± |0.0308|
|-cmmlu_legal_and_moral_basis |Yaml |none |acc |0.2477|± |0.0296|
| | |none |acc_norm |0.2477|± |0.0296|
|-cmmlu_construction_project_management |Yaml |none |acc |0.2374|± |0.0362|
| | |none |acc_norm |0.2374|± |0.0362|
|-cmmlu_ethnology |Yaml |none |acc |0.2519|± |0.0375|
| | |none |acc_norm |0.2519|± |0.0375|
|-cmmlu_high_school_geography |Yaml |none |acc |0.2542|± |0.0403|
| | |none |acc_norm |0.2542|± |0.0403|
|-cmmlu_professional_medicine |Yaml |none |acc |0.2500|± |0.0224|
| | |none |acc_norm |0.2500|± |0.0224|
|-cmmlu_global_facts |Yaml |none |acc |0.2349|± |0.0348|
| | |none |acc_norm |0.2349|± |0.0348|
|-cmmlu_astronomy |Yaml |none |acc |0.2303|± |0.0329|
| | |none |acc_norm |0.2303|± |0.0329|
|-cmmlu_machine_learning |Yaml |none |acc |0.2541|± |0.0396|
| | |none |acc_norm |0.2541|± |0.0396|
|-cmmlu_high_school_politics |Yaml |none |acc |0.2378|± |0.0357|
| | |none |acc_norm |0.2378|± |0.0357|
|-cmmlu_chinese_civil_service_exam |Yaml |none |acc |0.2562|± |0.0346|
| | |none |acc_norm |0.2562|± |0.0346|
|-cmmlu_professional_law |Yaml |none |acc |0.2512|± |0.0299|
| | |none |acc_norm |0.2512|± |0.0299|
|-cmmlu_college_medical_statistics |Yaml |none |acc |0.2453|± |0.0420|
| | |none |acc_norm |0.2453|± |0.0420|
|-cmmlu_computer_security |Yaml |none |acc |0.2573|± |0.0335|
| | |none |acc_norm |0.2573|± |0.0335|
|-cmmlu_food_science |Yaml |none |acc |0.2238|± |0.0350|
| | |none |acc_norm |0.2238|± |0.0350|
|-cmmlu_security_study |Yaml |none |acc |0.2519|± |0.0375|
| | |none |acc_norm |0.2519|± |0.0375|
|-cmmlu_high_school_physics |Yaml |none |acc |0.2545|± |0.0417|
| | |none |acc_norm |0.2545|± |0.0417|
|-cmmlu_management |Yaml |none |acc |0.2476|± |0.0299|
| | |none |acc_norm |0.2476|± |0.0299|
|-cmmlu_professional_accounting |Yaml |none |acc |0.2514|± |0.0329|
| | |none |acc_norm |0.2514|± |0.0329|
|-cmmlu_human_sexuality |Yaml |none |acc |0.2222|± |0.0372|
| | |none |acc_norm |0.2222|± |0.0372|
|-cmmlu_marxist_theory |Yaml |none |acc |0.2487|± |0.0315|
| | |none |acc_norm |0.2487|± |0.0315|
|-cmmlu_agronomy |Yaml |none |acc |0.2426|± |0.0331|
| | |none |acc_norm |0.2426|± |0.0331|
|-cmmlu_chinese_teacher_qualification |Yaml |none |acc |0.2626|± |0.0330|
| | |none |acc_norm |0.2626|± |0.0330|
|-cmmlu_genetics |Yaml |none |acc |0.2273|± |0.0317|
| | |none |acc_norm |0.2273|± |0.0317|
|-cmmlu_sports_science |Yaml |none |acc |0.2727|± |0.0348|
| | |none |acc_norm |0.2727|± |0.0348|
|-cmmlu_elementary_commonsense |Yaml |none |acc |0.2424|± |0.0305|
| | |none |acc_norm |0.2424|± |0.0305|
|-cmmlu_logical |Yaml |none |acc |0.1951|± |0.0359|
| | |none |acc_norm |0.1951|± |0.0359|
|-cmmlu_chinese_history |Yaml |none |acc |0.2508|± |0.0242|
| | |none |acc_norm |0.2508|± |0.0242|
|-cmmlu_traditional_chinese_medicine |Yaml |none |acc |0.2378|± |0.0314|
| | |none |acc_norm |0.2378|± |0.0314|
|-cmmlu_elementary_mathematics |Yaml |none |acc |0.2609|± |0.0290|
| | |none |acc_norm |0.2609|± |0.0290|
|-cmmlu_nutrition |Yaml |none |acc |0.2552|± |0.0363|
| | |none |acc_norm |0.2552|± |0.0363|
|-cmmlu_chinese_foreign_policy |Yaml |none |acc |0.1776|± |0.0371|
| | |none |acc_norm |0.1776|± |0.0371|
|-cmmlu_journalism |Yaml |none |acc |0.2616|± |0.0336|
| | |none |acc_norm |0.2616|± |0.0336|
|-cmmlu_jurisprudence |Yaml |none |acc |0.2506|± |0.0214|
| | |none |acc_norm |0.2506|± |0.0214|
|-cmmlu_sociology |Yaml |none |acc |0.2478|± |0.0288|
| | |none |acc_norm |0.2478|± |0.0288|
|-cmmlu_college_mathematics |Yaml |none |acc |0.2190|± |0.0406|
| | |none |acc_norm |0.2190|± |0.0406|
|-cmmlu_computer_science |Yaml |none |acc |0.2549|± |0.0306|
| | |none |acc_norm |0.2549|± |0.0306|
|-cmmlu_conceptual_physics |Yaml |none |acc |0.2517|± |0.0359|
| | |none |acc_norm |0.2517|± |0.0359|
|-cmmlu_elementary_chinese |Yaml |none |acc |0.2817|± |0.0284|
| | |none |acc_norm |0.2817|± |0.0284|
|-cmmlu_marketing |Yaml |none |acc |0.2500|± |0.0324|
| | |none |acc_norm |0.2500|± |0.0324|
|-cmmlu_high_school_chemistry |Yaml |none |acc |0.2576|± |0.0382|
| | |none |acc_norm |0.2576|± |0.0382|
|-cmmlu_college_law |Yaml |none |acc |0.2315|± |0.0408|
| | |none |acc_norm |0.2315|± |0.0408|
|-cmmlu_chinese_driving_rule |Yaml |none |acc |0.2595|± |0.0384|
| | |none |acc_norm |0.2595|± |0.0384|
|-cmmlu_clinical_knowledge |Yaml |none |acc |0.2532|± |0.0283|
| | |none |acc_norm |0.2532|± |0.0283|
|-cmmlu_education |Yaml |none |acc |0.2761|± |0.0351|
| | |none |acc_norm |0.2761|± |0.0351|
|-cmmlu_high_school_mathematics |Yaml |none |acc |0.2927|± |0.0356|
| | |none |acc_norm |0.2927|± |0.0356|
|-cmmlu_college_actuarial_science |Yaml |none |acc |0.2736|± |0.0435|
| | |none |acc_norm |0.2736|± |0.0435|
|-cmmlu_arts |Yaml |none |acc |0.2313|± |0.0334|
| | |none |acc_norm |0.2313|± |0.0334|
|-cmmlu_public_relations |Yaml |none |acc |0.2471|± |0.0328|
| | |none |acc_norm |0.2471|± |0.0328|
|-cmmlu_college_medicine |Yaml |none |acc |0.2418|± |0.0260|
| | |none |acc_norm |0.2418|± |0.0260|
|-cmmlu_economics |Yaml |none |acc |0.2453|± |0.0342|
| | |none |acc_norm |0.2453|± |0.0342|
|-cmmlu_elementary_information_and_technology|Yaml |none |acc |0.2731|± |0.0289|
| | |none |acc_norm |0.2731|± |0.0289|
|-cmmlu_anatomy |Yaml |none |acc |0.2432|± |0.0354|
| | |none |acc_norm |0.2432|± |0.0354|
|-cmmlu_world_religions |Yaml |none |acc |0.2875|± |0.0359|
| | |none |acc_norm |0.2875|± |0.0359|
|-cmmlu_virology |Yaml |none |acc |0.2485|± |0.0333|
| | |none |acc_norm |0.2485|± |0.0333|
|-cmmlu_high_school_biology |Yaml |none |acc |0.2485|± |0.0333|
| | |none |acc_norm |0.2485|± |0.0333|
|-cmmlu_business_ethics |Yaml |none |acc |0.2584|± |0.0304|
| | |none |acc_norm |0.2584|± |0.0304|
#### Groups
|Groups|Version|Filter| Metric |Value | |Stderr|
|------|-------|------|--------------------|-----:|---|------|
|cmmlu |N/A |none |acc |0.2480| | |
| | |none |acc(sample agg) |0.2494| | |
| | |none |acc_norm |0.2480| | |
| | |none |acc_norm(sample agg)|0.2494| | |
- `cmmlu`: All 67 subjects of the CMMLU dataset, evaluated following the methodology in MMLU's original implementation.
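
The group table reports both `acc` and `acc(sample agg)`. A minimal sketch of how two such aggregates can differ, assuming (this README does not confirm it) that `acc` is the unweighted mean of per-subject accuracies and `acc(sample agg)` weights each subject by its number of questions. The function name and the numbers in the usage lines are hypothetical.

```python
from typing import Sequence, Tuple


def aggregate_accuracy(subtasks: Sequence[Tuple[float, int]]) -> Tuple[float, float]:
    """Aggregate per-subject accuracies in two ways.

    subtasks: (accuracy, num_samples) pairs, one per subject.
    Returns (unweighted_mean, sample_weighted_mean).
    Assumption: these correspond to `acc` and `acc(sample agg)`.
    """
    if not subtasks:
        raise ValueError("need at least one subtask")
    # Unweighted mean: every subject counts equally.
    unweighted = sum(acc for acc, _ in subtasks) / len(subtasks)
    # Sample-weighted mean: every question counts equally.
    total = sum(n for _, n in subtasks)
    weighted = sum(acc * n for acc, n in subtasks) / total
    return unweighted, weighted


# Hypothetical subjects, not taken from the table above.
unweighted, weighted = aggregate_accuracy([(0.5, 100), (0.25, 300)])
print(unweighted, weighted)
```

The two values coincide only when all subjects have the same number of questions or the same accuracy.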
#### Tasks
The following tasks evaluate subjects in the CMMLU dataset using loglikelihood-based multiple-choice scoring:
- `cmmlu_{subject_english}`
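
Loglikelihood-based multiple-choice scoring can be sketched as follows. This is an illustration, not the harness's actual code: it assumes the model has already produced one log-likelihood per answer choice, and that `acc_norm` divides each log-likelihood by the length of the choice text in characters. The function name and its parameters are hypothetical.

```python
from typing import Dict, Sequence


def score_multiple_choice(
    loglikelihoods: Sequence[float],
    choices: Sequence[str],
    gold_index: int,
) -> Dict[str, float]:
    """Score one question; each metric is 1.0 if correct, else 0.0."""
    if len(loglikelihoods) != len(choices) or not choices:
        raise ValueError("need one log-likelihood per choice")
    indices = range(len(choices))
    # acc: pick the choice with the highest raw log-likelihood.
    pred = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm: pick the highest length-normalized log-likelihood
    # (assumption: normalize by character count, at least 1).
    pred_norm = max(
        indices, key=lambda i: loglikelihoods[i] / max(len(choices[i]), 1)
    )
    return {
        "acc": float(pred == gold_index),
        "acc_norm": float(pred_norm == gold_index),
    }


# Hypothetical log-likelihoods for the four answer letters.
print(score_multiple_choice([-1.0, -2.0, -0.5, -3.0], ["A", "B", "C", "D"], 2))
```

When every choice has the same length, as with single-letter answers, normalization does not change the ranking and the two metrics agree.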
### Checklist
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation?
    * [x] Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?