Unverified Commit d5f39bf8 authored by SuperCat, committed by GitHub

Add new dataset MMLU-SR tasks (#2032)



* add mmlusr tasks

* renamed all tasks names in mmlusr

* edit format and readme

* added mmlu_sr

* mmlu_sr -> mmlusr

* update

---------
Co-authored-by: lintangsutawika <lintang@eleuther.ai>
parent cdd954f9
@@ -67,6 +67,7 @@
| [mgsm](mgsm/README.md) | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| [minerva_math](minerva_math/README.md) | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
| mmlu | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous: key terms are replaced by dummy words plus their definitions. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
| [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
| [nq_open](nq_open/README.md) | Open domain question answering tasks based on the Natural Questions dataset. | English |
...
# MMLU-SR
## Paper
Title: [Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models](https://arxiv.org/abs/2406.15468v1)
Abstract: We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that "truly" understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could appear in the question, in the answers, or in both.
Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. MMLU-SR thus provides a rigorous test of true model comprehension and poses a challenge to the broader scientific community.
Github Homepage: [https://github.com/Wang-ML-Lab/MMLU-SR](https://github.com/Wang-ML-Lab/MMLU-SR)
Huggingface Dataset: [https://huggingface.co/datasets/NiniCat/MMLU-SR](https://huggingface.co/datasets/NiniCat/MMLU-SR)
## Citation
```bibtex
@misc{wang2024reasoningsimplytokenprediction,
title={Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models},
author={Wentian Wang and Paul Kantor and Jacob Feldman and Lazaros Gallos and Hao Wang},
year={2024},
eprint={2406.15468},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.15468},
}
```
### Groups and Tasks
#### Groups
- `mmlusr`: MMLU variant where the terminology in both the questions and the answers is modified.
- `mmlusr_answer_only`: MMLU variant where the terminology in the answers is modified.
- `mmlusr_question_only`: MMLU variant where the terminology in the questions is modified.
#### Tasks
There are 57 symbol-replaced subjects in each group. You can run a single task, e.g.:
* `mmlusr_question_only_abstract_algebra`
Or a whole category at once:
* `mmlusr_question_only_stem_tasks`
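Either can also be launched programmatically. A minimal sketch using the harness's Python API (assuming lm-evaluation-harness v0.4+; the model below is only a placeholder):

```python
import lm_eval

# Evaluate one symbol-replaced subject; pass "mmlusr" instead to run the
# whole question-and-answer group.
results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["mmlusr_question_only_abstract_algebra"],
    num_fewshot=5,
)
print(results["results"])
```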
### Checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* The implementation in the original paper is one where the model is first fine-tuned on the data. They do have a few-shot evaluation for GPT-3; however, the few-shot context used here is sourced from [Lewkowycz et al](https://arxiv.org/abs/2206.14858). The achieved accuracy on Llama-2 models is comparable to that provided in the paper, though not identical.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
### Variant Wishlist
- [ ] zero-shot variant
group: mmlusr_answer_only
group_alias: MMLU-SR (Answer Only)
task:
  - group: mmlusr_ao_stem
    group_alias: STEM (Answer Only)
    task:
      - mmlusr_answer_only_stem_tasks
    aggregate_metric_list:
      - metric: acc
        weight_by_size: True
    metadata:
      version: 1
  - group: mmlusr_ao_other
    group_alias: Other (Answer Only)
    task:
      - mmlusr_answer_only_other_tasks
    aggregate_metric_list:
      - metric: acc
        weight_by_size: True
    metadata:
      version: 1
  - group: mmlusr_ao_social_sciences
    group_alias: Social Sciences (Answer Only)
    task:
      - mmlusr_answer_only_social_sciences_tasks
    aggregate_metric_list:
      - metric: acc
        weight_by_size: True
    metadata:
      version: 1
  - group: mmlusr_ao_humanities
    group_alias: Humanities (Answer Only)
    task:
      - mmlusr_answer_only_humanities_tasks
    aggregate_metric_list:
      - metric: acc
        weight_by_size: True
    metadata:
      version: 1
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 1
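Note that `weight_by_size: True` pools subtask scores by example count rather than averaging per-subtask means. A minimal sketch of that aggregation (hypothetical helper, not the harness's internal code):

```python
# Size-weighted accuracy: each subtask contributes in proportion to its
# number of examples rather than counting equally.
def weighted_acc(subtask_results):
    """subtask_results: list of (accuracy, num_examples) pairs."""
    total = sum(n for _, n in subtask_results)
    return sum(acc * n for acc, n in subtask_results) / total

# Example: two subtasks with 100 and 300 examples.
print(weighted_acc([(0.50, 100), (0.70, 300)]))  # -> 0.65
```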
dataset_path: NiniCat/MMLU-SR
test_split: test
fewshot_split: train
fewshot_config:
  sampler: first_n
output_type: multiple_choice
process_docs: !function utils.process_docs
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
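The template above delegates preprocessing to `utils.process_docs`, which reshapes each raw dataset row into the `question` / `choices` / `answer` fields that `doc_to_text` and `doc_to_target` consume. A rough illustration only; the real helper ships in the task's `utils.py`, and the raw column names used here are assumptions about the NiniCat/MMLU-SR schema:

```python
import datasets

# Hypothetical sketch of process_docs: the column names ("A".."D", "answer")
# are assumptions, not the dataset's actual schema.
def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process(doc):
        return {
            "question": doc["question"].strip(),
            # Gather the four option columns into the list indexed by doc_to_text.
            "choices": [doc["A"], doc["B"], doc["C"], doc["D"]],
            # doc_to_target resolves this index against doc_to_choice.
            "answer": "ABCD".index(doc["answer"]),
        }
    return dataset.map(_process)
```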
"dataset_name": "answer_only_abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_abstract_algebra"
"task_alias": "abstract algebra"
"dataset_name": "answer_only_anatomy"
"description": "The following are multiple choice questions (with answers) about anatomy.\n\
\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_anatomy"
"task_alias": "anatomy"
"dataset_name": "answer_only_astronomy"
"description": "The following are multiple choice questions (with answers) about astronomy.\n\
\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_astronomy"
"task_alias": "astronomy"
"dataset_name": "answer_only_business_ethics"
"description": "The following are multiple choice questions (with answers) about business\
\ ethics.\n\n"
"tag": "mmlusr_answer_only_other_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_business_ethics"
"task_alias": "business ethics"
"dataset_name": "answer_only_clinical_knowledge"
"description": "The following are multiple choice questions (with answers) about clinical\
\ knowledge.\n\n"
"tag": "mmlusr_answer_only_other_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_clinical_knowledge"
"task_alias": "clinical knowledge"
"dataset_name": "answer_only_college_biology"
"description": "The following are multiple choice questions (with answers) about college\
\ biology.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_college_biology"
"task_alias": "college biology"
"dataset_name": "answer_only_college_chemistry"
"description": "The following are multiple choice questions (with answers) about college\
\ chemistry.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_college_chemistry"
"task_alias": "college chemistry"
"dataset_name": "answer_only_college_computer_science"
"description": "The following are multiple choice questions (with answers) about college\
\ computer science.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_college_computer_science"
"task_alias": "college computer science"
"dataset_name": "answer_only_college_mathematics"
"description": "The following are multiple choice questions (with answers) about college\
\ mathematics.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_college_mathematics"
"task_alias": "college mathematics"
"dataset_name": "answer_only_college_medicine"
"description": "The following are multiple choice questions (with answers) about college\
\ medicine.\n\n"
"tag": "mmlusr_answer_only_other_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_college_medicine"
"task_alias": "college medicine"
"dataset_name": "answer_only_college_physics"
"description": "The following are multiple choice questions (with answers) about college\
\ physics.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_college_physics"
"task_alias": "college physics"
"dataset_name": "answer_only_computer_security"
"description": "The following are multiple choice questions (with answers) about computer\
\ security.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_computer_security"
"task_alias": "computer security"
"dataset_name": "answer_only_conceptual_physics"
"description": "The following are multiple choice questions (with answers) about conceptual\
\ physics.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_conceptual_physics"
"task_alias": "conceptual physics"
"dataset_name": "answer_only_econometrics"
"description": "The following are multiple choice questions (with answers) about econometrics.\n\
\n"
"tag": "mmlusr_answer_only_social_sciences_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_econometrics"
"task_alias": "econometrics"
"dataset_name": "answer_only_electrical_engineering"
"description": "The following are multiple choice questions (with answers) about electrical\
\ engineering.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_electrical_engineering"
"task_alias": "electrical engineering"
"dataset_name": "answer_only_elementary_mathematics"
"description": "The following are multiple choice questions (with answers) about elementary\
\ mathematics.\n\n"
"tag": "mmlusr_answer_only_stem_tasks"
"include": "_mmlusr_a_yml"
"task": "mmlusr_answer_only_elementary_mathematics"
"task_alias": "elementary mathematics"