Commit bf11ac93 authored by Baber's avatar Baber

Merge branch 'main' into llama

parents 83b1c564 ade01428
include: mbpp.yaml
task: mbpp_plus
dataset_path: evalplus/mbppplus
dataset_name: null
doc_to_text: "You are an expert Python programmer, and here is your task: {{prompt if prompt is defined else text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]\n"
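The `doc_to_text` field above is a Jinja-style template. As a minimal sketch of how it renders for one hypothetical doc (a plain f-string stands in for Jinja2 here; the example task and tests are made up):

```python
# Sketch: render the mbpp_plus prompt for one hypothetical doc.
# The harness uses Jinja2; a plain format string suffices for illustration.
doc = {
    "prompt": "Write a function to add two numbers.",  # hypothetical example
    "test_list": [
        "assert add(1, 2) == 3",
        "assert add(0, 0) == 0",
        "assert add(-1, 1) == 0",
    ],
}

prompt = (
    "You are an expert Python programmer, and here is your task: "
    f"{doc['prompt']} Your code should pass these tests:\n\n"
    f"{doc['test_list'][0]}\n{doc['test_list'][1]}\n{doc['test_list'][2]}\n[BEGIN]\n"
)
print(prompt)
```

The `{{prompt if prompt is defined else text}}` guard in the YAML falls back to a `text` field when `prompt` is absent, which the sketch above omits for brevity.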
@@ -63,3 +63,6 @@ If other tasks on this dataset are already supported:
### Variant Wishlist
- [ ] zero-shot variant
### Changelog
version 2.0 (21-Feb-2025): added `math_verify` (extraction) metric. For details, see [the Math-Verify blog post](https://huggingface.co/blog/math_verify_leaderboard).
@@ -19,9 +19,12 @@ metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: math_verify
    aggregation: mean
    higher_is_better: true
num_fewshot: 4
metadata:
  version: 2.0
dataset_kwargs:
  trust_remote_code: true
fewshot_config:
import logging
import re
import signal
from importlib.metadata import version
from typing import Dict, List, Optional

import datasets

eval_logger = logging.getLogger(__name__)
try:
    import antlr4
    import sympy
    from math_verify import parse, verify
    from sympy.parsing.latex import parse_latex

    assert version("antlr4-python3-runtime").startswith("4.11")
except (ModuleNotFoundError, AssertionError) as e:
    raise type(e)(
        "`sympy`, `math_verify` and `antlr4-python3-runtime==4.11` are required for generating translation task prompt templates. "
        "Please install the required packages via `pip install lm-eval[math]` or `pip install -e .[math]`"
    ) from e
# taken from
@@ -75,8 +82,13 @@ def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
    else:
        retval = 0

    # math_verify
    res = verify(parse(doc["answer"]), parse(candidates))
    mathval = 1 if res else 0

    results = {
        "exact_match": retval,
        "math_verify": mathval,
    }
    return results
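The scoring above combines strict string matching with mathematical-equivalence checking. A self-contained sketch of the same shape, with a naive normalized string comparison standing in for `math_verify.parse`/`verify` (the stand-in is NOT real math equivalence; it only keeps the example runnable without extra dependencies):

```python
from typing import Dict, List


def toy_verify(gold: str, candidate: str) -> bool:
    # Stand-in for math_verify.verify(parse(gold), parse(candidate)):
    # a naive whitespace/case-insensitive comparison, not symbolic equivalence.
    def norm(s: str) -> str:
        return s.strip().replace(" ", "").lower()

    return norm(gold) == norm(candidate)


def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
    candidate = results[0]
    # exact_match: strict comparison against the gold answer
    retval = 1 if candidate.strip() == doc["answer"].strip() else 0
    # math_verify: equivalence check (toy stand-in here)
    mathval = 1 if toy_verify(doc["answer"], candidate) else 0
    return {"exact_match": retval, "math_verify": mathval}


scores = process_results({"answer": "1/2"}, ["1/2"])
```

In the real task, `verify(parse(...), parse(...))` would also accept answers that are mathematically equal but textually different (e.g. `0.5` vs `1/2`), which is exactly what the stand-in cannot do.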
# mmlu_pro_plus
### Paper
Title: `MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs`
Abstract: `Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between
top-performing models, underscoring the need for more challenging evaluation frameworks.
We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut
learning and higher-order reasoning in LLMs. By incorporating questions with multiple
correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex
reasoning and resist simplistic problem-solving strategies. Our results show that
MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of
model discrimination, particularly in multi-correct answer scenarios.
We introduce novel metrics like shortcut selection ratio and correct pair identification
ratio, offering deeper insights into model behavior and anchoring bias.
Evaluations of six state-of-the-art LLMs reveal significant performance gaps,
highlighting variations in reasoning abilities and bias susceptibility.`
Homepage: https://github.com/asgsaeid/mmlu-pro-plus
### Citation
```bibtex
@article{taghanaki2024mmlu,
title={MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs},
  author={Taghanaki, Saeid Asgari and Khani, Aliasghar and Khasahmadi, Amir},
journal={arXiv preprint arXiv:2409.02257},
year={2024}
}
```
### Groups and Tasks
#### Groups
* `mmlu_pro_plus`: All 14 subjects of the mmlu_pro_plus dataset, evaluated following the methodology in mmlu's original implementation.
#### Tasks
The following tasks evaluate subjects in the mmlu_pro_plus dataset:
- `mmlu_pro_plus_biology`
- `mmlu_pro_plus_business`
- `mmlu_pro_plus_chemistry`
- `mmlu_pro_plus_computer_science`
- `mmlu_pro_plus_economics`
- `mmlu_pro_plus_engineering`
- `mmlu_pro_plus_health`
- `mmlu_pro_plus_history`
- `mmlu_pro_plus_law`
- `mmlu_pro_plus_math`
- `mmlu_pro_plus_other`
- `mmlu_pro_plus_philosophy`
- `mmlu_pro_plus_physics`
- `mmlu_pro_plus_psychology`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
### Changelog
dataset_path: saeidasgari/mmlu-pro-plus
test_split: test
fewshot_split: validation
fewshot_config:
sampler: first_n
doc_to_text: !function utils.fewshot_to_text
doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
- name: "custom-extract"
filter:
- function: "regex"
regex_pattern: 'answer is \(?([ABCDEFGHIJKL])\)?'
# regex_pattern: r".*[aA]nswer:\s*([A-L])",
- function: "take_first"
generation_kwargs:
until:
- "</s>"
- "Q:"
- "<|im_end|>"
do_sample: false
temperature: 0.0
num_fewshot: 5
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
metadata:
version: 1.0
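The `custom-extract` filter above pulls the letter choice out of the model's free-form generation with a regex, then takes the first match. A minimal sketch of that chain (the `"[invalid]"` fallback is an assumption for illustration, not harness behavior):

```python
import re

# Pattern copied from the YAML filter_list: matches "answer is (C)" or
# "answer is C" and captures the letter choice A-L.
PATTERN = re.compile(r"answer is \(?([ABCDEFGHIJKL])\)?")


def custom_extract(generation: str) -> str:
    # "regex" step: find all captured letters; "take_first" step: keep the first.
    matches = PATTERN.findall(generation)
    return matches[0] if matches else "[invalid]"  # fallback is an assumption


letter = custom_extract("Let's think step by step... the answer is (C).")
```

The pattern relies on the prompt's instruction to end with "the answer is (X)"; generations that never produce that phrase fall through to the fallback and score zero under `exact_match`.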
group: mmlu_pro_plus
task:
- mmlu_pro_plus_biology
- mmlu_pro_plus_business
- mmlu_pro_plus_chemistry
- mmlu_pro_plus_computer_science
- mmlu_pro_plus_economics
- mmlu_pro_plus_engineering
- mmlu_pro_plus_health
- mmlu_pro_plus_history
- mmlu_pro_plus_law
- mmlu_pro_plus_math
- mmlu_pro_plus_other
- mmlu_pro_plus_philosophy
- mmlu_pro_plus_physics
- mmlu_pro_plus_psychology
aggregate_metric_list:
- aggregation: mean
metric: exact_match
weight_by_size: true
filter_list: custom-extract
metadata:
version: 1.0
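With `weight_by_size: true`, the group score is a size-weighted mean of the per-subject scores (i.e. micro-averaging over all examples) rather than a plain average of subject means. A sketch with made-up sizes and scores, not real MMLU-Pro+ numbers:

```python
# Sketch of weight_by_size aggregation: pool per-task means weighted by
# task size. Sizes and exact_match values below are illustrative only.
subjects = {
    "mmlu_pro_plus_math":    {"size": 1351, "exact_match": 0.42},
    "mmlu_pro_plus_physics": {"size": 1299, "exact_match": 0.38},
    "mmlu_pro_plus_history": {"size": 381,  "exact_match": 0.55},
}

total = sum(s["size"] for s in subjects.values())
group_score = (
    sum(s["size"] * s["exact_match"] for s in subjects.values()) / total
)
```

An unweighted mean would give the small history split the same influence as the much larger math split; weighting by size keeps the group score equal to accuracy over the pooled examples.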
description: "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_biology"
task_alias: "biology"
process_docs: !function utils.process_biology
description: "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_business"
task_alias: "business"
process_docs: !function utils.process_business
description: "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_chemistry"
task_alias: "chemistry"
process_docs: !function utils.process_chemistry
description: "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_computer_science"
task_alias: "computer_science"
process_docs: !function utils.process_computer_science
description: "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_economics"
task_alias: "economics"
process_docs: !function utils.process_economics
description: "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_engineering"
task_alias: "engineering"
process_docs: !function utils.process_engineering
description: "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_health"
task_alias: "health"
process_docs: !function utils.process_health
description: "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_history"
task_alias: "history"
process_docs: !function utils.process_history
description: "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_law"
task_alias: "law"
process_docs: !function utils.process_law
description: "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_math"
task_alias: "math"
process_docs: !function utils.process_math
description: "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_other"
task_alias: "other"
process_docs: !function utils.process_other
description: "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_philosophy"
task_alias: "philosophy"
process_docs: !function utils.process_philosophy
description: "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_physics"
task_alias: "physics"
process_docs: !function utils.process_physics