Commit bf11ac93 authored by Baber

Merge branch 'main' into llama

parents 83b1c564 ade01428
include: mbpp.yaml
task: mbpp_plus
dataset_path: evalplus/mbppplus
dataset_name: null
doc_to_text: "You are an expert Python programmer, and here is your task: {{prompt if prompt is defined else text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]\n"
@@ -63,3 +63,6 @@ If other tasks on this dataset are already supported:
### Variant Wishlist
- [ ] zero-shot variant
### Changelog
version 2.0 (21-Feb-2025): added the math_verify (extraction) metric. For details, [see this blog post](https://huggingface.co/blog/math_verify_leaderboard).
@@ -19,9 +19,12 @@ metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: math_verify
    aggregation: mean
    higher_is_better: true
num_fewshot: 4
metadata:
  version: 2.0
dataset_kwargs:
  trust_remote_code: true
fewshot_config:
import logging
import re
import signal
from importlib.metadata import version
from typing import Dict, List, Optional
import datasets
eval_logger = logging.getLogger(__name__)
try:
    import antlr4
    import sympy
    from math_verify import parse, verify
    from sympy.parsing.latex import parse_latex

    assert version("antlr4-python3-runtime").startswith("4.11")
except (ModuleNotFoundError, AssertionError) as e:
    raise type(e)(
        "`sympy`, `math_verify` and `antlr4-python3-runtime==4.11` are required for generating translation task prompt templates. "
        "Please install the required packages via pip install lm-eval[math] or pip install -e .[math]"
    ) from e
# taken from
@@ -75,8 +82,13 @@ def process_results(doc: dict, results: List[str]) -> Dict[str, int]:
    else:
        retval = 0

    # math_verify
    res = verify(parse(doc["answer"]), parse(candidates))
    mathval = 1 if res else 0

    results = {
        "exact_match": retval,
        "math_verify": mathval,
    }
    return results
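As a rough illustration of the new metric, the sketch below uses the `parse`/`verify` pair imported above directly; it assumes the behavior described in the Math-Verify blog post linked in the changelog, and the sample strings are made up.

```python
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")         # reference answer, e.g. doc["answer"]
pred = parse("The answer is $0.5$")    # model generation, e.g. candidates

# verify() returns True when the two parsed answers are mathematically
# equivalent; process_results maps that to the 0/1 "math_verify" score.
print(1 if verify(gold, pred) else 0)  # expected: 1
```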
# mmlu_pro_plus
### Paper
Title: `MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs`
Abstract: `Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between
top-performing models, underscoring the need for more challenging evaluation frameworks.
We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut
learning and higher-order reasoning in LLMs. By incorporating questions with multiple
correct answers across diverse domains, MMLU-Pro+ tests LLMs' ability to engage in complex
reasoning and resist simplistic problem-solving strategies. Our results show that
MMLU-Pro+ maintains MMLU-Pro's difficulty while providing a more rigorous test of
model discrimination, particularly in multi-correct answer scenarios.
We introduce novel metrics like shortcut selection ratio and correct pair identification
ratio, offering deeper insights into model behavior and anchoring bias.
Evaluations of six state-of-the-art LLMs reveal significant performance gaps,
highlighting variations in reasoning abilities and bias susceptibility.`
Homepage: https://github.com/asgsaeid/mmlu-pro-plus
### Citation
```bibtex
@article{taghanaki2024mmlu,
title={MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs},
author={Taghanaki, Saeid Asgari and Khani, Aliasgahr and Khasahmadi, Amir},
journal={arXiv preprint arXiv:2409.02257},
year={2024}
}
```
### Groups and Tasks
#### Groups
* `mmlu_pro_plus`: all 14 subjects of the mmlu_pro_plus dataset, evaluated following the methodology of mmlu's original implementation
#### Tasks
The following tasks evaluate subjects in the mmlu_pro_plus dataset:
- `mmlu_pro_plus_biology`
- `mmlu_pro_plus_business`
- `mmlu_pro_plus_chemistry`
- `mmlu_pro_plus_computer_science`
- `mmlu_pro_plus_economics`
- `mmlu_pro_plus_engineering`
- `mmlu_pro_plus_health`
- `mmlu_pro_plus_history`
- `mmlu_pro_plus_law`
- `mmlu_pro_plus_math`
- `mmlu_pro_plus_other`
- `mmlu_pro_plus_philosophy`
- `mmlu_pro_plus_physics`
- `mmlu_pro_plus_psychology`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
### Changelog
dataset_path: saeidasgari/mmlu-pro-plus
test_split: test
fewshot_split: validation
fewshot_config:
  sampler: first_n
  doc_to_text: !function utils.fewshot_to_text
  doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
  - name: "custom-extract"
    filter:
      - function: "regex"
        regex_pattern: 'answer is \(?([ABCDEFGHIJKL])\)?'
        # regex_pattern: r".*[aA]nswer:\s*([A-L])",
      - function: "take_first"
generation_kwargs:
  until:
    - "</s>"
    - "Q:"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
num_fewshot: 5
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
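To make the `custom-extract` filter concrete, here is a small sketch (not part of the config) of what the regex step does to a generation; the sample output string is hypothetical, and `take_first` then keeps the first filtered response when several generations are produced per document.

```python
import re

# Hypothetical model generation, for illustration only.
generation = "Let's think step by step... the answer is (C)."

match = re.search(r"answer is \(?([ABCDEFGHIJKL])\)?", generation)
extracted = match.group(1) if match else "[invalid]"
print(extracted)  # "C" -- compared against the gold letter by exact_match
```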
group: mmlu_pro_plus
task:
  - mmlu_pro_plus_biology
  - mmlu_pro_plus_business
  - mmlu_pro_plus_chemistry
  - mmlu_pro_plus_computer_science
  - mmlu_pro_plus_economics
  - mmlu_pro_plus_engineering
  - mmlu_pro_plus_health
  - mmlu_pro_plus_history
  - mmlu_pro_plus_law
  - mmlu_pro_plus_math
  - mmlu_pro_plus_other
  - mmlu_pro_plus_philosophy
  - mmlu_pro_plus_physics
  - mmlu_pro_plus_psychology
aggregate_metric_list:
  - aggregation: mean
    metric: exact_match
    weight_by_size: true
    filter_list: custom-extract
metadata:
  version: 1.0
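`weight_by_size: true` means the group score is a micro-average: each subtask's exact_match is weighted by its number of documents. A sketch with made-up numbers:

```python
# Illustrative numbers only; real sizes come from the per-subject test splits.
subtask_scores = {"mmlu_pro_plus_biology": 0.62, "mmlu_pro_plus_law": 0.41}
subtask_sizes = {"mmlu_pro_plus_biology": 800, "mmlu_pro_plus_law": 1200}

total = sum(subtask_sizes.values())
group_score = sum(subtask_scores[t] * subtask_sizes[t] for t in subtask_scores) / total
print(round(group_score, 3))  # 0.494
```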
description: "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_biology"
task_alias: "biology"
process_docs: !function utils.process_biology
description: "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_business"
task_alias: "business"
process_docs: !function utils.process_business
description: "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_chemistry"
task_alias: "chemistry"
process_docs: !function utils.process_chemistry
description: "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_computer_science"
task_alias: "computer_science"
process_docs: !function utils.process_computer_science
description: "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_economics"
task_alias: "economics"
process_docs: !function utils.process_economics
description: "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_engineering"
task_alias: "engineering"
process_docs: !function utils.process_engineering
description: "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_health"
task_alias: "health"
process_docs: !function utils.process_health
description: "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_history"
task_alias: "history"
process_docs: !function utils.process_history
description: "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_law"
task_alias: "law"
process_docs: !function utils.process_law
description: "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_math"
task_alias: "math"
process_docs: !function utils.process_math
description: "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_other"
task_alias: "other"
process_docs: !function utils.process_other
description: "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_philosophy"
task_alias: "philosophy"
process_docs: !function utils.process_philosophy
description: "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n"
include: "_default_template_yaml"
task: "mmlu_pro_plus_physics"
task_alias: "physics"
process_docs: !function utils.process_physics
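The `utils.process_*` functions referenced above are not included in this diff. A plausible sketch, assuming they mirror the existing mmlu_pro pattern of filtering the dataset by a `category` field (an assumption), would be:

```python
import datasets


def _filter_by_subject(dataset: datasets.Dataset, subject: str) -> datasets.Dataset:
    # Hypothetical helper: keep only documents belonging to one subject.
    return dataset.filter(lambda doc: doc["category"] == subject)


def process_biology(dataset: datasets.Dataset) -> datasets.Dataset:
    return _filter_by_subject(dataset, "biology")


def process_business(dataset: datasets.Dataset) -> datasets.Dataset:
    return _filter_by_subject(dataset, "business")
```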