# MATH

ℹ️ This is the 4-shot variant!

## Paper

Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of
computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging
competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach
models to generate answer derivations and explanations.

NOTE: The few-shot prompts and the generated-answer extraction are based on [Minerva](https://arxiv.org/abs/2206.14858), and exact-match equivalence is checked using the `sympy` library. This requires additional dependencies, which can be installed via the `lm-eval[math]` extra.
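The idea behind sympy-based equivalence is that two answers count as matching when they simplify to the same expression, not just when the strings are identical. A minimal sketch of this check is below; `is_equiv` is an illustrative helper name, and the harness's actual implementation additionally normalizes LaTeX markup and units before comparing.

```python
from sympy import simplify, sympify


def is_equiv(expr_a: str, expr_b: str) -> bool:
    """Sketch of symbolic exact-match equivalence (hypothetical helper).

    Two expressions are considered equal when their difference
    simplifies to zero; on parse failure, fall back to comparing
    the raw strings.
    """
    try:
        return simplify(sympify(expr_a) - sympify(expr_b)) == 0
    except Exception:
        return expr_a.strip() == expr_b.strip()


print(is_equiv("1/2", "0.5"))         # True: same value, different notation
print(is_equiv("x*(x+1)", "x**2 + x"))  # True: algebraically identical
print(is_equiv("2", "3"))             # False
```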

Homepage: https://github.com/hendrycks/math

## Citation

```
@article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}

@misc{2206.14858,
  author = {Aitor Lewkowycz and Anders Andreassen and David Dohan and Ethan Dyer and Henryk Michalewski and Vinay Ramasesh and Ambrose Slone and Cem Anil and Imanol Schlag and Theo Gutman-Solo and Yuhuai Wu and Behnam Neyshabur and Guy Gur-Ari and Vedant Misra},
  title = {Solving Quantitative Reasoning Problems with Language Models},
  year = {2022},
  eprint = {arXiv:2206.14858},
}
```

### Groups and Tasks

#### Groups

- `minerva_math`

#### Tasks

- `minerva_math_algebra`
- `minerva_math_counting_and_prob`
- `minerva_math_geometry`
- `minerva_math_intermediate_algebra`
- `minerva_math_num_theory`
- `minerva_math_prealgebra`
- `minerva_math_precalc`

### Checklist

The checklist is the following:

For adding novel benchmarks/datasets to the library:

* [x] Is the task an existing benchmark in the literature?
    * [x] Have you referenced the original paper that introduced the task?
    * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the
      reference implementation and documented how to run such a test?
        * The implementation in the original paper first fine-tunes the model on the data. The paper does include a
          few-shot evaluation for GPT-3; however, the few-shot context used here is sourced
          from [Lewkowycz et al.](https://arxiv.org/abs/2206.14858). The accuracy achieved with Llama-2 models is
          comparable to that reported in the paper, though not identical.

If other tasks on this dataset are already supported:

* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

### Variant Wishlist

- [ ] zero-shot variant

### Changelog

- version 2.0 (21-Feb-2025): added the `math_verify` (extraction) metric. For details,
  see [this blog post](https://huggingface.co/blog/math_verify_leaderboard).
- version 3.0 (21-Aug-2025): pass the full solution and model generation to `math_verify`'s `parse`.