# GSM8K

## Paper
Training Verifiers to Solve Math Word Problems
https://arxiv.org/abs/2110.14168

State-of-the-art language models can match human performance on many tasks, but
they still struggle to robustly perform multi-step mathematical reasoning. To
diagnose the failures of current models and support research, we introduce GSM8K,
a dataset of 8.5K high quality linguistically diverse grade school math word problems.
We find that even the largest transformer models fail to achieve high test performance,
despite the conceptual simplicity of this problem distribution.

NOTE: For how to use the dataset's calculator annotations in your language
model's sample/generation function, see the official implementation:
    https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py
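
As a rough illustration of what the calculator annotations look like, the sketch below parses them out of a solution string. It assumes the GSM8K solution format, in which annotations are written as `<<expression=result>>` and the final answer follows a `####` marker. The helper names `extract_annotations` and `extract_final_answer` are hypothetical, and this is not a reproduction of the official `calculator.py`, which applies the annotations during sampling.

```python
import re
from typing import List, Optional, Tuple

# Assumed annotation format: <<48/2=24>> (expression, then '=', then result).
ANNOTATION_RE = re.compile(r"<<([^=<>]+)=([^<>]+)>>")
# Assumed final-answer format: a line such as "#### 72".
FINAL_ANSWER_RE = re.compile(r"####\s*(-?[\d,.]+)")


def extract_annotations(solution: str) -> List[Tuple[str, str]]:
    """Return (expression, result) pairs for each calculator annotation."""
    return ANNOTATION_RE.findall(solution)


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after the '####' marker (commas removed), or None."""
    match = FINAL_ANSWER_RE.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Illustrative solution string written in the dataset's format.
sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

annotations = extract_annotations(sample)
final_answer = extract_final_answer(sample)
print(annotations)
print(final_answer)
```

A parser like this is only useful for inspecting or checking annotations after the fact; making the model use a calculator while it generates requires hooking into the sampling loop, as the official implementation does.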

Homepage: https://github.com/openai/grade-school-math


## Citation
```
@misc{cobbe2021training,
      title={Training Verifiers to Solve Math Word Problems},
      author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
      year={2021},
      eprint={2110.14168},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

### Groups and Tasks

#### Groups

- `math_word_problems`
- `chain_of_thought`
- `self_consistency`

#### Tasks

- `gsm8k_yaml`
- `gsm8k_cot`: GSM8K with Chain-of-Thought
- `gsm8k_cot_self_consistency`: GSM8K with Chain-of-Thought and Self-Consistency
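
For readers unfamiliar with self-consistency, the idea is to sample several chain-of-thought completions for the same problem and keep the most frequent final answer. The sketch below shows only that voting step, under the assumption that final answers have already been extracted as strings (with `None` for completions that produced no answer). The function name `majority_vote` is hypothetical, and this is not necessarily how the `gsm8k_cot_self_consistency` task aggregates answers.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common non-None answer; ties go to the answer seen first."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # Counter preserves insertion order, so the first-seen answer wins a tie.
    return Counter(valid).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled completions.
sampled_answers = ["72", "70", "72", None, "72"]
voted = majority_vote(sampled_answers)
print(voted)
```

Dropping `None` entries before voting means failed extractions never win the vote; whether they should instead count as wrong answers is a design choice left to the evaluator.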

### Checklist

- [x] Is in Eval-harness v1.0?
- [ ] Has been checked for regression from v1.0?
- [ ] Has been checked for equivalence with original paper methodology?
- [ ] "Main" checked variant clearly denoted?

### Variant Wishlist

- [ ] Variant with Calculator (see https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py for example implementation)
- [ ] Using Verifiers
- [ ] Majority voting "without CoT"