Unverified commit 11ac352d authored by Qubitium-ModelCloud, committed by GitHub

Add GSM8K Platinum (#2771)

* add gsm8k platinum

* only test splits

* wrong dataset

* link to blog

* format
parent 1da9e4e8
# GSM8K Platinum
GSM8K Platinum is a revised version of the full test set of GSM8K (Grade School Math 8K), a dataset of grade school math word problems. To revise this dataset, we ran a variety of frontier models on each individual example and manually re-annotated any example for which at least one model made an error. We revised the labels of mislabeled examples and removed any question that we determined to be poorly written (most often due to ambiguity in the problem statement). See our paper for further details on the revision process and our criteria for "bad" questions.
## Paper
Do Large Language Model Benchmarks Test Reliability?
https://arxiv.org/abs/2502.03461
NOTE: See the official implementation of the task:
https://github.com/MadryLab/platinum-benchmarks/
for how to make use of the dataset's calculator annotations in your language
model's sample/generation function.
Blog: https://gradientscience.org/gsm8k-platinum/
Homepage: http://platinum-bench.csail.mit.edu/
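The dataset's `answer` field follows the GSM8K convention of ending with a `#### <number>` marker, and the task configs below recover the target with `answer.split('####')[-1].strip()`. A minimal sketch of that extraction (the sample solution string here is illustrative, not taken from the dataset):

```python
def extract_target(answer: str) -> str:
    """Recover the final numeric answer from a GSM8K-style solution string."""
    # GSM8K solutions end with a line of the form "#### <number>".
    return answer.split("####")[-1].strip()

sample = "In total they had 32 + 42 = 74. After eating 35: 74 - 35 = 39.\n#### 39"
print(extract_target(sample))  # -> 39
```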
## Citation
```
@misc{vendrow2025largelanguagemodelbenchmarks,
title={Do Large Language Model Benchmarks Test Reliability?},
author={Joshua Vendrow and Edward Vendrow and Sara Beery and Aleksander Madry},
year={2025},
eprint={2502.03461},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.03461},
}
@misc{cobbe2021training,
title={Training Verifiers to Solve Math Word Problems},
author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
year={2021},
eprint={2110.14168},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
### Groups and Tasks
#### Groups
- `math_word_problems`
- `chain_of_thought`
- `self_consistency`
#### Tasks
- `gsm8k_platinum`
- `gsm8k_platinum_cot`: GSM8K Platinum with Chain-of-Thought
- `gsm8k_platinum_cot_self_consistency`: GSM8K Platinum with Chain-of-Thought and Self-Consistency
- `gsm8k_platinum_cot_llama`: GSM8K Platinum with prompt formatting modified to conform to the evaluation settings described by Meta here: https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals/viewer/Meta-Llama-3.1-8B-Instruct-evals__gsm8k__details?row=0
- Use this task with `--fewshot_as_multiturn` and `--apply_chat_template` to replicate Meta's reported performance.
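The `strict-match` filter in the `gsm8k_platinum_cot_llama` config below selects the last occurrence of the answer pattern in the completion (`group_select: -1`). A rough Python equivalent for illustration only (the function name is mine, not a harness API):

```python
import re

# Same pattern as the strict-match filter of gsm8k_platinum_cot_llama.
PATTERN = re.compile(r"The final answer is ((-?[$0-9.,]{2,})|(-?[0-9]+))")

def strict_match_last(completion: str):
    """Return the last 'The final answer is ...' value, or None."""
    matches = PATTERN.findall(completion)
    if not matches:
        return None
    # findall yields one tuple per match; index 0 is the outer capture group.
    return matches[-1][0]

print(strict_match_last("So 21 - 15 = 6. The final answer is 6"))  # -> 6
```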
task: gsm8k_platinum_cot_llama
dataset_name: main
dataset_path: madrylab/gsm8k-platinum
doc_to_target: '{{answer.split(''####'')[-1].strip() if answer is defined else target}}'
doc_to_text: "Given the following problem, reason and give a final answer to the problem.\nProblem: {{question}}\nYour response should end with \"The final answer is [answer]\" where [answer] is the response to the problem.\n"
fewshot_config:
sampler: first_n
samples:
- question: There are 15 trees in the grove. Grove workers will plant trees in the
grove today. After they are done, there will be 21 trees. How many trees did
the grove workers plant today?
target: There are 15 trees originally. Then there were 21 trees after some more
were planted. So there must have been 21 - 15 = 6. The final answer is 6
- question: If there are 3 cars in the parking lot and 2 more cars arrive, how many
cars are in the parking lot?
target: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The final answer
is 5
- question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many
pieces do they have left in total?
target: Originally, Leah had 32 chocolates. Her sister had 42. So in total they
had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The final answer is 39
- question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12
lollipops. How many lollipops did Jason give to Denny?
target: Jason started with 20 lollipops. Then he had 12 after giving some to Denny.
So he gave Denny 20 - 12 = 8. The final answer is 8
- question: Shawn has five toys. For Christmas, he got two toys each from his mom and
dad. How many toys does he have now?
target: Shawn started with 5 toys. If he got 2 toys each from his mom and dad,
then that is 4 more toys. 5 + 4 = 9. The final answer is 9
- question: There were nine computers in the server room. Five more computers were
installed each day, from monday to thursday. How many computers are now in the
server room?
target: There were originally 9 computers. For each of 4 days, 5 more computers
were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The final answer is
29
- question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday,
he lost 2 more. How many golf balls did he have at the end of wednesday?
target: Michael started with 58 golf balls. After losing 23 on tuesday, he had
58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The final answer
is 33
- question: Olivia has $23. She bought five bagels for $3 each. How much money does
she have left?
target: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15
dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The final answer is 8
filter_list:
- filter:
- function: regex
group_select: -1
regex_pattern: The final answer is ((-?[$0-9.,]{2,})|(-?[0-9]+))
- function: take_first
name: strict-match
- filter:
- function: regex
group_select: -1
regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+)
- function: take_first
name: flexible-extract
generation_kwargs:
do_sample: false
until:
- '<|eot_id|>'
- '<|start_header_id|>user<|end_header_id|>'
- 'Q:'
- </s>
- <|im_end|>
tag:
- chain_of_thought
metadata:
version: 3.0
metric_list:
- aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
metric: exact_match
regexes_to_ignore:
- ','
- \$
- '(?s).*#### '
- \.$
num_fewshot: 8
output_type: generate_until
repeats: 1
test_split: test
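The self-consistency variant below samples each question `repeats: 64` times at temperature 0.2 and aggregates by majority vote (`maj@64`, plus `maj@8` over the first 8 samples). A sketch of that maj@k aggregation over already-extracted answer strings (the helper name is my own, not the harness's `majority_vote` filter implementation):

```python
from collections import Counter

def majority_answer(answers):
    """Return the most frequent extracted answer among k samples (maj@k)."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    # Ties resolve to the earliest-seen answer, per Counter ordering.
    return counts.most_common(1)[0][0]

print(majority_answer(["72", "72", "68", None, "72"]))  # -> 72
```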
include: gsm8k-platinum-cot.yaml
tag:
- chain_of_thought
- self_consistency
task: gsm8k_platinum_cot_self_consistency
generation_kwargs:
until:
- "Q:"
- "\n\n"
do_sample: true
temperature: 0.2
repeats: 64
filter_list:
- name: "score-first" # pick only the first response, and report metrics on that
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "take_first"
- name: "maj@64"
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "majority_vote"
- function: "take_first"
- name: "maj@8" # get Maj@8 , via selecting the first 8 responses. Using a better estimator would be optimal.
filter:
- function: "take_first_k"
k: 8
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
- function: "majority_vote"
- function: "take_first"
metadata:
version: 2.0
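Before the `exact_match` comparison, these configs strip the patterns listed under `regexes_to_ignore`: commas, dollar signs, everything up through a `#### ` marker, and a trailing period. A sketch of that normalization, assuming the patterns are applied in list order:

```python
import re

# Mirrors regexes_to_ignore in the task configs.
IGNORE = [r",", r"\$", r"(?s).*#### ", r"\.$"]

def normalize(text: str) -> str:
    """Strip ignored patterns before exact-match comparison."""
    for pattern in IGNORE:
        text = re.sub(pattern, "", text)
    return text

print(normalize("#### $1,234."))  # -> 1234
```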
tag:
- math_word_problems
task: gsm8k_platinum_cot_zeroshot
dataset_path: madrylab/gsm8k-platinum
dataset_name: main
output_type: generate_until
training_split: test
fewshot_split: test
test_split: test
doc_to_text: "Q: {{question}}\nA: Let's think step by step."
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
- "(?s).*#### "
- "\\.$"
generation_kwargs:
until:
- "Q:"
- "</s>"
- "<|im_end|>"
do_sample: false
repeats: 1
num_fewshot: 0
filter_list:
- name: "strict-match"
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
- function: "take_first"
- name: "flexible-extract"
filter:
- function: "regex"
group_select: -1
regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
- function: "take_first"
metadata:
version: 3.0
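The `flexible-extract` filter shared by these tasks grabs the last number-like span anywhere in the completion (`group_select: -1`), as a fallback when the model never emits the expected answer sentence. A rough equivalent, for illustration only:

```python
import re

# Same pattern as the flexible-extract filter; two alternatives, so
# findall yields a 2-tuple per match with one group empty.
FLEX = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")

def flexible_extract(completion: str):
    """Return the last number-like span in the completion, or None."""
    matches = FLEX.findall(completion)
    if not matches:
        return None
    last = matches[-1]
    return last[0] or last[1]

print(flexible_extract("58 - 23 = 35, then 35 - 2 = 33"))  # -> 33
```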
task: gsm8k_platinum_cot
dataset_name: main
dataset_path: madrylab/gsm8k-platinum
doc_to_target: '{{answer.split(''####'')[-1].strip() if answer is defined else target}}'
doc_to_text: 'Q: {{question}}
A:'
fewshot_config:
sampler: first_n
samples:
- question: There are 15 trees in the grove. Grove workers will plant trees in the
grove today. After they are done, there will be 21 trees. How many trees did
the grove workers plant today?
target: There are 15 trees originally. Then there were 21 trees after some more
were planted. So there must have been 21 - 15 = 6. The answer is 6.
- question: If there are 3 cars in the parking lot and 2 more cars arrive, how many
cars are in the parking lot?
target: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer
is 5.
- question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many
pieces do they have left in total?
target: Originally, Leah had 32 chocolates. Her sister had 42. So in total they
had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.
- question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12
lollipops. How many lollipops did Jason give to Denny?
target: Jason started with 20 lollipops. Then he had 12 after giving some to Denny.
So he gave Denny 20 - 12 = 8. The answer is 8.
- question: Shawn has five toys. For Christmas, he got two toys each from his mom and
dad. How many toys does he have now?
target: Shawn started with 5 toys. If he got 2 toys each from his mom and dad,
then that is 4 more toys. 5 + 4 = 9. The answer is 9.
- question: There were nine computers in the server room. Five more computers were
installed each day, from monday to thursday. How many computers are now in the
server room?
target: There were originally 9 computers. For each of 4 days, 5 more computers
were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is
29.
- question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday,
he lost 2 more. How many golf balls did he have at the end of wednesday?
target: Michael started with 58 golf balls. After losing 23 on tuesday, he had
58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer
is 33.
- question: Olivia has $23. She bought five bagels for $3 each. How much money does
she have left?
target: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15
dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.
filter_list:
- filter:
- function: regex
regex_pattern: The answer is (\-?[0-9\.\,]+).
- function: take_first
name: strict-match
- filter:
- function: regex
group_select: -1
regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+)
- function: take_first
name: flexible-extract
generation_kwargs:
do_sample: false
until:
- 'Q:'
- </s>
- <|im_end|>
tag:
- chain_of_thought
metadata:
version: 3.0
metric_list:
- aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
metric: exact_match
regexes_to_ignore:
- ','
- \$
- '(?s).*#### '
- \.$
num_fewshot: 8
output_type: generate_until
repeats: 1
test_split: test
tag:
- math_word_problems
task: gsm8k_platinum
dataset_path: madrylab/gsm8k-platinum
dataset_name: main
output_type: generate_until
training_split: test
fewshot_split: test
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
- "(?s).*#### "
- "\\.$"
generation_kwargs:
until:
- "Question:"
- "</s>"
- "<|im_end|>"
do_sample: false
temperature: 0.0
repeats: 1
num_fewshot: 5
filter_list:
- name: "strict-match"
filter:
- function: "regex"
regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
- function: "take_first"
- name: "flexible-extract"
filter:
- function: "regex"
group_select: -1
regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
- function: "take_first"
metadata:
version: 3.0
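In the base `gsm8k_platinum` task above, `strict-match` expects the model to reproduce the GSM8K `#### <number>` final-answer line from the few-shot examples. A minimal sketch of that extraction (taking the first match, as the `take_first` filter does; the sample completion is illustrative):

```python
import re

# Same pattern as the strict-match filter of the base gsm8k_platinum task.
STRICT = re.compile(r"#### (\-?[0-9\.\,]+)")

def strict_extract(completion: str):
    """Return the number after the first '#### ' marker, or None."""
    m = STRICT.search(completion)
    return m.group(1) if m else None

print(strict_extract("5 x 3 = 15. 23 - 15 = 8.\n#### 8"))  # -> 8
```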