For this, we're using the GSM8K dataset, which has been a popular choice in the community for experimenting with
R1-style learning. Using the provided `stats.py`, we first analyze the dataset to determine the maximum tokenized length
of the system prompt + question. We find that for the Qwen 2.5 tokenizer, the maximum combined input length is 263
tokens. For the Llama 3.x series, it is 266 tokens. Combining this with the constraint that we only want the model to
generate up to 512 tokens, we can set our sequence length to ~800.
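
As a rough illustration of the sizing logic above, the sketch below measures the longest prompt with any tokenizer callable and then adds the generation budget. It is not the contents of `stats.py`: the helper names, the simple string concatenation (a real run would apply the model's chat template), the whitespace tokenizer used in the toy call, and rounding up to the next multiple of 100 are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def max_prompt_tokens(
    tokenize: Callable[[str], List],
    system_prompt: str,
    questions: Iterable[str],
) -> int:
    """Longest tokenized length of system prompt + question over all questions."""
    return max(len(tokenize(system_prompt + "\n" + q)) for q in questions)


def sequence_budget(max_prompt: int, max_completion: int, round_to: int = 100) -> int:
    """Prompt + completion budget, rounded up to a multiple of `round_to` for headroom."""
    total = max_prompt + max_completion
    return -(-total // round_to) * round_to  # ceiling division


# Toy call with a whitespace "tokenizer" standing in for a real one.
toy_max = max_prompt_tokens(
    str.split,
    "Answer step by step.",
    ["What is 2 + 2?", "How many apples are left in the basket?"],
)

# Using the figures reported above: 266 prompt tokens (Llama 3.x) + 512 generated tokens.
sequence_len = sequence_budget(266, 512)
print(toy_max, sequence_len)
```

With the reported numbers, the worst case is 266 + 512 = 778 tokens, which is where the ~800 figure comes from.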
## Reward functions
There has already been quite a lot of experimentation in the AI/ML community with the GSM8K dataset, so we're going to
use some existing reward functions by
[@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) that have shown promise. We've corrected
some of the logic in the reward functions, such as the loose formatting function.

1. `int_reward_func` - As all GSM8K answers should be integers, this provides a small reward to push the model to respond only with integers.
2. `correctness_reward_func` - This checks that the answer within the `<answer></answer>` tags exactly matches the label.
3. `strict_format_reward_func` - This validates that the structure of the response exactly matches the prompt, inclusive of newlines.
4. `soft_format_reward_func` - This also checks the structure, but doesn't penalize the model if it doesn't match all the newlines exactly.
5. `xmlcount_reward_func` - This rewards the model for having exactly one of each of the XML tags in the structured response.
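
To make the descriptions above concrete, here is a minimal sketch of three of these checks. It is not the code in `gsm8k_grpo.py`: it works on plain completion strings rather than the chat-message structures a trainer would pass in, the function names are shortened, and the `<reasoning>` tag name and the 0.5 reward values are assumptions for illustration. Only the 2.0 correctness reward comes from this write-up.

```python
import re
from typing import List


def extract_answer(text: str) -> str:
    """Return the text inside the last <answer>...</answer> pair, stripped."""
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def int_reward(completions: List[str]) -> List[float]:
    """Small reward when the extracted answer consists only of digits."""
    return [0.5 if extract_answer(c).isdigit() else 0.0 for c in completions]


def correctness_reward(completions: List[str], labels: List[str]) -> List[float]:
    """Full reward only when the extracted answer exactly equals the label."""
    return [2.0 if extract_answer(c) == y else 0.0 for c, y in zip(completions, labels)]


# Loose format check: both tag pairs in order, any whitespace between them.
SOFT_FORMAT = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)


def soft_format_reward(completions: List[str]) -> List[float]:
    """Reward the overall structure without requiring exact newlines."""
    return [0.5 if SOFT_FORMAT.search(c) else 0.0 for c in completions]


good = "<reasoning>\n3 + 4 = 7\n</reasoning>\n<answer>\n7\n</answer>\n"
bad = "The answer is seven"
print(int_reward([good, bad]))
print(correctness_reward([good, bad], ["7", "7"]))
print(soft_format_reward([good, bad]))
```

Keeping each check as a separate function means the trainer can log each reward on its own, which is what makes it possible to see the loose format being learned before the strict one.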
## GPU choice
With Qwen 1.5B, you'll need two GPUs: one for training and the other for inference with vLLM. For the GSM8K dataset with
16x rollouts, these should fit on GPUs with ~48GB VRAM, so we chose 2xL40S. While we could have chosen 8x rollouts
in order to fit both the trainer and the vLLM model on the same GPU, we opted to stick with 16x rollouts, similar to the
original paper.
## Training
For inference, TRL loads vLLM onto the last GPU, so when we start the trainer, we need to tell accelerate to use `N - 1`
GPUs or processes for training, e.g. `--num-processes=1` in our case.

Putting it all together, we have the `gsm8k_grpo.py` module providing the reward functions, and we've configured the
hyperparameters in `gsm8k.yaml`. To train the model, we simply need to execute:
```bash
axolotl train --num-processes=1 gsm8k.yaml
```
## Results
You can find the training metrics here: https://wandb.ai/axolotl-ai/gsm8k-grpo-r1

The training takes about 6 hours to complete at a commodity GPU cost of ~\$12-\$24, depending on your compute provider.
We can see from the metrics that the model quickly learns the loose formatting reward, but doesn't quite nail down the
strict formatting with newlines. The correctness reward hovers around 1.4, which corresponds to about 70% accuracy, as
the max reward for correctness is 2.0.
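
The 70% figure follows from the reward scale: assuming a correct answer earns 2.0 and an incorrect one earns 0, the mean correctness reward is 2.0 times the accuracy. The short sketch below works through that, along with the per-GPU-hour rate implied by the quoted cost. The rate is derived, not quoted from any provider, and assumes both GPUs are billed for the full 6 hours.

```python
MAX_CORRECTNESS_REWARD = 2.0   # reward for a correct answer (from the text above)
mean_correctness_reward = 1.4  # approximate plateau seen in the training metrics

# With a 0-or-2.0 reward, mean reward = 2.0 * accuracy.
accuracy = mean_correctness_reward / MAX_CORRECTNESS_REWARD
print(f"approx. accuracy: {accuracy:.0%}")

hours, num_gpus = 6, 2         # ~6 hours on 2xL40S
gpu_hours = hours * num_gpus

# Hourly rate per GPU implied by the ~$12-$24 total cost range.
implied_rates = [total_cost / gpu_hours for total_cost in (12, 24)]
for total_cost, rate in zip((12, 24), implied_rates):
    print(f"${total_cost} total -> ${rate:.2f} per GPU-hour")
```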
## Training with Modal
But what if you don't have two GPUs with 48GB of memory sitting around ready to run GRPO training?
`axolotl` makes it easy to run training jobs on GPUs in the cloud via the [`--cloud` flag](https://axolotl-ai-cloud.github.io/axolotl/docs/cli.html#remote-compute-with-modal-cloud). Using the [Modal](https://modal.com) serverless platform, this flag will handle the entire process of cloud deployment (and tear-down) for your training run, according to a provided [`cloud.yaml` file](https://axolotl-ai-cloud.github.io/axolotl/docs/cli.html#cloud-configuration-options). The GPUs will be automatically deallocated when training finishes and you will only be charged for the time your code spends running on GPUs. See [Modal's pricing page](https://modal.com/pricing) for rates.
If you have a Modal account, you can go straight to installing the CLI tools. Otherwise, you will need to [sign up
for Modal](https://www.modal.com?utm_source=github&utm_medium=github&utm_campaign=axolotl). Modal provides \$30 each
month in free credits.
### CLI
First, install the `modal` and `axolotl` CLIs locally with:
This cookbook accompanies our [Training Process Reward Models in axolotl](https://axolotlai.substack.com/p/process-reward-models) blog post and provides details on reproducing the evaluation results for [axolotl-ai-co/Qwen2.5-Math-PRM-7B](https://huggingface.co/axolotl-ai-co/Qwen2.5-Math-PRM-7B).

Example outputs can be seen in `bon_qwen1.5B-instruct_n=16_results.txt` and `bon_qwen1.5B-instruct_n=1_results.txt` for `n=16` and `n=1`, respectively.
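
For readers unfamiliar with best-of-N (BoN) selection, the idea is to sample `n` candidate solutions, score each with the reward model, and keep the highest-scoring one. The sketch below is a hedged illustration, not the evaluation script itself: the `best_of_n` helper, the use of the minimum per-step score as the aggregate, and the toy scores are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def best_of_n(
    candidates: Sequence[str],
    step_scores: Sequence[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> str:
    """Return the candidate whose aggregated per-step reward is highest.

    candidates[i] is one sampled solution; step_scores[i] holds the reward
    model's score for each reasoning step of that solution.
    """
    totals: List[float] = [aggregate(scores) for scores in step_scores]
    best_index = max(range(len(candidates)), key=totals.__getitem__)
    return candidates[best_index]


# Toy example: three sampled answers with made-up per-step scores.
answers = ["42", "126", "42"]
scores = [[0.9, 0.8, 0.95], [0.9, 0.2, 0.3], [0.7, 0.6, 0.9]]
print(best_of_n(answers, scores))

# With n=1 there is nothing to choose between, so the single sample is returned.
print(best_of_n(["126"], [[0.9, 0.2]]))
```

Aggregating with the minimum penalizes any single weak step; other choices, such as the last step's score or the product of step scores, slot in through the `aggregate` argument.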
```python
# example problems from the MATH-500 dataset https://huggingface.co/datasets/HuggingFaceH4/MATH-500
x = {
    "problem": [
        r"Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$",
        r"A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?",
        r"The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).",
    ],
}
```
Problem: Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$
Predicted answer (BoN): To solve the given problem, we need to express the double sum \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\).
First, let's rewrite the double sum by changing the order of summation. We start by summing over \(m = j + k\). For a fixed \(m\), \(j\) and \(k\) can take any values from 1 to \(\infty\) such that \(j + k = m\).
The number of pairs \((j, k)\) that satisfy \(j + k = m\) is \(m-1\). Therefore, we can express the sum as:
Problem: A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?
Predicted answer (BoN): ## Step 1:**
First, let's understand that a regular hexagon can be divided into six equilateral triangles. The perimeter of one of these triangles is given as 21 inches. Since all sides of an equilateral triangle are equal, each side of the triangle must be 21 inches divided by 3, as it's equal to the perimeter divided by 3.
**Calculation for one side of the triangle:**
\[ \text{Side length of the triangle} = \frac{21}{3} = 7 \text{ inches} \]
## Step 2:**
Now, since the hexagon is made up of these equilateral triangles, each side of the hexagon will be equal to the side length of one of these triangles. Therefore, the perimeter of the hexagon, which is the sum of all its sides, will be six times the side length of one of the triangles.
**Calculation for the perimeter of the hexagon:**
\[ \text{Perimeter of the hexagon} = 6 \times 7 \]
## Step 3:**
Perform the multiplication to find the final perimeter.
\[ \text{Perimeter of the hexagon} = 6 \times 7 = 42 \text{ inches} \]
Problem: The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).
Predicted answer (BoN): To solve the problem of determining how many different values can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses, we need to consider the different ways we can group the terms using parentheses. Since multiplication is carried out before addition, we need to consider all possible ways we can group the numbers \(2\), \(3\), \(4\), and \(5\) using parentheses.
First, let's list all possible combinations of parentheses groups. Each group will have a different number of terms inside them. The possible groups are:
1. \(2, 3, 4, 5\)
2. \(2, 3 \cdot 4, 5\)
3. \(2, 3 \cdot (4 \cdot 5)\)
4. \(2 \cdot 3, 4, 5\)
5. \(2 \cdot 3 \cdot 4, 5\)
6. \(2 \cdot 3 \cdot (4 \cdot 5)\)
7. \(2, (3 \cdot 4) \cdot (5 + 1)\)
8. \(2 \cdot (3 \cdot 4), 5 + 1\)
9. \(2 \cdot (3 \cdot (4 \cdot 5))\)
10. \(2 \cdot (3 \cdot (4 \cdot (5 + 1)))\)
Now, let's evaluate each of these groups:
1. \(2, 3, 4, 5\) : This evaluates to \(2 \cdot 3 \cdot 4 \cdot 5 + 1 = 121\).
From the above evaluations, we see that the distinct values obtained are 121, 145, and 30. Therefore, the number of different values that can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses is \(\boxed{3}\).
Problem: Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$
Predicted answer (BoN): To solve the given problem, we need to express the double sum \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\).
First, let's analyze the inner sum \(\sum_{k=1}^\infty \frac{1}{(j+k)^3}\). This sum can be written as:
\[
\sum_{k=1}^\infty \frac{1}{(j+k)^3}
\]
We can change the order of summation. Instead of summing over \(k\) first, we sum over \(j\) first. For a fixed \(j\), \(k\) ranges from 1 to \(\infty\). When \(j\) is fixed, \(k\) can be written as \(j + (n-j)\) where \(n\) is an integer. So, the sum becomes:
Problem: A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?
Predicted answer (BoN): ## Step 1: ##
To solve this problem, we need to understand the relationship between the side length of the equilateral triangle and the side length of the regular hexagon it's divided into.
## Step 2: ##
Given that the perimeter of one equilateral triangle is 21 inches, we know that the perimeter of a regular hexagon composed of six such equilateral triangles is simply 6 times the perimeter of one triangle because all sides of the hexagon are equal in length and they come from equilateral triangles.
Therefore, the perimeter of the hexagon is 6 * 21.
## Step 3: ##
Calculating the perimeter of the hexagon, we get 6 * 21 = 126 inches.
## Step 4: ##
Thus, the final answer is 126 inches.
Therefore, the perimeter of the regular hexagon is $\boxed{126}$.
Problem: The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).
Predicted answer (BoN): To solve the problem of how many values can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses, we need to consider all possible ways to insert parentheses and evaluate the expression for each case.
Given the expression \( (a \cdot (b \cdot (c \cdot d) + e)) + f \), we need to evaluate it for all possible placements of parentheses.
By evaluating the expression for all possible placements of parentheses, we find that there are \( 14 \) distinct values that can be obtained. Therefore, the final answer is: