v1.0

Commit 4d4d8f59 authored Jun 04, 2025 by chenzk
20 changed files
--- a/axolotl-cookbook/examples/talk_like_a_pirate/pirate.yaml
+++ b/axolotl-cookbook/examples/talk_like_a_pirate/pirate.yaml
+base_model: NousResearch/Meta-Llama-3.1-8B
+load_in_4bit: true
+strict: false
+chat_template: llama3
+datasets:
+  - path: winglian/pirate-ultrachat-10k
+    type: chat_template
+    message_field_role: role
+    message_field_content: content
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.005
+output_dir: ./outputs/lora-out
+sequence_len: 4096
+sample_packing: true
+eval_sample_packing: false
+pad_to_sequence_len: true
+adapter: qlora
+lora_r: 64
+lora_alpha: 32
+lora_dropout: 0.05
+lora_target_linear: true
+lora_modules_to_save:
+  - embed_tokens
+  - lm_head
+peft_use_dora: true
+wandb_project: pirate-ultrachat-llama31
+wandb_entity: axolotl-ai
+gradient_accumulation_steps: 2
+micro_batch_size: 2
+num_epochs: 2
+optimizer: adamw_bnb_8bit
+lr_scheduler: cosine
+learning_rate: 0.0002
+train_on_inputs: false
+bf16: true
+tf32: true
+gradient_checkpointing: true
+logging_steps: 1
+flash_attention: true
+warmup_ratio: 0.1
+evals_per_epoch: 1
+saves_per_epoch: 1
+weight_decay: 0.0
+deepspeed: deepspeed_configs/zero2.json
+special_tokens:
+  pad_token: "<|finetune_right_pad_id|>"
\ No newline at end of file
--- a/axolotl-cookbook/grpo/README.md
+++ b/axolotl-cookbook/grpo/README.md
+# Training an R1-style Large Language Model using GRPO w/ Axolotl
+In this example, we will demonstrate how to train Qwen 2.5 1.5B Instruct to reason using R1-style post-training with 
+the GRPO technique.
+## Installation
+To get started you'll need to install vllm==0.7.2 and both trl and axolotl from source as GRPO support is still in flux.
+```bash
+pip install --no-build-isolation "axolotl[vllm] @ git+https://github.com/axolotl-ai-cloud/axolotl.git@grpo"
+pip uninstall -y trl
+pip install --no-deps "trl @ git+https://github.com/huggingface/trl.git@main"
+```
+## Knowing your dataset
+For this, we're using the GSM8K dataset, which has been a popular choice in the community for experimenting with
+R1-style learning. Using the provided `stats.py`, we first analyze the dataset to determine the maximum tokenized length
+of the system prompt + question. We find that for the Qwen 2.5 tokenizer, the maximum combined input length is 263.
+For the Llama 3.x series, this is 266. By combining this with the constraint that we only want the model to generate up to
+512 tokens (at most 266 + 512 = 778 tokens in total), we can set our sequence length to ~800.
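The length budget described above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper name `sequence_budget` and the choice to round up to the next multiple of 100 are assumptions made for illustration, while the token counts are the ones reported above.

```python
def sequence_budget(max_prompt_tokens: int, max_completion_tokens: int, round_to: int = 100) -> int:
    """Return prompt + completion length, rounded up to a multiple of `round_to`."""
    total = max_prompt_tokens + max_completion_tokens
    # Ceiling division without floats; the rounding granularity is an assumption.
    return -(-total // round_to) * round_to


qwen_budget = sequence_budget(263, 512)   # Qwen 2.5 tokenizer
llama_budget = sequence_budget(266, 512)  # Llama 3.x tokenizer
print(qwen_budget, llama_budget)
```

Both tokenizers land on the same budget, which is why a single `sequence_len` can serve either model family.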
+## Reward functions
+There has already been quite a lot of experimentation in the AI/ML community with the GSM8K dataset, so we're going to
+use some existing reward functions that have shown promise by 
+[@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb). We've corrected some of the logic in the
+reward functions such as the loose formatting function.
+1. `int_reward_func` - As all GSM8K answers should be integers, this provides a small reward to push the model to respond only with integers.
+2. `correctness_reward_func` - This checks that the answer within the `<answer></answer>` tags exactly matches the label.
+3. `strict_format_reward_func` - This validates that the structure of the response exactly matches the format given in the prompt, inclusive of newlines.
+4. `soft_format_reward_func` - This also checks the structure, but doesn't penalize the model if it doesn't match all the newlines exactly.
+5. `xmlcount_reward_func` - This rewards the model for having exactly one of each of the XML tags in the structured response.
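To make the reward scale concrete, here is a simplified sketch of how the five rewards listed above add up for a single completion. It uses the reward values and regular expressions from `gsm8k_grpo.py` in this commit, but collapses them into one hypothetical `score` helper that takes a plain string, and it omits the small trailing-text penalty in the tag-count reward. The sample completion and label are made up for illustration.

```python
import re

# Hypothetical well-formed completion for a problem whose label is "4".
completion = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"
label = "4"


def extract_answer(text: str) -> str:
    # Text between the last <answer> tag and the following </answer> tag.
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def score(text: str, label: str) -> dict[str, float]:
    answer = extract_answer(text)
    strict = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s?$"
    soft = r"<reasoning>.*?</reasoning>\s?<answer>.*?</answer>"
    tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
    return {
        "correctness": 2.0 if answer == label else 0.0,
        "int": 0.5 if answer.isdigit() else 0.0,
        "strict_format": 0.5 if re.match(strict, text) else 0.0,
        "soft_format": 0.25 if re.match(soft, text, re.S) else 0.0,
        # 0.125 per tag that appears exactly once (trailing-text penalty omitted).
        "xmlcount": sum(0.125 for tag in tags if text.count(tag) == 1),
    }


rewards = score(completion, label)
total = sum(rewards.values())
print(rewards, total)
```

Correctness dominates the total, so the format rewards mainly act as shaping signals early in training.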
+## GPU choice
+With Qwen 1.5B, you'll need 2 GPUs: one for training and the other for inference with vLLM. For the GSM8K dataset with
+16x rollouts, these should fit on GPUs with ~48GB VRAM, so we chose 2xL40S. While we could have chosen to do 8x rollouts
+in order to fit both the trainer and vLLM models on the same GPU, we opted to stick with 16x rollouts, similar to the
+original paper.
+## Training
+For inference, TRL loads vLLM onto the last GPU, so when we start the trainer, we need to tell accelerate to use `N - 1`
+GPUs or processes for training, e.g. `--num-processes=1` in our case.
+Putting it all together, we have the `gsm8k_grpo.py` module providing the reward functions, and we've configured the
+hyperparameters in `gsm8k.yaml`. To train the model, we simply need to execute:
+```bash
+axolotl train --num-processes=1 gsm8k.yaml
+```
+## Results
+You can find the training metrics here: https://wandb.ai/axolotl-ai/gsm8k-grpo-r1
+![wandb.png](assets/wandb.png)
+The training takes about 6 hours to complete at a commodity GPU cost of ~\$12-\$24 depending on your compute provider.
+We can see from the metrics that the model quickly learns the loose formatting reward, but doesn't quite nail down the
+strict formatting with newlines. The correctness reward hovers around 1.4, which corresponds to about 70% accuracy since the max
+reward for correctness is 2.0.
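The figures in this paragraph can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the per-GPU-hour rate is merely what the quoted total cost implies for 2 GPUs over 6 hours, not a price quoted by any provider.

```python
MAX_CORRECTNESS_REWARD = 2.0  # reward given for an exactly correct answer


def implied_accuracy(mean_correctness_reward: float) -> float:
    """Fraction of completions that earned the full correctness reward."""
    return mean_correctness_reward / MAX_CORRECTNESS_REWARD


def implied_gpu_hour_rate(total_cost: float, hours: float, num_gpus: int) -> float:
    """Per-GPU hourly rate implied by a total cost."""
    return total_cost / (hours * num_gpus)


accuracy = implied_accuracy(1.4)
low_rate = implied_gpu_hour_rate(12, 6, 2)
high_rate = implied_gpu_hour_rate(24, 6, 2)
print(f"accuracy={accuracy:.0%}, rate=${low_rate:.2f}-${high_rate:.2f} per GPU-hour")
```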
+## Training with Modal
+But what if you don't have two GPUs with 48GB of memory sitting around ready to run GRPO training?
+`axolotl` makes it easy to run training jobs on GPUs in the cloud via the [`--cloud` flag](https://axolotl-ai-cloud.github.io/axolotl/docs/cli.html#remote-compute-with-modal-cloud). Using the [Modal](https://modal.com) serverless platform, this flag will handle the entire process of cloud deployment (and tear-down) for your training run, according to a provided [`cloud.yaml` file](https://axolotl-ai-cloud.github.io/axolotl/docs/cli.html#cloud-configuration-options). The GPUs will be automatically deallocated when training finishes and you will only be charged for the time your code spends running on GPUs. See [Modal's pricing page](https://modal.com/pricing) for rates.
+If you have a Modal account, you can go straight to installing the CLI tools. Otherwise, you will need to [sign up
+for Modal](https://www.modal.com?utm_source=github&utm_medium=github&utm_campaign=axolotl). Modal provides $30 each
+month in free credits.
+### CLI
+First, install the `modal` and `axolotl` CLIs locally with:
+```bash
+pip install modal
+pip install "axolotl @ git+https://github.com/axolotl-ai-cloud/axolotl.git@grpo"
+```
+You can then authenticate to Modal with:
+```bash
+modal token new
+```
+### Training
+In the provided `cloud.yaml`, we start Modal with 2 GPUs. Because we reserve one GPU for generations with vLLM,
+we need to set the `num-processes` to `N - 1` GPUs.
+```bash
+axolotl train --num-processes=1 gsm8k.yaml --cloud cloud.yaml
+```
+### Single GPU
+To train on a single L40S GPU, we can do a few things, such as decreasing the number of generations and reducing the vLLM
+footprint on the remainder of the VRAM. In this configuration, training still takes ~6 hours at a commodity
+cost of \$6-\$12.
+```yaml
+trl:
+  vllm_device: "cuda:0"  # force vllm to the only GPU instead of the "next GPU"
+  vllm_gpu_memory_utilization: 0.2  # reduce the footprint of vllm to use ~9GB VRAM
+  num_generations: 8
+micro_batch_size: 8  # decrease the per gpu batch size to match the number of generations per prompt 
+```
+If you're using Modal to launch your post-training, you'll want to update the `cloud.yaml` appropriately as well.
+```yaml
+gpu_count: 1
+```
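The single-GPU settings above follow from two small calculations, sketched here. The helper names are hypothetical, and reading `num_gpus` in the `gsm8k.yaml` comment ("should match num_generations / num_gpus") as the number of training processes is an assumption; the 48GB figure is the L40S VRAM size mentioned earlier.

```python
def micro_batch_size(num_generations: int, num_train_processes: int) -> int:
    """Per-process batch size so that one step covers all generations for a prompt."""
    if num_generations % num_train_processes != 0:
        raise ValueError("num_generations must be divisible by the number of training processes")
    return num_generations // num_train_processes


def vllm_vram_gb(total_vram_gb: float, gpu_memory_utilization: float) -> float:
    """Approximate VRAM reserved by vLLM for a given utilization fraction."""
    return total_vram_gb * gpu_memory_utilization


two_gpu_batch = micro_batch_size(16, 1)    # 2 GPUs, one reserved for vLLM
single_gpu_batch = micro_batch_size(8, 1)  # 1 GPU shared with vLLM
vllm_share = vllm_vram_gb(48, 0.2)
print(two_gpu_batch, single_gpu_batch, round(vllm_share, 1))
```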
+You can find the results of our experiment with single GPU post-training here: 
+https://wandb.ai/axolotl-ai/gsm8k-grpo-r1/runs/w3gpyhj2
\ No newline at end of file
--- a/axolotl-cookbook/grpo/__pycache__/gsm8k_grpo.cpython-310.pyc
+++ b/axolotl-cookbook/grpo/__pycache__/gsm8k_grpo.cpython-310.pyc
--- a/axolotl-cookbook/grpo/assets/wandb.png
+++ b/axolotl-cookbook/grpo/assets/wandb.png
--- a/axolotl-cookbook/grpo/cloud.yaml
+++ b/axolotl-cookbook/grpo/cloud.yaml
+project_name:
+volumes:
+  - name: axolotl-data
+    mount: /workspace/data
+  - name: axolotl-artifacts
+    mount: /workspace/artifacts
+# environment variables from local to set as secrets
+secrets:
+  - HF_TOKEN
+  - WANDB_API_KEY
+# Which branch of axolotl to use remotely
+branch: grpo
+spawn:
+dockerfile_commands:
+  - 'RUN pip install "vllm==0.7.2"'
+  - 'RUN pip uninstall -y trl && pip install --no-deps "trl @ git+https://github.com/huggingface/trl.git@main"'
+gpu: L40S
+gpu_count: 2
+# Train specific configurations
+memory: 128
+timeout: 86400
+# Preprocess specific configurations
+memory_preprocess: 32
+timeout_preprocess: 14400
--- a/axolotl-cookbook/grpo/gsm8k.yaml
+++ b/axolotl-cookbook/grpo/gsm8k.yaml
+base_model: Qwen/Qwen2.5-1.5B-Instruct
+# Automatically upload checkpoint and final model to HF
+# hub_model_id: username/custom_model_name
+load_in_8bit: false
+load_in_4bit: false
+strict: false
+torch_compile: true
+rl: grpo
+trl:
+  beta: 0.001
+  max_completion_length: 512
+  use_vllm: true
+  reward_funcs:
+    - gsm8k_grpo.correctness_reward_func
+    - gsm8k_grpo.int_reward_func
+    - gsm8k_grpo.strict_format_reward_func
+    - gsm8k_grpo.soft_format_reward_func
+    - gsm8k_grpo.xmlcount_reward_func
+  vllm_gpu_memory_utilization: 0.9
+  vllm_max_model_len: 800
+  num_generations: 16
+chat_template: qwen_25
+datasets:
+  - path: skrishna/gsm8k_only_answer
+    type: gsm8k_grpo.axo_gsm8k_transform
+dataset_prepared_path: /workspace/data/last_run_prepared
+skip_prepare_dataset: true
+val_set_size: 0.0
+output_dir: /workspace/data/axolotl-artifacts/r1-outputs
+dataloader_prefetch_factor: 32
+dataloader_num_workers: 2
+dataloader_pin_memory: true
+gc_steps: 1
+sequence_len: 800
+sample_packing: false
+eval_sample_packing: false
+pad_to_sequence_len: false
+wandb_project: gsm8k-grpo-r1
+wandb_entity:
+wandb_name:
+gradient_accumulation_steps: 8
+micro_batch_size: 16  # should match num_generations / num_gpus
+num_epochs: 1
+optimizer: adamw_torch_fused
+lr_scheduler: constant_with_warmup
+learning_rate: 1.0e-6
+max_grad_norm: 1.0
+weight_decay: 0.1
+bf16: true
+tf32: true
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+flash_attention: true
+logging_steps: 1
+warmup_steps: 100
+evals_per_epoch: 1
+saves_per_epoch: 4
--- a/axolotl-cookbook/grpo/gsm8k_grpo.py
+++ b/axolotl-cookbook/grpo/gsm8k_grpo.py
+# adapted from https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb
+import re
+SYSTEM_PROMPT = """Respond in the following format:
+<reasoning>
+...
+</reasoning>
+<answer>
+...
+</answer>
+"""
+def axo_gsm8k_transform(cfg, *args, **kwargs):
+    # skrishna/gsm8k_only_answer dataset
+    def transform_fn(example, tokenizer=None):
+        label = example["label"].replace(",", "")  # remove commas, e.g. thousands separators
+        return {
+            "prompt": [
+                # improves adherence to the system prompt by having it in the user context
+                {"role": "user", "content": SYSTEM_PROMPT + "\n\n" + example["text"]},
+            ],
+            "answer": label,
+        }
+    return transform_fn, {"remove_columns": ["text", "label"]}
+def extract_xml_answer(text: str) -> str:
+    # collect the answer between the last <answer></answer> tag
+    answer = text.split("<answer>")[-1]
+    answer = answer.split("</answer>")[0]
+    return answer.strip()
+# Reward functions
+def correctness_reward_func(completions, answer, **kwargs) -> list[float]:
+    responses = [completion[0]['content'] for completion in completions]
+    extracted_responses = [extract_xml_answer(r) for r in responses]
+    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
+def int_reward_func(completions, **kwargs) -> list[float]:
+    """
+    gsm8k answers are ints, so rewarding for ints should help steer the model performance
+    """
+    responses = [completion[0]['content'] for completion in completions]
+    extracted_responses = [extract_xml_answer(r) for r in responses]
+    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]
+def strict_format_reward_func(completions, **kwargs) -> list[float]:
+    """Reward function that checks if the completion has a specific format."""
+    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s?$"
+    responses = [completion[0]["content"] for completion in completions]
+    matches = [re.match(pattern, r) for r in responses]
+    return [0.5 if match else 0.0 for match in matches]
+def soft_format_reward_func(completions, **kwargs) -> list[float]:
+    """
+    Reward function that loosely checks if the completion has a specific format,
+    without penalizing deviations in newlines.
+    """
+    pattern = r"<reasoning>.*?</reasoning>\s?<answer>.*?</answer>"
+    responses = [completion[0]["content"] for completion in completions]
+    matches = [re.match(pattern, r, re.S) for r in responses]
+    return [0.25 if match else 0.0 for match in matches]
+def count_xml(text) -> float:
+    count = 0.0
+    if text.count("<reasoning>") == 1:
+        count += 0.125
+    if text.count("</reasoning>") == 1:
+        count += 0.125
+    if text.count("<answer>") == 1:
+        count += 0.125
+    if text.count("</answer>") == 1:
+        count += 0.125
+        # penalize extra tokens after the answer tag
+        count -= (len(text.split("</answer>")[-1]) - 1)*0.001
+    return count
+def xmlcount_reward_func(completions, **kwargs) -> list[float]:
+    """
+    Reward function for having exactly one of each <reasoning>, </reasoning>, <answer>, and </answer> tag.
+    """
+    contents = [completion[0]["content"] for completion in completions]
+    return [count_xml(c) for c in contents]
\ No newline at end of file
--- a/axolotl-cookbook/grpo/stats.py
+++ b/axolotl-cookbook/grpo/stats.py
+from functools import partial
+import numpy as np
+from datasets import load_dataset
+from transformers import AutoTokenizer
+SYSTEM_PROMPT = """Respond in the following format:
+<reasoning>
+...
+</reasoning>
+<answer>
+...
+</answer>
+"""
+def get_tokenization_stats(example, tokenizer=None):
+    messages = {
+        "prompt": [
+            # improves adherence to the system prompt by having it in the user context
+            {"role": "user", "content": SYSTEM_PROMPT + "\n\n" + example["text"]},
+        ],
+    }
+    inputs = tokenizer.apply_chat_template(
+        messages["prompt"], tokenize=True, add_generation_prompt=True
+    )
+    return {
+        "input_ids": inputs,
+    }
+def get_dataset_lengths(dataset):
+    input_ids = dataset.data.column("input_ids")
+    lengths = np.vectorize(len)(np.array(input_ids, dtype=object))
+    return lengths
+def main():
+    ds = load_dataset("skrishna/gsm8k_only_answer", split="train")
+    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
+    stats = partial(get_tokenization_stats, tokenizer=tokenizer)
+    ds = ds.map(stats, remove_columns=["text", "label"])
+    max_input_len = np.max(get_dataset_lengths(ds))
+    print(f"Max input length: {max_input_len}")
+if __name__ == "__main__":
+    main()
\ No newline at end of file
--- a/axolotl-cookbook/lambda/configs/llama-3_1-405b-fft.yaml
+++ b/axolotl-cookbook/lambda/configs/llama-3_1-405b-fft.yaml
+base_model: $PATH_STORAGE/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-405B/snapshots/222de096204587406c7cadb3e0a101aade116279
+tokenizer_type: AutoTokenizer
+strict: false
+plugins:
+  - axolotl.integrations.liger.LigerPlugin
+liger_rope: true
+liger_rms_norm: true
+liger_swiglu: true
+liger_fused_linear_cross_entropy: true
+chat_template: llama3
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train
+dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
+val_set_size: 0.0
+output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/llama3_1-405b-finetome
+save_safetensors: false  # saving final sharded dict may not work with safetensors
+wandb_project: llama-3.1-405b-fft-finetome
+wandb_entity:
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+gradient_accumulation_steps: 1
+micro_batch_size: 1
+num_epochs: 2
+optimizer: adamw_torch
+lr_scheduler: cosine
+learning_rate: 1.0e-5
+train_on_inputs: false
+group_by_length: false
+bf16: true
+tf32: true
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+logging_steps: 1
+flash_attention: true
+warmup_steps: 100
+saves_per_epoch: 1
+weight_decay: 0.1
+fsdp_final_state_dict_type: SHARDED_STATE_DICT
+fsdp:
+  - full_shard
+  - auto_wrap
+fsdp_config:
+  fsdp_limit_all_gathers: true
+  fsdp_sync_module_states: true
+  fsdp_offload_params: true
+  fsdp_use_orig_params: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sharding_strategy: FULL_SHARD
+special_tokens:
+  pad_token: <|finetune_right_pad_id|>
+  eos_token: <|eot_id|>
--- a/axolotl-cookbook/lambda/configs/llama-3_1-405b-qlora.yaml
+++ b/axolotl-cookbook/lambda/configs/llama-3_1-405b-qlora.yaml
+base_model: meta-llama/Meta-Llama-3.1-405B
+tokenizer_type: AutoTokenizer
+load_in_4bit: true
+strict: false
+chat_template: llama3
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train
+dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
+val_set_size: 0.0
+output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/qlora-llama3_1-405b-finetome
+save_safetensors: true
+adapter: qlora
+wandb_project: qlora-405b-finetome
+wandb_entity: oaaic
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+lora_r: 16
+lora_alpha: 32
+lora_dropout: 0.05
+lora_target_modules:
+lora_target_linear: true
+gradient_accumulation_steps: 4
+micro_batch_size: 1
+num_epochs: 2
+optimizer: adamw_torch
+lr_scheduler: cosine
+learning_rate: 0.00001
+train_on_inputs: false
+group_by_length: false
+bf16: true
+tf32: true
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: true
+logging_steps: 1
+flash_attention: true
+warmup_steps: 10
+evals_per_epoch: 4
+saves_per_epoch: 1
+weight_decay: 0.0
+fsdp:
+  - full_shard
+  - auto_wrap
+fsdp_config:
+  fsdp_limit_all_gathers: true
+  fsdp_sync_module_states: true
+  fsdp_offload_params: true
+  fsdp_use_orig_params: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_sharding_strategy: FULL_SHARD
+special_tokens:
+  pad_token: <|finetune_right_pad_id|>
\ No newline at end of file
--- a/axolotl-cookbook/lambda/configs/llama-3_1-70b-fft.yaml
+++ b/axolotl-cookbook/lambda/configs/llama-3_1-70b-fft.yaml
+base_model: meta-llama/Meta-Llama-3.1-70B
+tokenizer_type: AutoTokenizer
+strict: false
+plugins:
+  - axolotl.integrations.liger.LigerPlugin
+liger_rope: true
+liger_rms_norm: true
+liger_swiglu: true
+liger_fused_linear_cross_entropy: true
+chat_template: llama3
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train
+dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
+val_set_size: 0.0
+output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/llama3_1-70b-finetome
+save_safetensors: false  # saving final sharded dict may not work with safetensors
+wandb_project: llama-3.1-70b-fft-finetome
+wandb_entity:
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+gradient_accumulation_steps: 1
+# 8x Node can support a batch size of up to 3
+micro_batch_size: 3
+num_epochs: 2
+optimizer: adamw_torch
+lr_scheduler: cosine
+learning_rate: 3.0e-5
+train_on_inputs: false
+group_by_length: false
+bf16: true
+tf32: true
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+logging_steps: 1
+flash_attention: true
+warmup_steps: 30
+saves_per_epoch: 1
+weight_decay: 0.1
+fsdp_final_state_dict_type: SHARDED_STATE_DICT
+fsdp:
+  - full_shard
+  - auto_wrap
+fsdp_config:
+  fsdp_limit_all_gathers: true
+  fsdp_sync_module_states: true
+  fsdp_offload_params: true
+  fsdp_use_orig_params: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_backward_prefetch: BACKWARD_PRE
+special_tokens:
+  pad_token: <|finetune_right_pad_id|>
+  eos_token: <|eot_id|>
--- a/axolotl-cookbook/lambda/configs/llama-3_1-8b.yaml
+++ b/axolotl-cookbook/lambda/configs/llama-3_1-8b.yaml
+base_model: meta-llama/Meta-Llama-3.1-8B
+tokenizer_type: AutoTokenizer
+strict: false
+plugins:
+  - axolotl.integrations.liger.LigerPlugin
+liger_rope: true
+liger_rms_norm: true
+liger_swiglu: true
+liger_fused_linear_cross_entropy: true
+chat_template: llama3
+datasets:
+  - path: mlabonne/FineTome-100k
+    type: chat_template
+    split: train
+dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
+val_set_size: 0.0
+output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/llama3_1-8b-finetome
+save_safetensors: false  # saving final sharded dict may not work with safetensors
+wandb_project: llama-3.1-8b-fft
+wandb_entity:
+sequence_len: 4096
+sample_packing: true
+pad_to_sequence_len: true
+gradient_accumulation_steps: 1
+micro_batch_size: 4
+num_epochs: 1
+optimizer: adamw_torch
+lr_scheduler: cosine
+learning_rate: 1.0e-6
+train_on_inputs: false
+group_by_length: false
+bf16: true
+tf32: true
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+logging_steps: 1
+flash_attention: true
+warmup_steps: 100
+evals_per_epoch: 4
+saves_per_epoch: 1
+weight_decay: 0.1
+fsdp_final_state_dict_type: SHARDED_STATE_DICT
+fsdp:
+  - full_shard
+  - auto_wrap
+fsdp_config:
+  fsdp_limit_all_gathers: true
+  fsdp_sync_module_states: true
+  fsdp_offload_params: true
+  fsdp_use_orig_params: false
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_sharding_strategy: FULL_SHARD
+special_tokens:
+  pad_token: <|finetune_right_pad_id|>
+  eos_token: <|eot_id|>
--- a/axolotl-cookbook/lambda/setup.sh
+++ b/axolotl-cookbook/lambda/setup.sh
+#!/bin/bash
+# Check if the argument is provided
+if [ -z "$1" ]; then
+  echo "Usage: $0 <path_storage>"
+  exit 1
+fi
+# Exit immediately if a command exits with a non-zero status
+set -e
+PATH_STORAGE=$1
+# Check if Miniconda is already installed
+if [ ! -d "/home/ubuntu/miniconda3" ]; then
+    # Download and install Miniconda
+    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
+    bash /tmp/miniconda.sh -b -p /home/ubuntu/miniconda3 -u
+    /home/ubuntu/miniconda3/bin/conda init bash
+    echo 'export PATH="/home/ubuntu/miniconda3/bin:$PATH"' >> ~/.bashrc
+    source ~/.bashrc
+else
+    /home/ubuntu/miniconda3/bin/conda init bash
+    echo 'export PATH="/home/ubuntu/miniconda3/bin:$PATH"' >> ~/.bashrc
+    source ~/.bashrc
+fi
+# Configure Conda and create PyTorch environment
+/home/ubuntu/miniconda3/bin/conda remove -n pytorch --all -y
+/home/ubuntu/miniconda3/bin/conda install python=3.11 -y
+/home/ubuntu/miniconda3/bin/conda install -c conda-forge libstdcxx-ng -y
+/home/ubuntu/miniconda3/bin/conda create -n pytorch python=3.11 -y
+echo 'conda activate pytorch' >> ~/.bashrc
+# Install PyTorch
+/home/ubuntu/miniconda3/bin/conda run -n pytorch conda install pytorch==2.3.1 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
+# Verify PyTorch installation
+/home/ubuntu/miniconda3/bin/conda run -n pytorch python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
+# Add Git LFS repository and install
+sudo add-apt-repository -y ppa:git-core/ppa
+sudo apt-get update
+sudo apt-get install -y git-lfs
+git lfs install --skip-repo
+# Install awscli and pydantic in PyTorch environment
+/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install awscli==1.33.13 packaging
+/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install -U --no-cache-dir pydantic==1.10.10
+# Install system packages
+sudo apt-get install -y vim curl nano rsync s3fs net-tools nvtop infiniband-diags pdsh libaio-dev
+# Add ubuntu user to root group
+sudo usermod -aG root ubuntu
+mkdir -p ${PATH_STORAGE}/axolotl-artifacts/{configs,outputs}
+# Clone axolotl repository
+CLONE_DIR=${PATH_STORAGE}/axolotl
+if [ ! -d "$CLONE_DIR" ]; then
+    git clone --single-branch https://github.com/axolotl-ai-cloud/axolotl.git $CLONE_DIR
+fi
+# Install causal_conv1d and axolotl in PyTorch environment
+export PATH="/home/ubuntu/miniconda3/envs/pytorch/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/home/ubuntu/.local/bin:/usr/lib/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
+export LD_LIBRARY_PATH="/usr/lib/cuda/lib64"
+export CUDA_HOME="/usr/lib/nvidia-cuda-toolkit"
+/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install causal_conv1d
+/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install --no-cache -e "$CLONE_DIR[deepspeed,flash-attn,optimizers]"
+# Configure git credential helper
+git config --global credential.helper store
+# Install and configure tmux
+sudo apt-get install -y tmux
+echo '# Run tmux only in interactive shells
+if [[ $- == *i* ]] && [[ -z "$TMUX" ]]; then
+  tmux attach-session -t ssh_tmux || tmux new-session -s ssh_tmux
+fi' >> ~/.bashrc
+# Create huggingface cache directory and set environment variables
+mkdir -p ${PATH_STORAGE}/.cache/huggingface
+echo "export HF_HOME=\"${PATH_STORAGE}/.cache/huggingface\"" >> ~/.bashrc
+echo 'export HF_HUB_ENABLE_HF_TRANSFER="1"' >> ~/.bashrc
+echo 'export PATH="/home/ubuntu/miniconda3/envs/pytorch/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/home/ubuntu/.local/bin:/usr/lib/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"' >> ~/.bashrc
+echo 'export LD_LIBRARY_PATH="/usr/lib/cuda/lib64"' >> ~/.bashrc
+echo 'export CUDA_HOME="/usr/lib/nvidia-cuda-toolkit"' >> ~/.bashrc
+# default .profile adds ~/.local/bin which breaks pip/python
+sed -i '/if \[ -d "\$HOME\/\.local\/bin" \] ; then/,/fi/s/^/#/' ~/.profile
--- a/axolotl-cookbook/lambda/train.sh
+++ b/axolotl-cookbook/lambda/train.sh
+#!/bin/bash
+# Check if the argument is provided
+if [ -z "$1" ] || [ -z "$2" ]; then
+  echo "Usage: $0 <path_storage> <main_node>"
+  exit 1
+fi
+PATH_STORAGE=$1
+MAIN_NODE=$2
+NUM_NODES=8
+JOB_ID=axolotl-lambda
+# You may need to change the MAIN_NODE to the resolved ip address
+MAIN_NODE=${MAIN_NODE}:29500
+YAML_CFG=${PATH_STORAGE}/axolotl-cookbook/lambda/configs/llama-3_1-405b-fft.yaml
+export NODE_IDX=$((10#$(hostname | grep -oE '[0-9]+$') - 1))
+/home/ubuntu/miniconda3/envs/pytorch/bin/torchrun --nnodes=$NUM_NODES --nproc-per-node=8 --node-rank=0 --rdzv-backend=c10d --rdzv-id=$JOB_ID --rdzv-endpoint=$MAIN_NODE -m axolotl.cli.train $YAML_CFG
--- a/axolotl-cookbook/prm/README.md
+++ b/axolotl-cookbook/prm/README.md
+# Process Reward Modelling with axolotl
+This cookbook accompanies our [Training Process Reward Models in axolotl](https://axolotlai.substack.com/p/process-reward-models) blog post, and provides details on reproducing the evaluation results for [axolotl-ai-co/Qwen2.5-Math-PRM-7B](https://huggingface.co/axolotl-ai-co/Qwen2.5-Math-PRM-7B).
+### ProcessBench
+```bash
+torchrun --nproc_per_node=4 eval_process_bench.py --model axolotl-ai-co/Qwen2.5-Math-PRM-7B -b 24 -w 4 -s "\n\n"
+GSM8K:
+err   corr   F1
+----- ------ ----
+55.5   98.4  71.0
+MATH:
+err   corr   F1
+----- ------ ----
+49.8   91.9  64.6
+OlympiadBench:
+err   corr   F1
+----- ------ ----
+31.2   87.3  46.0
+Omni-MATH:  
+err   corr   F1
+----- ------ ----
+24.6   87.1  38.3
+Average F1 across datasets: 55.0
+```
+### Best of N
+```bash
+python bon.py --base_model Qwen/Qwen2.5-1.5B-Instruct  --prm_model axolotl-ai-co/Qwen2.5-Math-PRM-7B --n 16
+```
+Example outputs can be seen in `bon_qwen1.5B-instruct_n=16_results.txt` and `bon_qwen1.5B-instruct_n=1_results.txt` for `n=16` and `n=1`, respectively.
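For readers unfamiliar with Best of N, the idea is to sample `n` completions, score each step of each completion with the PRM, reduce the step scores to one number, and keep the highest-scoring completion. The sketch below mirrors the `min`, `prod`, and `last` strategies of `aggregate_scores` in `bon.py`; the `best_of_n` helper, the argmax selection, and the step probabilities are assumptions made for illustration, since the selection step of the script is not shown here.

```python
import math


def aggregate(step_scores: list[float], strategy: str = "prod") -> float:
    """Reduce per-step PRM probabilities to a single score."""
    if strategy == "min":
        return min(step_scores)
    if strategy == "prod":
        return math.prod(step_scores)
    if strategy == "last":
        return step_scores[-1]
    raise ValueError(f"Invalid aggregation strategy: {strategy}")


def best_of_n(candidates: list[list[float]], strategy: str = "prod") -> int:
    """Return the index of the completion with the highest aggregated score."""
    totals = [aggregate(scores, strategy) for scores in candidates]
    return max(range(len(totals)), key=totals.__getitem__)


# Hypothetical per-step PRM probabilities for three sampled completions.
candidates = [
    [0.9, 0.8, 0.95],
    [0.99, 0.5, 0.99],
    [0.85, 0.85, 0.9],
]

picks = {s: best_of_n(candidates, s) for s in ("prod", "min", "last")}
print(picks)
```

With these made-up scores each strategy selects a different completion, which shows why the choice of aggregation matters.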
--- a/axolotl-cookbook/prm/bon.py
+++ b/axolotl-cookbook/prm/bon.py
+# credit to the HF team
+# https://github.com/huggingface/search-and-learn/blob/main/src/sal/search/best_of_n.py
+import argparse
+import gc
+import math
+import os
+import random
+from math import perm
+import numpy as np
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+from vllm import LLM, SamplingParams
+from vllm.distributed.parallel_state import (
+    destroy_distributed_environment,
+    destroy_model_parallel,
+)
+from tqdm import tqdm
+def set_seed(seed):
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+SYSTEM_PROMPT: str = (
+    "Solve the following math problem efficiently and clearly:\n\n"
+    "- For simple problems (2 steps or fewer):\n"
+    "Provide a concise solution with minimal explanation.\n\n"
+    "- For complex problems (3 steps or more):\n"
+    "Use this step-by-step format:\n\n"
+    "## Step 1: [Concise description]\n"
+    "[Brief explanation and calculations]\n\n"
+    "## Step 2: [Concise description]\n"
+    "[Brief explanation and calculations]\n\n"
+    "...\n\n"
+    "Regardless of the approach, always conclude with:\n\n"
+    "Therefore, the final answer is: $\\boxed{answer}$. I hope it is correct.\n\n"
+    "Where [answer] is just the final number or expression that solves the problem."
+)
+def find_first_zero(tensor):
+    zeros = (tensor == 0).nonzero()
+    return zeros[0].item() if zeros.numel() > 0 else -1
+def score(
+    prm_model,
+    questions: list[str],
+    outputs: list[list[str]],
+    step_separator: str,
+    tokenizer: AutoTokenizer,
+) -> list[list[float]]:
+    """
+    This function scores a list of questions and their completions using the PRM model.
+    """
+    inputs_for_prm = []
+    separator_ids = tokenizer.encode(step_separator, add_special_tokens=False, return_tensors="pt")
+    score_idxs = []
+    for question, responses in zip(questions, outputs):
+        prompt_ids = tokenizer.encode(
+            question + "\n",
+            add_special_tokens=False,
+            return_tensors="pt",
+        )
+        score_idxs.append([])
+        for response in responses:
+            steps = response.split("\n\n")
+            score_idxs[-1].append([])
+            # start every response from the question prompt so that earlier
+            # responses are not carried over into later inputs
+            response_ids = prompt_ids
+            for step in steps:
+                step_ids = tokenizer.encode(step + "\n\n", add_special_tokens=False, return_tensors="pt")
+                response_ids = torch.cat([response_ids, step_ids, separator_ids], dim=-1)
+                score_idxs[-1][-1].append(response_ids.size(-1) - 1)
+            inputs_for_prm.append(response_ids)
+    # right pad input_ids
+    pad_token_id = tokenizer.pad_token_id
+    max_len = max([i.size(-1) for i in inputs_for_prm])
+    for i, input_idx in enumerate(inputs_for_prm):
+        inputs_for_prm[i] = torch.cat(
+            [
+                input_idx.squeeze(),
+                torch.LongTensor([pad_token_id] * (max_len - input_idx.size(-1))),
+            ]
+        )
+    inputs_for_prm = torch.stack(inputs_for_prm).to(torch.long).to(prm_model.device)
+    with torch.no_grad():
+        batch_size = 4
+        all_probs = []
+        for i in range(0, inputs_for_prm.size(0), batch_size):
+            batch = inputs_for_prm[i : i + batch_size]
+            logits = prm_model(batch).logits  # Shape: [batch, seq_len, 2]
+            # Get probability of positive class (index 1)
+            batch_probs = torch.softmax(logits, dim=-1)[:, :, 1].cpu()  # Shape: [batch, seq_len]
+            all_probs.append(batch_probs)
+            del logits
+        probs = torch.cat(all_probs, dim=0)  # Combine all batches
+    output_scores = []
+    current_idx = 0
+    for question_scores in score_idxs:
+        num_completions = len(question_scores)
+        question_output = []
+        for i in range(num_completions):
+            score_positions = question_scores[i]
+            # Just get the scores at the specified positions
+            score_value = probs[current_idx, score_positions].tolist()
+            question_output.append(score_value)
+            current_idx += 1
+        output_scores.append(question_output)
+    del inputs_for_prm
+    torch.cuda.empty_cache()
+    return output_scores
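+def aggregate_scores(scores: list[float], agg_strategy: str = "prod") -> float:
+    """Collapse per-step PRM scores into one score: the minimum, the product, or the last step's score."""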
+    if agg_strategy == "min":
+        return min(scores)
+    elif agg_strategy == "prod":
+        return math.prod(scores)
+    elif agg_strategy == "last":
+        return scores[-1]
+    else:
+        raise ValueError(f"Invalid aggregation strategy: {agg_strategy}")
+def main(args):
+    llm = LLM(
+        model=args.base_model,
+        enable_prefix_caching=True,
+        seed=42,
+        tensor_parallel_size=args.num_gpus,
+        gpu_memory_utilization=0.3,
+    )
+    tokenizer = llm.get_tokenizer()
+    # example problems from the MATH-500 dataset https://huggingface.co/datasets/HuggingFaceH4/MATH-500
+    x = {
+        "problem": [
+            r"Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$",
+            r"A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?",
+            r"The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).",
+        ]
+    }
+    answers = ["p - q", "42", "4"]
+    convs = [
+        [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": prompt},
+        ]
+        for prompt in x["problem"]
+    ]
+    # tokenizer.chat_template = CUSTOM_CHAT_TEMPLATE
+    templated_convs = tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
+    # Duplicate each templated prompt args.n times so we can do continuous batching
+    # This makes [p1, p2, p3, p4] become [p1, p1, p2, p2, p3, p3, p4, p4] for e.g. args.n=2
+    templated_convs = [c for conv in templated_convs for c in [conv] * args.n]
+    # Initialize one empty list of completions per problem
+    completions = [[] for _ in range(len(x["problem"]))]
+    sampling_params = SamplingParams(
+        temperature=0.8,
+        max_tokens=2048,
+        top_p=1.0,
+        n=1,  # The prompts were already duplicated args.n times, so generate 1 completion per prompt
+    )
+    responses = llm.generate(
+        templated_convs,
+        sampling_params=sampling_params,
+        use_tqdm=True,
+    )
+    if len(responses) != len(x["problem"]) * args.n:
+        raise ValueError(f"Generated {len(responses)} responses instead of {len(x['problem']) * args.n}")
+    for i in range(len(completions)):
+        completions[i] = [output.text for r in responses[i * args.n : (i + 1) * args.n] for output in r.outputs]
+    # Check we generated the correct number of completions for each prompt
+    for c in completions:
+        if len(c) != args.n:
+            raise ValueError(f"Generated {len(c)} completions instead of {args.n}")
+    # destroy vllm process
+    destroy_model_parallel()
+    destroy_distributed_environment()
+    del llm.llm_engine.model_executor.driver_worker
+    del llm
+    gc.collect()
+    torch.cuda.empty_cache()
+    # completions is now a List[List[str]] with NUM_PROMPTS rows of N completions each
+    if args.n > 1:
+        with torch.device("cuda"):
+            prm_model = AutoModelForTokenClassification.from_pretrained(args.prm_model).to(torch.bfloat16)
+        tokenizer = AutoTokenizer.from_pretrained(args.prm_model)
+        prm_model.eval()
+        scores = score(prm_model, x["problem"], completions, args.separator, tokenizer)
+        # Use a loop variable that does not reuse the name of the score() function
+        agg_scores = [[aggregate_scores(s, agg_strategy="prod") for s in step_scores] for step_scores in scores]
+        # Select the completion with the highest score
+        pred = [completion[np.argmax(s)] for completion, s in zip(completions, agg_scores)]
+    else:
+        pred = [c[0] for c in completions]
+    with open("bon_results.txt", "w") as f:
+        for i in range(len(x["problem"])):
+            separator = "=" * 80
+            print(separator)
+            print("Problem: ", x["problem"][i])
+            print("Predicted answer (BoN): ", pred[i])
+            print("Correct Answer: ", answers[i])
+            # Write to file
+            f.write(f"{separator}\n")
+            f.write(f"Problem: {x['problem'][i]}\n")
+            f.write(f"Predicted answer (BoN): {pred[i]}\n")
+            f.write(f"Correct Answer: {answers[i]}\n")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--base_model", type=str, required=True)
+    parser.add_argument("--prm_model", type=str, required=True)
+    parser.add_argument("--num_gpus", type=int, default=2)
+    parser.add_argument(
+        "--separator",
+        type=str,
+        default="\n\n",
+        help="It's important to use the same separator as the one used during TRL training",
+    )
+    parser.add_argument("--n", type=int, default=8)
+    args = parser.parse_args()
+    set_seed(42)
+    os.environ["TOKENIZERS_PARALLELISM"] = "false"
+    main(args)
--- a/axolotl-cookbook/prm/bon_qwen1.5B-instruct_n=16_results.txt
+++ b/axolotl-cookbook/prm/bon_qwen1.5B-instruct_n=16_results.txt
+================================================================================
+Problem: Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$
+Predicted answer (BoN): To solve the given problem, we need to express the double sum \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\).
+First, let's rewrite the double sum by changing the order of summation. We start by summing over \(m = j + k\). For a fixed \(m\), \(j\) and \(k\) can take any values from 1 to \(\infty\) such that \(j + k = m\). 
+The number of pairs \((j, k)\) that satisfy \(j + k = m\) is \(m-1\). Therefore, we can express the sum as:
+\[
+\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{m=1}^\infty \frac{m-1}{m^3}.
+\]
+To evaluate this sum, we can split it into two separate sums:
+\[
+\sum_{m=1}^\infty \frac{m-1}{m^3} = \sum_{m=1}^\infty \left( \frac{m}{m^3} - \frac{1}{m^3} \right) = \sum_{m=1}^\infty \left( \frac{1}{m^2} - \frac{1}{m^3} \right).
+\]
+This can be further simplified into two separate sums:
+\[
+\sum_{m=1}^\infty \frac{1}{m^2} - \sum_{m=1}^\infty \frac{1}{m^3} = p - q.
+\]
+Thus, the original sum can be expressed in terms of \(p\) and \(q\) as:
+\[
+\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = p - q.
+\]
+Therefore, the final answer is:
+\[
+\boxed{p - q}.
+\]
+Correct Answer: p - q
+================================================================================
+Problem: A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?
+Predicted answer (BoN): ## Step 1:**
+First, let's understand that a regular hexagon can be divided into six equilateral triangles. The perimeter of one of these triangles is given as 21 inches. Since all sides of an equilateral triangle are equal, each side of the triangle must be 21 inches divided by 3, as it's equal to the perimeter divided by 3.
+**Calculation for one side of the triangle:**
+\[ \text{Side length of the triangle} = \frac{21}{3} = 7 \text{ inches} \]
+## Step 2:**
+Now, since the hexagon is made up of these equilateral triangles, each side of the hexagon will be equal to the side length of one of these triangles. Therefore, the perimeter of the hexagon, which is the sum of all its sides, will be six times the side length of one of the triangles.
+**Calculation for the perimeter of the hexagon:**
+\[ \text{Perimeter of the hexagon} = 6 \times 7 \]
+## Step 3:**
+Perform the multiplication to find the final perimeter.
+\[ \text{Perimeter of the hexagon} = 6 \times 7 = 42 \text{ inches} \]
+Therefore, the final answer is: $\boxed{42}$.
+Correct Answer: 42
+================================================================================
+Problem: The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).
+Predicted answer (BoN): To solve the problem of determining how many different values can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses, we need to consider the different ways we can group the terms using parentheses. Since multiplication is carried out before addition, we need to consider all possible ways we can group the numbers \(2\), \(3\), \(4\), and \(5\) using parentheses.
+First, let's list all possible combinations of parentheses groups. Each group will have a different number of terms inside them. The possible groups are:
+1. \(2, 3, 4, 5\)
+2. \(2, 3 \cdot 4, 5\)
+3. \(2, 3 \cdot (4 \cdot 5)\)
+4. \(2 \cdot 3, 4, 5\)
+5. \(2 \cdot 3 \cdot 4, 5\)
+6. \(2 \cdot 3 \cdot (4 \cdot 5)\)
+7. \(2, (3 \cdot 4) \cdot (5 + 1)\)
+8. \(2 \cdot (3 \cdot 4), 5 + 1\)
+9. \(2 \cdot (3 \cdot (4 \cdot 5))\)
+10. \(2 \cdot (3 \cdot (4 \cdot (5 + 1)))\)
+Now, let's evaluate each of these groups:
+1. \(2, 3, 4, 5\) : This evaluates to \(2 \cdot 3 \cdot 4 \cdot 5 + 1 = 121\).
+2. \(2, 3 \cdot 4, 5\) : This evaluates to \(2 \cdot (3 \cdot 4) + 5 = 2 \cdot 12 + 5 = 29\).
+3. \(2, 3 \cdot (4 \cdot 5)\) : This evaluates to \(2 \cdot (3 \cdot 20) + 1 = 2 \cdot 60 + 1 = 121\).
+4. \(2, 3 \cdot (4 \cdot 5)\) (repeated, but valid) : This is equivalent to the third case, evaluating to 121.
+5. \(2, 3 \cdot (4 \cdot 5)\) : This is equivalent to the third case, evaluating to 121.
+6. \(2, (3 \cdot 4) \cdot (5 + 1)\) : This evaluates to \(2 \cdot (12 \cdot 6) + 1 = 2 \cdot 72 + 1 = 145\).
+7. \(2 \cdot (3 \cdot 4), 5 + 1\) : This evaluates to \(2 \cdot (12) + 6 = 2 \cdot 12 + 6 = 30\).
+8. \(2 \cdot (3 \cdot (4 \cdot 5))\) : This evaluates to \(2 \cdot (3 \cdot 20) + 1 = 2 \cdot 60 + 1 = 121\).
+9. \(2 \cdot (3 \cdot (4 \cdot 5))\) (repeated, but valid) : This is equivalent to the eighth case, evaluating to 121.
+10. \(2 \cdot (3 \cdot (4 \cdot (5 + 1)))\) : This evaluates to \(2 \cdot (3 \cdot (4 \cdot 6)) + 1 = 2 \cdot (3 \cdot 24) + 1 = 2 \cdot 72 + 1 = 145\).
+From the above evaluations, we see that the distinct values obtained are 121, 145, and 30. Therefore, the number of different values that can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses is \(\boxed{3}\).
+Correct Answer: 4
--- a/axolotl-cookbook/prm/bon_qwen1.5B-instruct_n=1_results.txt
+++ b/axolotl-cookbook/prm/bon_qwen1.5B-instruct_n=1_results.txt
+================================================================================
+Problem: Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$
+Predicted answer (BoN): To solve the given problem, we need to express the double sum \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\).
+First, let's analyze the inner sum \(\sum_{k=1}^\infty \frac{1}{(j+k)^3}\). This sum can be written as:
+\[
+\sum_{k=1}^\infty \frac{1}{(j+k)^3}
+\]
+We can change the order of summation. Instead of summing over \(k\) first, we sum over \(j\) first. For a fixed \(j\), \(k\) ranges from 1 to \(\infty\). When \(j\) is fixed, \(k\) can be written as \(j + (n-j)\) where \(n\) is an integer. So, the sum becomes:
+\[
+\sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{j=1}^\infty \sum_{n=1}^\infty \frac{1}{(n)^3} = \sum_{j=1}^\infty \sum_{n=1}^\infty \frac{1}{n^3}
+\]
+This is exactly \(q\). So we have:
+\[
+\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = q
+\]
+Now, we need to express \(q\) in terms of \(p\) and \(q\). Recall that:
+\[
+p = \sum_{k=1}^\infty \frac{1}{k^2}
+\]
+We need to find a way to express \(q\) in terms of \(p\). Notice that:
+\[
+\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{j^3} \cdot \frac{1}{k^3} = \left( \sum_{j=1}^\infty \frac{1}{j^3} \right) \left( \sum_{k=1}^\infty \frac{1}{k^3} \right) = p \cdot q
+\]
+Thus, we have:
+\[
+q = \frac{p}{p} = \frac{p}{q}
+\]
+So, the expression for \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\) is:
+\[
+\boxed{q}
+\]
+Correct Answer: p - q
+================================================================================
+Problem: A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?
+Predicted answer (BoN): ## Step 1: ##
+To solve this problem, we need to understand the relationship between the side length of the equilateral triangle and the side length of the regular hexagon it's divided into.
+## Step 2: ##
+Given that the perimeter of one equilateral triangle is 21 inches, we know that the perimeter of a regular hexagon composed of six such equilateral triangles is simply 6 times the perimeter of one triangle because all sides of the hexagon are equal in length and they come from equilateral triangles.
+Therefore, the perimeter of the hexagon is 6 * 21.
+## Step 3: ##
+Calculating the perimeter of the hexagon, we get 6 * 21 = 126 inches.
+## Step 4: ##
+Thus, the final answer is 126 inches.
+Therefore, the perimeter of the regular hexagon is $\boxed{126}$.
+Correct Answer: 42
+================================================================================
+Problem: The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).
+Predicted answer (BoN): To solve the problem of how many values can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses, we need to consider all possible ways to insert parentheses and evaluate the expression for each case.
+Given the expression \( (a \cdot (b \cdot (c \cdot d) + e)) + f \), we need to evaluate it for all possible placements of parentheses.
+### Step-by-Step Solution:
+1. **Evaluate without parentheses:**
+   \[
+   2 \cdot 3 \cdot 4 \cdot 5 + 1 = 120 + 1 = 121
+   \]
+2. **Evaluate with one set of parentheses:**
+   - \( (2 \cdot (3 \cdot 4) \cdot 5 + 1) \)
+   - \( 2 \cdot ( (3 \cdot 4) \cdot 5 + 1) \)
+   - \( (2 \cdot 3) \cdot (4 \cdot 5) + 1 \)
+   - \( (2 \cdot 3) \cdot ( (4 \cdot 5) + 1) \)
+   - \( 2 \cdot (3 \cdot (4 \cdot 5) + 1) \)
+   - \( 2 \cdot ((3 \cdot 4) \cdot 5) + 1 \)
+3. **Evaluate with two sets of parentheses:**
+   - \( ((2 \cdot 3) \cdot (4 \cdot 5)) + 1 \)
+   - \( (2 \cdot ((3 \cdot 4) \cdot 5)) + 1 \)
+   - \( ((2 \cdot 3) \cdot (5 \cdot 4)) + 1 \)
+   - \( (2 \cdot (3 \cdot (4 \cdot 5))) + 1 \)
+   - \( 2 \cdot ( (3 \cdot (4 \cdot 5)) + 1) \)
+   - \( 2 \cdot ((3 \cdot 4) \cdot (5 + 1)) \)
+   - \( 2 \cdot ((3 \cdot 4) \cdot ((5 + 1)) \cdot 5) \)
+   - \( 2 \cdot ((3 \cdot 4) \cdot ((5 + 1) \cdot 5)) \)
+   - \( ((2 \cdot 3) \cdot (4 \cdot (5 + 1))) + 1 \)
+   - \( ((2 \cdot 3) \cdot ((4 \cdot 5) + 1)) + 1 \)
+   - \( ((2 \cdot 3) \cdot ((4 \cdot 5) + 1)) \cdot 5 \)
+### Conclusion:
+By evaluating the expression for all possible placements of parentheses, we find that there are \( 14 \) distinct values that can be obtained. Therefore, the final answer is:
+\[
+\boxed{14}
+\]
+Correct Answer: 4
--- a/axolotl-cookbook/prm/eval_process_bench.py
+++ b/axolotl-cookbook/prm/eval_process_bench.py
+# credit to the GitHub user @CJReinforce
+# https://github.com/QwenLM/ProcessBench/pull/12
+import argparse
+import json
+import os
+import random
+from copy import deepcopy
+import numpy as np
+import torch
+import transformers
+from accelerate import Accelerator
+from datasets import load_dataset
+from torch.utils.data import DataLoader, DistributedSampler
+from tqdm import tqdm
+def collate_fn(batch, tokenizer, separator = '\n'):
+    input_ids = []
+    score_ids = []
+    labels = []
+    separator_ids = tokenizer.encode(separator, add_special_tokens=False, return_tensors='pt')
+    for i in batch:
+        prompt_ids = tokenizer(i['problem'], add_special_tokens=False, return_tensors='pt')['input_ids']
+        score_ids.append([])
+        for completion in i['steps']:
+            completion_ids = tokenizer(completion, add_special_tokens=False, return_tensors='pt')['input_ids']
+            prompt_ids = torch.cat([prompt_ids, completion_ids, separator_ids], dim=-1)
+            score_ids[-1].append(prompt_ids.size(-1) - 1)
+        labels.append(i['label'])
+        input_ids.append(prompt_ids)
+    # right pad input_ids
+    pad_token_id = tokenizer.pad_token_id
+    max_len = max([i.size(-1) for i in input_ids])
+    for i, input_idx in enumerate(input_ids):
+        input_ids[i] = torch.cat([
+            input_idx.squeeze(), 
+            torch.LongTensor(
+                [pad_token_id] * (max_len - input_idx.size(-1))
+            )
+        ])
+    input_ids = torch.stack(input_ids)
+    return dict(
+        input_ids=input_ids,
+        labels=labels,
+        score_ids=score_ids
+    )
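+def find_first_zero(tensor):
+    # Index of the first step predicted as class 0, or -1 if no step is; the dataset uses label -1 for fully correct solutions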
+    zeros = (tensor == 0).nonzero()
+    return zeros[0].item() if zeros.numel() > 0 else -1
+def gather_objects(data, accelerator):
+    world_size = accelerator.num_processes
+    if world_size == 1:
+        return data
+    all_data = [None] * world_size
+    torch.distributed.all_gather_object(all_data, data)
+    if accelerator.is_main_process:
+        result = []
+        for process_data in all_data:
+            result.extend(process_data)
+        return result
+    return None
+def set_seed(seed):
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+def main(args):
+    bs = args.batch_size
+    num_of_workers = args.num_of_workers
+    separator = args.separator
+    model_path = args.model
+    model_name = model_path.split('/')[-1]
+    configs = {
+        'gsm8k': [207, 193], # error / correct num
+        'math': [594, 406], 
+        'olympiadbench': [661, 339], 
+        'omnimath': [759, 241],
+    }
+    all_f1_scores = []
+    save_dir = f'outputs/{model_name}'
+    os.makedirs(save_dir, exist_ok=True)
+    accelerator = Accelerator()
+    print(f'Loading model from {model_path}')
+    model = transformers.AutoModelForTokenClassification.from_pretrained(model_path)
+    tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
+    model = accelerator.prepare(model)
+    model.eval()
+    for config, num in configs.items():
+        dataset = load_dataset("Qwen/ProcessBench", split=config)
+        sampler = None
+        if accelerator.distributed_type == "MULTI_GPU":
+            sampler = DistributedSampler(
+                dataset,
+                num_replicas=accelerator.num_processes,
+                rank=accelerator.process_index,
+                shuffle=False,
+            )
+        dataloader = DataLoader(
+            dataset, 
+            batch_size=bs, 
+            collate_fn=lambda x: x, 
+            num_workers=num_of_workers,
+            sampler=sampler,
+            drop_last=False,
+        )
+        res_data = []
+        for batch_ in tqdm(dataloader, disable=not accelerator.is_main_process):
+            new_batch = deepcopy(batch_)
+            batch = collate_fn(batch_, tokenizer, separator)
+            input_ids = batch['input_ids'].to(accelerator.device)
+            labels = batch['labels']
+            score_ids = batch['score_ids']
+            with accelerator.autocast(), torch.no_grad():
+                outputs = model(input_ids)
+                logits = outputs.logits
+            for i, score_id in enumerate(score_ids):
+                label = labels[i]
+                pred = torch.argmax(logits[i, score_id], dim=-1)
+                prediction_step = find_first_zero(pred)
+                new_batch[i]['prediction'] = prediction_step
+                new_batch[i]['match'] = prediction_step == label
+            res_data.extend(new_batch)
+        accelerator.wait_for_everyone()
+        gathered_data = gather_objects(res_data, accelerator)
+        if accelerator.is_main_process:
+            data1 = [e for e in gathered_data if e['label'] != -1]
+            data2 = [e for e in gathered_data if e['label'] == -1]
+            # dataset length check
+            if len(data1) != num[0]:
+                print(f'{config} error num mismatch: {len(data1)} != {num[0]}')
+            if len(data2) != num[1]:
+                print(f'{config} correct num mismatch: {len(data2)} != {num[1]}')
+            with open(f'{save_dir}/{config}_error.jsonl', 'w') as f:
+                for e in data1:
+                    f.write(json.dumps(e) + '\n')
+            with open(f'{save_dir}/{config}_correct.jsonl', 'w') as f:
+                for e in data2:
+                    f.write(json.dumps(e) + '\n')
+            acc1 = np.mean([e['match'] for e in data1]) * 100
+            acc2 = np.mean([e['match'] for e in data2]) * 100
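+            # Harmonic mean of the two accuracies; guard against both being zero
+            f1 = 2 * acc1 * acc2 / (acc1 + acc2) if (acc1 + acc2) > 0 else 0.0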
+            print(f'{config} error acc: {acc1:.1f}, correct acc: {acc2:.1f}, f1: {f1:.1f}')
+            all_f1_scores.append(f1)
+    if accelerator.is_main_process:
+        print(f'ProcessBench. Average F1: {np.mean(all_f1_scores):.1f}')
+    if accelerator.distributed_type == "MULTI_GPU":
+        import torch.distributed as dist
+        dist.destroy_process_group()
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-m", "--model", type=str, required=True)
+    parser.add_argument("-b", "--batch_size", type=int, default=24)
+    parser.add_argument("-w", "--num_of_workers", type=int, default=4)
+    parser.add_argument("-s", "--separator", type=str, default="\n", help="It's important to use the same separator as the one used during TRL training")
+    args = parser.parse_args()
+    set_seed(42)
+    os.environ["TOKENIZERS_PARALLELISM"] = "false"
+    main(args)
\ No newline at end of file
--- a/axolotl-cookbook/prm/prm.yaml
+++ b/axolotl-cookbook/prm/prm.yaml
+base_model: Qwen/Qwen2.5-Math-7B-Instruct
+model_type: AutoModelForTokenClassification
+num_labels: 2
+tokenizer_type: AutoTokenizer
+load_in_8bit: false
+load_in_4bit: false
+strict: false
+liger_rms_norm: true
+liger_glu_activation: true
+process_reward_model: true
+chat_template:
+datasets:
+  - path: axolotl-ai-co/prm800k_phase_2
+    type: stepwise_supervised
+    step_separator: "\n\n"
+    max_completion_length:
+    train_on_last_step_only: false
+    split: train
+  - path: axolotl-ai-co/prm800k_phase_1
+    type: stepwise_supervised
+    step_separator: "\n\n"
+    max_completion_length:
+    train_on_last_step_only: false
+    split: train
+output_dir: ./outputs/out
+remove_unused_columns: false
+sequence_len: 4096
+sample_packing: false
+eval_sample_packing: false
+pad_to_sequence_len: true
+wandb_project:
+wandb_entity:
+wandb_watch:
+wandb_name:
+wandb_log_model:
+gradient_accumulation_steps: 1
+micro_batch_size: 8
+eval_batch_size: 64
+num_epochs: 1
+optimizer: adamw_torch_fused
+lr_scheduler: cosine
+learning_rate: 5e-5
+gradient_checkpointing: true
+gradient_checkpointing_kwargs:
+  use_reentrant: false
+train_on_inputs: false
+group_by_length: false
+bf16: auto
+fp16:
+tf32: true
+early_stopping_patience:
+resume_from_checkpoint:
+local_rank:
+logging_steps: 1
+xformers_attention:
+flash_attention: true
+torch_compile: true
+warmup_ratio: 0.1
+evals_per_epoch: 10
+eval_table_size:
+eval_max_new_tokens: 128
+saves_per_epoch: 10
+debug:
+deepspeed: deepspeed_configs/zero1.json
+weight_decay: 0.0
+fsdp:
+fsdp_config:
+special_tokens: