base_model: NousResearch/Meta-Llama-3.1-8B
load_in_4bit: true
strict: false
chat_template: llama3
datasets:
  - path: winglian/pirate-ultrachat-10k
    type: chat_template
    message_field_role: role
    message_field_content: content
dataset_prepared_path: last_run_prepared
val_set_size: 0.005
output_dir: ./outputs/lora-out
sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
adapter: qlora
lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_modules_to_save:
  - embed_tokens
  - lm_head
peft_use_dora: true
wandb_project: pirate-ultrachat-llama31
wandb_entity: axolotl-ai
gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
bf16: true
tf32: true
gradient_checkpointing: true
logging_steps: 1
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0
deepspeed: deepspeed_configs/zero2.json
special_tokens:
  pad_token: "<|finetune_right_pad_id|>"
# Training an R1-style Large Language Model using GRPO w/ Axolotl
In this example, we will demonstrate how to train Qwen 2.5 1.5B Instruct to reason using R1-style post-training with
the GRPO technique.
## Installation
To get started, you'll need to install `vllm==0.7.2` and both `trl` and `axolotl` from source, as GRPO support is still in flux.
```bash
pip install --no-build-isolation "axolotl[vllm] @ git+https://github.com/axolotl-ai-cloud/axolotl.git@grpo"
pip uninstall -y trl
pip install --no-deps "trl @ git+https://github.com/huggingface/trl.git@main"
```
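A quick way to sanity-check the pinned versions (both packages expose `__version__`):
```bash
python -c "import vllm, trl; print(vllm.__version__, trl.__version__)"
```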
## Knowing your dataset
For this example, we're using the GSM8K dataset, which has been a popular choice in the community for experimenting
with R1-style learning. Using the provided `stats.py`, we first analyze the dataset to determine the maximum tokenized
length of the system prompt + question. We find that for the Qwen 2.5 tokenizer, the maximum combined input length is
263; for the Llama 3.x series, it is 266. Combining this with the constraint that we only want the model to generate
up to 512 tokens, we can set our sequence length to ~800.
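If you want to reproduce the measurement yourself, the script can be run directly (a minimal sketch, assuming `stats.py` is in the current directory and its dependencies, `datasets`, `transformers`, and `numpy`, are installed):
```bash
python stats.py
# Max input length: 263
```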
## Reward functions
There has already been quite a lot of experimentation in the AI/ML community with the GSM8K dataset, so we're going to
use some existing reward functions from
[@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) that have shown promise. We've
corrected some of the logic in these reward functions, such as the loose formatting function. A sketch of how these
functions are invoked follows the list below.
1. `int_reward_func` - As all GSM8K answers should be integers, this provides a small reward to push the model to respond with integers only.
2. `correctness_reward_func` - This checks that the answer within the `<answer></answer>` tags exactly matches the label.
3. `strict_format_reward_func` - This validates that the structure of the response exactly matches the format requested in the prompt, including newlines.
4. `soft_format_reward_func` - This also checks the structure, but doesn't penalize the model if the newlines don't match exactly.
5. `xmlcount_reward_func` - This rewards the model for having exactly one of each of the XML tags in the structured response.
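Each reward function receives the batch of rollouts as lists of chat messages, along with any dataset columns (such as `answer`) as keyword arguments, and returns one score per rollout. Here's a minimal sketch of the call signature, using a hand-written toy completion rather than a real rollout:
```python
from gsm8k_grpo import correctness_reward_func

completions = [
    [{"role": "assistant", "content": "<reasoning>\n3 + 4 = 7\n</reasoning>\n<answer>\n7\n</answer>"}],
]
answer = ["7"]  # the `answer` column produced by axo_gsm8k_transform

print(correctness_reward_func(completions, answer))  # [2.0]
```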
## GPU choice
With Qwen 1.5B, you'll need two GPUs: one for training and one for inference with vLLM. For the GSM8K dataset with
16x rollouts, these fit on GPUs with ~48GB VRAM, so we chose 2x L40S. While we could have chosen 8x rollouts
in order to fit both the trainer and the vLLM model on the same GPU, we opted to stick with 16x rollouts, similar to
the original paper.
## Training
For inference, TRL loads vLLM onto the last GPU, so when we start the trainer we need to tell accelerate to use `N - 1`
GPUs or processes for training, e.g. `--num-processes=1` in our case.
Putting it all together, we have the `gsm8k_grpo.py` module providing the reward functions, and we've configured the
hyperparameters in `gsm8k.yaml`. To train the model, we simply need to execute:
```bash
axolotl train --num-processes=1 gsm8k.yaml
```
## Results
You can find the training metrics here: https://wandb.ai/axolotl-ai/gsm8k-grpo-r1
![wandb.png](assets/wandb.png)
The training takes about 6 hours to complete, at a commodity GPU cost of ~\$12-\$24 depending on your compute provider.
We can see from the metrics that the model quickly learns the loose formatting reward, but doesn't quite nail down the
strict formatting with newlines. The correctness reward plateaus at around ~1.4, which corresponds to roughly 70%
accuracy, since the maximum correctness reward is 2.0.
## Training with Modal
But what if you don't have two GPUs with 48GB of memory sitting around ready to run GRPO training?
`axolotl` makes it easy to run training jobs on GPUs in the cloud via the [`--cloud` flag](https://axolotl-ai-cloud.github.io/axolotl/docs/cli.html#remote-compute-with-modal-cloud). Using the [Modal](https://modal.com) serverless platform, this flag will handle the entire process of cloud deployment (and tear-down) for your training run, according to a provided [`cloud.yaml` file](https://axolotl-ai-cloud.github.io/axolotl/docs/cli.html#cloud-configuration-options). The GPUs will be automatically deallocated when training finishes and you will only be charged for the time your code spends running on GPUs. See [Modal's pricing page](https://modal.com/pricing) for rates.
If you have a Modal account, you can go straight to installing the CLI tools. Otherwise, you will need to [sign up
for Modal](https://www.modal.com?utm_source=github&utm_medium=github&utm_campaign=axolotl). Modal provides $30 in free
credits each month.
### CLI
First, install the `modal` and `axolotl` CLIs locally with:
```bash
pip install modal
pip install "axolotl @ git+https://github.com/axolotl-ai-cloud/axolotl.git@grpo"
```
You can then authenticate to Modal with:
```bash
modal token new
```
### Training
In the provided `cloud.yaml`, we start Modal with 2 GPUs. Because we reserve one GPU for generations with vLLM,
we need to set `--num-processes` to `N - 1`.
```bash
axolotl train --num-processes=1 gsm8k.yaml --cloud cloud.yaml
```
### Single GPU
To train on a single L40S GPU, we can make a few adjustments, such as decreasing the number of generations and
reducing vLLM's footprint so the trainer can use the remainder of the VRAM. In this configuration, training still
takes ~6 hours at a commodity cost of \$6-\$12.
```yaml
trl:
  vllm_device: "cuda:0" # force vllm to the only GPU instead of the "next GPU"
  vllm_gpu_memory_utilization: 0.2 # reduce the footprint of vllm to use ~9GB VRAM
  num_generations: 8
micro_batch_size: 8 # decrease the per-GPU batch size to match the number of generations per prompt
```
If you're using Modal to launch your post-training, you'll want to update the `cloud.yaml` appropriately as well.
```yaml
gpu_count: 1
```
You can find the results of our experiment with single GPU post-training here:
https://wandb.ai/axolotl-ai/gsm8k-grpo-r1/runs/w3gpyhj2
project_name:
volumes:
  - name: axolotl-data
    mount: /workspace/data
  - name: axolotl-artifacts
    mount: /workspace/artifacts
# environment variables from local to set as secrets
secrets:
  - HF_TOKEN
  - WANDB_API_KEY
# Which branch of axolotl to use remotely
branch: grpo
spawn:
dockerfile_commands:
  - 'RUN pip install "vllm==0.7.2"'
  - 'RUN pip uninstall -y trl && pip install --no-deps "trl @ git+https://github.com/huggingface/trl.git@main"'
gpu: L40S
gpu_count: 2
# Train specific configurations
memory: 128
timeout: 86400
# Preprocess specific configurations
memory_preprocess: 32
timeout_preprocess: 14400
base_model: Qwen/Qwen2.5-1.5B-Instruct
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name
load_in_8bit: false
load_in_4bit: false
strict: false
torch_compile: true
rl: grpo
trl:
  beta: 0.001
  max_completion_length: 512
  use_vllm: true
  reward_funcs:
    - gsm8k_grpo.correctness_reward_func
    - gsm8k_grpo.int_reward_func
    - gsm8k_grpo.strict_format_reward_func
    - gsm8k_grpo.soft_format_reward_func
    - gsm8k_grpo.xmlcount_reward_func
  vllm_gpu_memory_utilization: 0.9
  vllm_max_model_len: 800
  num_generations: 16
chat_template: qwen_25
datasets:
  - path: skrishna/gsm8k_only_answer
    type: gsm8k_grpo.axo_gsm8k_transform
dataset_prepared_path: /workspace/data/last_run_prepared
skip_prepare_dataset: true
val_set_size: 0.0
output_dir: /workspace/data/axolotl-artifacts/r1-outputs
dataloader_prefetch_factor: 32
dataloader_num_workers: 2
dataloader_pin_memory: true
gc_steps: 1
sequence_len: 800
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: false
wandb_project: gsm8k-grpo-r1
wandb_entity:
wandb_name:
gradient_accumulation_steps: 8
micro_batch_size: 16 # should match num_generations / num_gpus
num_epochs: 1
optimizer: adamw_torch_fused
lr_scheduler: constant_with_warmup
learning_rate: 1.0e-6
max_grad_norm: 1.0
weight_decay: 0.1
bf16: true
tf32: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true
logging_steps: 1
warmup_steps: 100
evals_per_epoch: 1
saves_per_epoch: 4
# adapted from https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb
import re
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
def axo_gsm8k_transform(cfg, *args, **kwargs):
    # skrishna/gsm8k_only_answer dataset
    def transform_fn(example, tokenizer=None):
        label = example["label"].replace(",", "")  # remove commas, e.g. thousands separators
        return {
            "prompt": [
                # improves adherence to the system prompt by having it in the user context
                {"role": "user", "content": SYSTEM_PROMPT + "\n\n" + example["text"]},
            ],
            "answer": label,
        }

    return transform_fn, {"remove_columns": ["text", "label"]}


def extract_xml_answer(text: str) -> str:
    # collect the answer between the last <answer></answer> tag
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


# Reward functions
def correctness_reward_func(completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


def int_reward_func(completions, **kwargs) -> list[float]:
    """
    gsm8k answers are ints, so rewarding for ints should help steer the model performance
    """
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]


def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s?$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward function that loosely checks if the completion has a specific format,
    without penalizing adherence to newlines.
    """
    pattern = r"<reasoning>.*?</reasoning>\s?<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, re.S) for r in responses]
    return [0.25 if match else 0.0 for match in matches]


def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>") == 1:
        count += 0.125
    if text.count("</reasoning>") == 1:
        count += 0.125
    if text.count("<answer>") == 1:
        count += 0.125
    if text.count("</answer>") == 1:
        count += 0.125
    # penalize extra tokens after the answer tag
    count -= (len(text.split("</answer>")[-1]) - 1) * 0.001
    return count


def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward function for having exactly one of each <reasoning>, </reasoning>, <answer>, and </answer> tag.
    """
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
from functools import partial
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
def get_tokenization_stats(example, tokenizer=None):
    messages = {
        "prompt": [
            # improves adherence to the system prompt by having it in the user context
            {"role": "user", "content": SYSTEM_PROMPT + "\n\n" + example["text"]},
        ],
    }
    inputs = tokenizer.apply_chat_template(
        messages["prompt"], tokenize=True, add_generation_prompt=True
    )
    return {
        "input_ids": inputs,
    }


def get_dataset_lengths(dataset):
    input_ids = dataset.data.column("input_ids")
    lengths = np.vectorize(len)(np.array(input_ids, dtype=object))
    return lengths


def main():
    ds = load_dataset("skrishna/gsm8k_only_answer", split="train")
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
    stats = partial(get_tokenization_stats, tokenizer=tokenizer)
    ds = ds.map(stats, remove_columns=["text", "label"])
    max_input_len = np.max(get_dataset_lengths(ds))
    print(f"Max input length: {max_input_len}")


if __name__ == "__main__":
    main()
base_model: $PATH_STORAGE/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-405B/snapshots/222de096204587406c7cadb3e0a101aade116279
tokenizer_type: AutoTokenizer
strict: false
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
chat_template: llama3
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train
dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
val_set_size: 0.0
output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/llama3_1-405b-finetome
save_safetensors: false # saving final sharded dict may not work with safetensors
wandb_project: llama-3.1-405b-fft-finetome
wandb_entity:
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 1.0e-5
train_on_inputs: false
group_by_length: false
bf16: true
tf32: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 1
flash_attention: true
warmup_steps: 100
saves_per_epoch: 1
weight_decay: 0.1
fsdp_final_state_dict_type: SHARDED_STATE_DICT
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
  pad_token: <|finetune_right_pad_id|>
  eos_token: <|eot_id|>
base_model: meta-llama/Meta-Llama-3.1-405B
tokenizer_type: AutoTokenizer
load_in_4bit: true
strict: false
chat_template: llama3
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train
dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
val_set_size: 0.0
output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/qlora-llama3_1-405b-finetome
save_safetensors: true
adapter: qlora
wandb_project: qlora-405b-finetome
wandb_entity: oaaic
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00001
train_on_inputs: false
group_by_length: false
bf16: true
tf32: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
logging_steps: 1
flash_attention: true
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
  pad_token: <|finetune_right_pad_id|>
base_model: meta-llama/Meta-Llama-3.1-70B
tokenizer_type: AutoTokenizer
strict: false
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
chat_template: llama3
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train
dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
val_set_size: 0.0
output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/llama3_1-70b-finetome
save_safetensors: false # saving final sharded dict may not work with safetensors
wandb_project: llama-3.1-70b-fft-finetome
wandb_entity:
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 1
# An 8-GPU node can support a batch size of up to 3
micro_batch_size: 3
num_epochs: 2
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 3.0e-5
train_on_inputs: false
group_by_length: false
bf16: true
tf32: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 1
flash_attention: true
warmup_steps: 30
saves_per_epoch: 1
weight_decay: 0.1
fsdp_final_state_dict_type: SHARDED_STATE_DICT
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_backward_prefetch: BACKWARD_PRE
special_tokens:
  pad_token: <|finetune_right_pad_id|>
  eos_token: <|eot_id|>
base_model: meta-llama/Meta-Llama-3.1-8B
tokenizer_type: AutoTokenizer
strict: false
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
chat_template: llama3
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train
dataset_prepared_path: $PATH_STORAGE/axolotl/last_run_prepared
val_set_size: 0.0
output_dir: $PATH_STORAGE/axolotl-artifacts/outputs/llama3_1-8b-finetome
save_safetensors: false # saving final sharded dict may not work with safetensors
wandb_project: llama-3.1-8b-fft
wandb_entity:
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 1.0e-6
train_on_inputs: false
group_by_length: false
bf16: true
tf32: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 1
flash_attention: true
warmup_steps: 100
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.1
fsdp_final_state_dict_type: SHARDED_STATE_DICT
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
  pad_token: <|finetune_right_pad_id|>
  eos_token: <|eot_id|>
#!/bin/bash
# Check if the argument is provided
if [ -z "$1" ]; then
echo "Usage: $0 <path_storage>"
exit 1
fi
# Exit immediately if a command exits with a non-zero status
set -e
PATH_STORAGE=$1
# Check if Miniconda is already installed
if [ ! -d "/home/ubuntu/miniconda3" ]; then
# Download and install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p /home/ubuntu/miniconda3 -u
/home/ubuntu/miniconda3/bin/conda init bash
echo 'export PATH="/home/ubuntu/miniconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
else
/home/ubuntu/miniconda3/bin/conda init bash
echo 'export PATH="/home/ubuntu/miniconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
fi
# Configure Conda and create PyTorch environment
/home/ubuntu/miniconda3/bin/conda remove -n pytorch --all -y
/home/ubuntu/miniconda3/bin/conda install python=3.11 -y
/home/ubuntu/miniconda3/bin/conda install -c conda-forge libstdcxx-ng -y
/home/ubuntu/miniconda3/bin/conda create -n pytorch python=3.11 -y
echo 'conda activate pytorch' >> ~/.bashrc
# Install PyTorch
/home/ubuntu/miniconda3/bin/conda run -n pytorch conda install pytorch==2.3.1 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Verify PyTorch installation
/home/ubuntu/miniconda3/bin/conda run -n pytorch python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
# Add Git LFS repository and install
sudo add-apt-repository -y ppa:git-core/ppa
sudo apt-get update
sudo apt-get install -y git-lfs
git lfs install --skip-repo
# Install awscli and pydantic in PyTorch environment
/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install awscli==1.33.13 packaging
/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install -U --no-cache-dir pydantic==1.10.10
# Install system packages
sudo apt-get install -y vim curl nano rsync s3fs net-tools nvtop infiniband-diags pdsh libaio-dev
# Add ubuntu user to root group
sudo usermod -aG root ubuntu
mkdir -p ${PATH_STORAGE}/axolotl-artifacts/{configs,outputs}
# Clone axolotl repository
CLONE_DIR=${PATH_STORAGE}/axolotl
if [ ! -d "$CLONE_DIR" ]; then
git clone --single-branch https://github.com/axolotl-ai-cloud/axolotl.git $CLONE_DIR
fi
# Install causal_conv1d and axolotl in PyTorch environment
export PATH="/home/ubuntu/miniconda3/envs/pytorch/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/home/ubuntu/.local/bin:/usr/lib/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
export LD_LIBRARY_PATH="/usr/lib/cuda/lib64"
export CUDA_HOME="/usr/lib/nvidia-cuda-toolkit"
/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install causal_conv1d
/home/ubuntu/miniconda3/bin/conda run -n pytorch pip install --no-cache -e $CLONE_DIR[deepspeed,flash-attn,optimizers]
# Configure git credential helper
git config --global credential.helper store
# Install and configure tmux
sudo apt-get install -y tmux
echo '# Run tmux only in interactive shells
if [[ $- == *i* ]] && [[ -z "$TMUX" ]]; then
    tmux attach-session -t ssh_tmux || tmux new-session -s ssh_tmux
fi' >> ~/.bashrc
# Create huggingface cache directory and set environment variables
mkdir -p ${PATH_STORAGE}/.cache/huggingface
echo "export HF_HOME=\"${PATH_STORAGE}/.cache/huggingface\"" >> ~/.bashrc
echo 'export HF_HUB_ENABLE_HF_TRANSFER="1"' >> ~/.bashrc
echo 'export PATH="/home/ubuntu/miniconda3/envs/pytorch/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/home/ubuntu/.local/bin:/usr/lib/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/lib/cuda/lib64"' >> ~/.bashrc
echo 'export CUDA_HOME="/usr/lib/nvidia-cuda-toolkit"' >> ~/.bashrc
# default .profile adds ~/.local/bin which breaks pip/python
sed -i '/if \[ -d "\$HOME\/\.local\/bin" \] ; then/,/fi/s/^/#/' ~/.profile
#!/bin/bash
# Check if the argument is provided
if [ -z "$1" ] || [ -z "$2" ]; then
echo "Usage: $0 <path_storage> <main_node>"
exit 1
fi
PATH_STORAGE=$1
MAIN_NODE=$2
NUM_NODES=8
JOB_ID=axolotl-lambda
# You may need to change the MAIN_NODE to the resolved ip address
MAIN_NODE=${MAIN_NODE}:29500
YAML_CFG=${PATH_STORAGE}/axolotl-cookbook/lambda/configs/llama-3_1-405b-fft.yaml
export NODE_IDX=$((10#$(hostname | grep -oE '[0-9]+$') - 1))
/home/ubuntu/miniconda3/envs/pytorch/bin/torchrun --nnodes=$NUM_NODES --nproc-per-node=8 --node-rank=$NODE_IDX --rdzv-backend=c10d --rdzv-id=$JOB_ID --rdzv-endpoint=$MAIN_NODE -m axolotl.cli.train $YAML_CFG
# Process Reward Modelling with axolotl
This cookbook accompanies our [Training Process Reward Models in axolotl](https://axolotlai.substack.com/p/process-reward-models) blog post, and provides details on reproducing the evaluation results for [axolotl-ai-co/Qwen2.5-Math-PRM-7B](https://huggingface.co/axolotl-ai-co/Qwen2.5-Math-PRM-7B).
### ProcessBench
```bash
torchrun --nproc_per_node=4 eval_process_bench.py --model axolotl-ai-co/Qwen2.5-Math-PRM-7B -b 24 -w 4 -s "\n\n"
GSM8K:
err corr F1
----- ------ ----
55.5 98.4 71.0
MATH:
err corr F1
----- ------ ----
49.8 91.9 64.6
OlympiadBench:
err corr F1
----- ------ ----
31.2 87.3 46.0
Omni-MATH:
err corr F1
----- ------ ----
24.6 87.1 38.3
Average F1 across datasets: 55.0
```
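The F1 reported for each split is the harmonic mean of the accuracy on erroneous samples (`err`) and the accuracy on fully-correct samples (`corr`), as computed in `eval_process_bench.py`. A quick check against the GSM8K numbers above:
```python
def process_bench_f1(err_acc: float, corr_acc: float) -> float:
    # harmonic mean of error-detection accuracy and correct-sample accuracy
    return 2 * err_acc * corr_acc / (err_acc + corr_acc)

print(round(process_bench_f1(55.5, 98.4), 1))  # 71.0
```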
### Best of N
```bash
python bon.py --base_model Qwen/Qwen2.5-1.5B-Instruct --prm_model axolotl-ai-co/Qwen2.5-Math-PRM-7B --n 16
```
Example outputs can be seen in `bon_qwen1.5B-instruct_n=16_results.txt` and `bon_qwen1.5B-instruct_n=1_results.txt`, for `n=16` and `n=1` respectively.
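Under the hood, `bon.py` reduces each candidate's per-step PRM scores to a single scalar (the product, by default) and keeps the highest-scoring candidate. A minimal sketch of that selection step, with illustrative scores:
```python
import math

import numpy as np

# candidate_step_scores[i][j] = PRM score for step j of candidate i (illustrative numbers)
candidate_step_scores = [[0.9, 0.8], [0.99, 0.97], [0.6, 0.95]]
agg = [math.prod(steps) for steps in candidate_step_scores]  # "prod" aggregation
best = int(np.argmax(agg))
print(best)  # prints 1, the candidate whose steps the PRM scored most highly
```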
# credit to the HF team
# https://github.com/huggingface/search-and-learn/blob/main/src/sal/search/best_of_n.py
import argparse
import gc
import math
import os
import random
import numpy as np
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)
from tqdm import tqdm
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


SYSTEM_PROMPT: str = (
    "Solve the following math problem efficiently and clearly:\n\n"
    "- For simple problems (2 steps or fewer):\n"
    "Provide a concise solution with minimal explanation.\n\n"
    "- For complex problems (3 steps or more):\n"
    "Use this step-by-step format:\n\n"
    "## Step 1: [Concise description]\n"
    "[Brief explanation and calculations]\n\n"
    "## Step 2: [Concise description]\n"
    "[Brief explanation and calculations]\n\n"
    "...\n\n"
    "Regardless of the approach, always conclude with:\n\n"
    "Therefore, the final answer is: $\\boxed{answer}$. I hope it is correct.\n\n"
    "Where [answer] is just the final number or expression that solves the problem."
)


def find_first_zero(tensor):
    zeros = (tensor == 0).nonzero()
    return zeros[0].item() if zeros.numel() > 0 else -1
def score(
    prm_model,
    questions: list[str],
    outputs: list[list[str]],
    step_separator: str,
    tokenizer: AutoTokenizer,
) -> list[list[float]]:
    """
    This function scores a list of questions and their completions using the PRM model.
    """
    inputs_for_prm = []
    separator_ids = tokenizer.encode(step_separator, add_special_tokens=False, return_tensors="pt")
    score_idxs = []
    for question, responses in zip(questions, outputs):
        prompt_ids = tokenizer.encode(
            question + "\n",
            add_special_tokens=False,
            return_tensors="pt",
        )
        score_idxs.append([])
        for response in responses:
            steps = response.split("\n\n")
            score_idxs[-1].append([])
            for step in steps:
                step_ids = tokenizer.encode(step + "\n\n", add_special_tokens=False, return_tensors="pt")
                prompt_ids = torch.cat([prompt_ids, step_ids, separator_ids], dim=-1)
                score_idxs[-1][-1].append(prompt_ids.size(-1) - 1)
            inputs_for_prm.append(prompt_ids)
    # right pad input_ids
    pad_token_id = tokenizer.pad_token_id
    max_len = max([i.size(-1) for i in inputs_for_prm])
    for i, input_idx in enumerate(inputs_for_prm):
        inputs_for_prm[i] = torch.cat(
            [
                input_idx.squeeze(),
                torch.LongTensor([pad_token_id] * (max_len - input_idx.size(-1))),
            ]
        )
    inputs_for_prm = torch.stack(inputs_for_prm).to(torch.long).to(prm_model.device)
    with torch.no_grad():
        batch_size = 4
        all_probs = []
        for i in range(0, inputs_for_prm.size(0), batch_size):
            batch = inputs_for_prm[i : i + batch_size]
            logits = prm_model(batch).logits  # Shape: [batch, seq_len, 2]
            # Get probability of positive class (index 1)
            batch_probs = torch.softmax(logits, dim=-1)[:, :, 1].cpu()  # Shape: [batch, seq_len]
            all_probs.append(batch_probs)
            del logits
        probs = torch.cat(all_probs, dim=0)  # Combine all batches
    output_scores = []
    current_idx = 0
    for question_scores in score_idxs:
        num_completions = len(question_scores)
        question_output = []
        for i in range(num_completions):
            score_positions = question_scores[i]
            # Just get the scores at the specified positions
            score_value = probs[current_idx, score_positions].tolist()
            question_output.append(score_value)
            current_idx += 1
        output_scores.append(question_output)
    del inputs_for_prm
    torch.cuda.empty_cache()
    return output_scores
def aggregate_scores(scores: list[float], agg_strategy: str = "prod") -> float:
    if agg_strategy == "min":
        return min(scores)
    elif agg_strategy == "prod":
        return math.prod(scores)
    elif agg_strategy == "last":
        return scores[-1]
    else:
        raise ValueError(f"Invalid aggregation strategy: {agg_strategy}")
def main(args):
    llm = LLM(
        model=args.base_model,
        enable_prefix_caching=True,
        seed=42,
        tensor_parallel_size=args.num_gpus,
        gpu_memory_utilization=0.3,
    )
    tokenizer = llm.get_tokenizer()
    # example problems from the MATH-500 dataset https://huggingface.co/datasets/HuggingFaceH4/MATH-500
    x = {
        "problem": [
            r"Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$",
            r"A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?",
            r"The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).",
        ]
    }
    answers = ["p - q", "42", "4"]
    convs = [
        [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ]
        for prompt in x["problem"]
    ]
    # tokenizer.chat_template = CUSTOM_CHAT_TEMPLATE
    templated_convs = tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
    # Duplicate convs to generate config.n completions per prompt so we can do continuous batching
    # This makes [p1, p2, p3, p4] become [p1, p1, p2, p2, p3, p3, p4, p4] for e.g. config.n=2
    templated_convs = [c for conv in templated_convs for c in [conv] * args.n]
    # Initialize empty lists for completions and completion tokens
    completions = [[] for _ in range(len(x["problem"]))]
    sampling_params = SamplingParams(
        temperature=0.8,
        max_tokens=2048,
        top_p=1.0,
        n=1,  # Since we've already duplicated the prompt_token_ids, we only need to generate 1 completion per prompt
    )
    responses = llm.generate(
        templated_convs,
        sampling_params=sampling_params,
        use_tqdm=True,
    )
    if len(responses) != len(x["problem"]) * args.n:
        raise ValueError(f"Generated {len(responses)} responses instead of {len(x['problem']) * args.n}")
    for i in range(len(completions)):
        completions[i] = [output.text for r in responses[i * args.n : (i + 1) * args.n] for output in r.outputs]
    # Check we generated the correct number of completions for each prompt
    for c in completions:
        if len(c) != args.n:
            raise ValueError(f"Generated {len(c)} completions instead of {args.n}")
    # destroy vllm process
    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor.driver_worker
    del llm
    gc.collect()
    torch.cuda.empty_cache()
    # completions is now a List[List[str]] of size NUM_PROMPTS [N]
    if args.n > 1:
        with torch.device("cuda"):
            prm_model = AutoModelForTokenClassification.from_pretrained(args.prm_model).to(torch.bfloat16)
        tokenizer = AutoTokenizer.from_pretrained(args.prm_model)
        prm_model.eval()
        scores = score(prm_model, x["problem"], completions, args.separator, tokenizer)
        agg_scores = [[aggregate_scores(s, agg_strategy="prod") for s in question_scores] for question_scores in scores]
        # Select the completion with the highest score
        pred = [completion[np.argmax(s)] for completion, s in zip(completions, agg_scores)]
    else:
        pred = [c[0] for c in completions]
    with open("bon_results.txt", "w") as f:
        for i in range(len(x["problem"])):
            separator = "=" * 80
            print(separator)
            print("Problem: ", x["problem"][i])
            print("Predicted answer (BoN): ", pred[i])
            print("Correct Answer: ", answers[i])
            # Write to file
            f.write(f"{separator}\n")
            f.write(f"Problem: {x['problem'][i]}\n")
            f.write(f"Predicted answer (BoN): {pred[i]}\n")
            f.write(f"Correct Answer: {answers[i]}\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model", type=str)
    parser.add_argument("--prm_model", type=str)
    parser.add_argument("--num_gpus", type=int, default=2)
    parser.add_argument(
        "--separator",
        type=str,
        default="\n\n",
        help="It's important to use the same separator as the one used during TRL training",
    )
    parser.add_argument("--n", type=int, default=8)
    args = parser.parse_args()
    set_seed(42)
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    main(args)
================================================================================
Problem: Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$
Predicted answer (BoN): To solve the given problem, we need to express the double sum \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\).
First, let's rewrite the double sum by changing the order of summation. We start by summing over \(m = j + k\). For a fixed \(m\), \(j\) and \(k\) can take any values from 1 to \(\infty\) such that \(j + k = m\).
The number of pairs \((j, k)\) that satisfy \(j + k = m\) is \(m-1\). Therefore, we can express the sum as:
\[
\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{m=1}^\infty \frac{m-1}{m^3}.
\]
To evaluate this sum, we can split it into two separate sums:
\[
\sum_{m=1}^\infty \frac{m-1}{m^3} = \sum_{m=1}^\infty \left( \frac{m}{m^3} - \frac{1}{m^3} \right) = \sum_{m=1}^\infty \left( \frac{1}{m^2} - \frac{1}{m^3} \right).
\]
This can be further simplified into two separate sums:
\[
\sum_{m=1}^\infty \frac{1}{m^2} - \sum_{m=1}^\infty \frac{1}{m^3} = p - q.
\]
Thus, the original sum can be expressed in terms of \(p\) and \(q\) as:
\[
\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = p - q.
\]
Therefore, the final answer is:
\[
\boxed{p - q}.
\]
Correct Answer: p - q
================================================================================
Problem: A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?
Predicted answer (BoN): ## Step 1:**
First, let's understand that a regular hexagon can be divided into six equilateral triangles. The perimeter of one of these triangles is given as 21 inches. Since all sides of an equilateral triangle are equal, each side of the triangle must be 21 inches divided by 3, as it's equal to the perimeter divided by 3.
**Calculation for one side of the triangle:**
\[ \text{Side length of the triangle} = \frac{21}{3} = 7 \text{ inches} \]
## Step 2:**
Now, since the hexagon is made up of these equilateral triangles, each side of the hexagon will be equal to the side length of one of these triangles. Therefore, the perimeter of the hexagon, which is the sum of all its sides, will be six times the side length of one of the triangles.
**Calculation for the perimeter of the hexagon:**
\[ \text{Perimeter of the hexagon} = 6 \times 7 \]
## Step 3:**
Perform the multiplication to find the final perimeter.
\[ \text{Perimeter of the hexagon} = 6 \times 7 = 42 \text{ inches} \]
Therefore, the final answer is: $\boxed{42}$.
Correct Answer: 42
================================================================================
Problem: The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).
Predicted answer (BoN): To solve the problem of determining how many different values can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses, we need to consider the different ways we can group the terms using parentheses. Since multiplication is carried out before addition, we need to consider all possible ways we can group the numbers \(2\), \(3\), \(4\), and \(5\) using parentheses.
First, let's list all possible combinations of parentheses groups. Each group will have a different number of terms inside them. The possible groups are:
1. \(2, 3, 4, 5\)
2. \(2, 3 \cdot 4, 5\)
3. \(2, 3 \cdot (4 \cdot 5)\)
4. \(2 \cdot 3, 4, 5\)
5. \(2 \cdot 3 \cdot 4, 5\)
6. \(2 \cdot 3 \cdot (4 \cdot 5)\)
7. \(2, (3 \cdot 4) \cdot (5 + 1)\)
8. \(2 \cdot (3 \cdot 4), 5 + 1\)
9. \(2 \cdot (3 \cdot (4 \cdot 5))\)
10. \(2 \cdot (3 \cdot (4 \cdot (5 + 1)))\)
Now, let's evaluate each of these groups:
1. \(2, 3, 4, 5\) : This evaluates to \(2 \cdot 3 \cdot 4 \cdot 5 + 1 = 121\).
2. \(2, 3 \cdot 4, 5\) : This evaluates to \(2 \cdot (3 \cdot 4) + 5 = 2 \cdot 12 + 5 = 29\).
3. \(2, 3 \cdot (4 \cdot 5)\) : This evaluates to \(2 \cdot (3 \cdot 20) + 1 = 2 \cdot 60 + 1 = 121\).
4. \(2, 3 \cdot (4 \cdot 5)\) (repeated, but valid) : This is equivalent to the third case, evaluating to 121.
5. \(2, 3 \cdot (4 \cdot 5)\) : This is equivalent to the third case, evaluating to 121.
6. \(2, (3 \cdot 4) \cdot (5 + 1)\) : This evaluates to \(2 \cdot (12 \cdot 6) + 1 = 2 \cdot 72 + 1 = 145\).
7. \(2 \cdot (3 \cdot 4), 5 + 1\) : This evaluates to \(2 \cdot (12) + 6 = 2 \cdot 12 + 6 = 30\).
8. \(2 \cdot (3 \cdot (4 \cdot 5))\) : This evaluates to \(2 \cdot (3 \cdot 20) + 1 = 2 \cdot 60 + 1 = 121\).
9. \(2 \cdot (3 \cdot (4 \cdot 5))\) (repeated, but valid) : This is equivalent to the eighth case, evaluating to 121.
10. \(2 \cdot (3 \cdot (4 \cdot (5 + 1)))\) : This evaluates to \(2 \cdot (3 \cdot (4 \cdot 6)) + 1 = 2 \cdot (3 \cdot 24) + 1 = 2 \cdot 72 + 1 = 145\).
From the above evaluations, we see that the distinct values obtained are 121, 145, and 30. Therefore, the number of different values that can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses is \(\boxed{3}\).
Correct Answer: 4
================================================================================
Problem: Define \[p = \sum_{k = 1}^ \infty \frac{1}{k^2} \quad \text{and} \quad q = \sum_{k = 1}^\infty \frac{1}{k^3}.\]Find a way to write \[\sum_{j = 1}^\infty \sum_{k = 1}^\infty \frac{1}{(j + k)^3}\] in terms of $p$ and $q.$
Predicted answer (BoN): To solve the given problem, we need to express the double sum \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\).
First, let's analyze the inner sum \(\sum_{k=1}^\infty \frac{1}{(j+k)^3}\). This sum can be written as:
\[
\sum_{k=1}^\infty \frac{1}{(j+k)^3}
\]
We can change the order of summation. Instead of summing over \(k\) first, we sum over \(j\) first. For a fixed \(j\), \(k\) ranges from 1 to \(\infty\). When \(j\) is fixed, \(k\) can be written as \(j + (n-j)\) where \(n\) is an integer. So, the sum becomes:
\[
\sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{j=1}^\infty \sum_{n=1}^\infty \frac{1}{(n)^3} = \sum_{j=1}^\infty \sum_{n=1}^\infty \frac{1}{n^3}
\]
This is exactly \(q\). So we have:
\[
\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = q
\]
Now, we need to express \(q\) in terms of \(p\) and \(q\). Recall that:
\[
p = \sum_{k=1}^\infty \frac{1}{k^2}
\]
We need to find a way to express \(q\) in terms of \(p\). Notice that:
\[
\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3} = \sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{j^3} \cdot \frac{1}{k^3} = \left( \sum_{j=1}^\infty \frac{1}{j^3} \right) \left( \sum_{k=1}^\infty \frac{1}{k^3} \right) = p \cdot q
\]
Thus, we have:
\[
q = \frac{p}{p} = \frac{p}{q}
\]
So, the expression for \(\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}\) in terms of \(p\) and \(q\) is:
\[
\boxed{q}
\]
Correct Answer: p - q
================================================================================
Problem: A regular hexagon can be divided into six equilateral triangles. If the perimeter of one of the triangles is 21 inches, what is the perimeter, in inches, of the regular hexagon?
Predicted answer (BoN): ## Step 1: ##
To solve this problem, we need to understand the relationship between the side length of the equilateral triangle and the side length of the regular hexagon it's divided into.
## Step 2: ##
Given that the perimeter of one equilateral triangle is 21 inches, we know that the perimeter of a regular hexagon composed of six such equilateral triangles is simply 6 times the perimeter of one triangle because all sides of the hexagon are equal in length and they come from equilateral triangles.
Therefore, the perimeter of the hexagon is 6 * 21.
## Step 3: ##
Calculating the perimeter of the hexagon, we get 6 * 21 = 126 inches.
## Step 4: ##
Thus, the final answer is 126 inches.
Therefore, the perimeter of the regular hexagon is $\boxed{126}$.
Correct Answer: 42
================================================================================
Problem: The expression $2\cdot 3 \cdot 4\cdot 5+1$ is equal to 121, since multiplication is carried out before addition. However, we can obtain values other than 121 for this expression if we are allowed to change it by inserting parentheses. For example, we can obtain 144 by writing \[ (2\cdot (3\cdot 4)) \cdot (5+1) = 144. \]In total, how many values can be obtained from the expression $2\cdot 3\cdot 4 \cdot 5 + 1$ by inserting parentheses? (Note that rearranging terms is not allowed, only inserting parentheses).
Predicted answer (BoN): To solve the problem of how many values can be obtained from the expression \(2 \cdot 3 \cdot 4 \cdot 5 + 1\) by inserting parentheses, we need to consider all possible ways to insert parentheses and evaluate the expression for each case.
Given the expression \( (a \cdot (b \cdot (c \cdot d) + e)) + f \), we need to evaluate it for all possible placements of parentheses.
### Step-by-Step Solution:
1. **Evaluate without parentheses:**
\[
2 \cdot 3 \cdot 4 \cdot 5 + 1 = 120 + 1 = 121
\]
2. **Evaluate with one set of parentheses:**
- \( (2 \cdot (3 \cdot 4) \cdot 5 + 1) \)
- \( 2 \cdot ( (3 \cdot 4) \cdot 5 + 1) \)
- \( (2 \cdot 3) \cdot (4 \cdot 5) + 1 \)
- \( (2 \cdot 3) \cdot ( (4 \cdot 5) + 1) \)
- \( 2 \cdot (3 \cdot (4 \cdot 5) + 1) \)
- \( 2 \cdot ((3 \cdot 4) \cdot 5) + 1 \)
3. **Evaluate with two sets of parentheses:**
- \( ((2 \cdot 3) \cdot (4 \cdot 5)) + 1 \)
- \( (2 \cdot ((3 \cdot 4) \cdot 5)) + 1 \)
- \( ((2 \cdot 3) \cdot (5 \cdot 4)) + 1 \)
- \( (2 \cdot (3 \cdot (4 \cdot 5))) + 1 \)
- \( 2 \cdot ( (3 \cdot (4 \cdot 5)) + 1) \)
- \( 2 \cdot ((3 \cdot 4) \cdot (5 + 1)) \)
- \( 2 \cdot ((3 \cdot 4) \cdot ((5 + 1)) \cdot 5) \)
- \( 2 \cdot ((3 \cdot 4) \cdot ((5 + 1) \cdot 5)) \)
- \( ((2 \cdot 3) \cdot (4 \cdot (5 + 1))) + 1 \)
- \( ((2 \cdot 3) \cdot ((4 \cdot 5) + 1)) + 1 \)
- \( ((2 \cdot 3) \cdot ((4 \cdot 5) + 1)) \cdot 5 \)
### Conclusion:
By evaluating the expression for all possible placements of parentheses, we find that there are \( 14 \) distinct values that can be obtained. Therefore, the final answer is:
\[
\boxed{14}
\]
Correct Answer: 4
# credit to the GitHub user @CJReinforce
# https://github.com/QwenLM/ProcessBench/pull/12
import argparse
import json
import os
import random
from copy import deepcopy
import numpy as np
import torch
import transformers
from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data import DataLoader, DistributedSampler
from tqdm import tqdm
def collate_fn(batch, tokenizer, separator='\n'):
    input_ids = []
    score_ids = []
    labels = []
    separator_ids = tokenizer.encode(separator, add_special_tokens=False, return_tensors='pt')
    for i in batch:
        prompt_ids = tokenizer(i['problem'], add_special_tokens=False, return_tensors='pt')['input_ids']
        score_ids.append([])
        for completion in i['steps']:
            completion_ids = tokenizer(completion, add_special_tokens=False, return_tensors='pt')['input_ids']
            prompt_ids = torch.cat([prompt_ids, completion_ids, separator_ids], dim=-1)
            score_ids[-1].append(prompt_ids.size(-1) - 1)
        labels.append(i['label'])
        input_ids.append(prompt_ids)
    # right pad input_ids
    pad_token_id = tokenizer.pad_token_id
    max_len = max([i.size(-1) for i in input_ids])
    for i, input_idx in enumerate(input_ids):
        input_ids[i] = torch.cat([
            input_idx.squeeze(),
            torch.LongTensor(
                [pad_token_id] * (max_len - input_idx.size(-1))
            )
        ])
    input_ids = torch.stack(input_ids)
    return dict(
        input_ids=input_ids,
        labels=labels,
        score_ids=score_ids
    )


def find_first_zero(tensor):
    zeros = (tensor == 0).nonzero()
    return zeros[0].item() if zeros.numel() > 0 else -1


def gather_objects(data, accelerator):
    world_size = accelerator.num_processes
    if world_size == 1:
        return data
    all_data = [None] * world_size
    torch.distributed.all_gather_object(all_data, data)
    if accelerator.is_main_process:
        result = []
        for process_data in all_data:
            result.extend(process_data)
        return result
    return None


def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def main(args):
    bs = args.batch_size
    num_of_workers = args.num_of_workers
    separator = args.separator
    model_path = args.model
    model_name = model_path.split('/')[-1]
    configs = {
        'gsm8k': [207, 193],  # error / correct num
        'math': [594, 406],
        'olympiadbench': [661, 339],
        'omnimath': [759, 241],
    }
    all_f1_scores = []
    save_dir = f'outputs/{model_name}'
    os.makedirs(save_dir, exist_ok=True)
    accelerator = Accelerator()
    print(f'Loading model from {model_path}')
    model = transformers.AutoModelForTokenClassification.from_pretrained(model_path)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
    model = accelerator.prepare(model)
    model.eval()
    for config, num in configs.items():
        dataset = load_dataset("Qwen/ProcessBench", split=config)
        sampler = None
        if accelerator.distributed_type == "MULTI_GPU":
            sampler = DistributedSampler(
                dataset,
                num_replicas=accelerator.num_processes,
                rank=accelerator.process_index,
                shuffle=False,
            )
        dataloader = DataLoader(
            dataset,
            batch_size=bs,
            collate_fn=lambda x: x,
            num_workers=num_of_workers,
            sampler=sampler,
            drop_last=False,
        )
        res_data = []
        for batch_ in tqdm(dataloader, disable=not accelerator.is_main_process):
            new_batch = deepcopy(batch_)
            batch = collate_fn(batch_, tokenizer, separator)
            input_ids = batch['input_ids'].to(accelerator.device)
            labels = batch['labels']
            score_ids = batch['score_ids']
            with accelerator.autocast(), torch.no_grad():
                outputs = model(input_ids)
                logits = outputs.logits
            for i, score_id in enumerate(score_ids):
                label = labels[i]
                pred = torch.argmax(logits[i, score_id], dim=-1)
                prediction_step = find_first_zero(pred)
                new_batch[i]['prediction'] = prediction_step
                new_batch[i]['match'] = prediction_step == label
            res_data.extend(new_batch)
        accelerator.wait_for_everyone()
        gathered_data = gather_objects(res_data, accelerator)
        if accelerator.is_main_process:
            data1 = [e for e in gathered_data if e['label'] != -1]
            data2 = [e for e in gathered_data if e['label'] == -1]
            # dataset length check
            if len(data1) != num[0]:
                print(f'{config} error num mismatch: {len(data1)} != {num[0]}')
            if len(data2) != num[1]:
                print(f'{config} correct num mismatch: {len(data2)} != {num[1]}')
            with open(f'{save_dir}/{config}_error.jsonl', 'w') as f:
                for e in data1:
                    f.write(json.dumps(e) + '\n')
            with open(f'{save_dir}/{config}_correct.jsonl', 'w') as f:
                for e in data2:
                    f.write(json.dumps(e) + '\n')
            acc1 = np.mean([e['match'] for e in data1]) * 100
            acc2 = np.mean([e['match'] for e in data2]) * 100
            f1 = 2 * acc1 * acc2 / (acc1 + acc2)
            print(f'{config} error acc: {acc1:.1f}, correct acc: {acc2:.1f}, f1: {f1:.1f}')
            all_f1_scores.append(f1)
    if accelerator.is_main_process:
        print(f'ProcessBench. Average F1: {np.mean(all_f1_scores):.1f}')
    if accelerator.distributed_type == "MULTI_GPU":
        import torch.distributed as dist
        dist.destroy_process_group()


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--model", type=str)
    parser.add_argument("-b", "--batch_size", type=int, default=24)
    parser.add_argument("-w", "--num_of_workers", type=int, default=4)
    parser.add_argument("-s", "--separator", type=str, default="\n", help="It's important to use the same separator as the one used during TRL training")
    args = parser.parse_args()
    set_seed(42)
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    main(args)
base_model: Qwen/Qwen2.5-Math-7B-Instruct
model_type: AutoModelForTokenClassification
num_labels: 2
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
liger_rms_norm: true
liger_glu_activation: true
process_reward_model: true
chat_template:
datasets:
  - path: axolotl-ai-co/prm800k_phase_2
    type: stepwise_supervised
    step_separator: "\n\n"
    max_completion_length:
    train_on_last_step_only: false
    split: train
  - path: axolotl-ai-co/prm800k_phase_1
    type: stepwise_supervised
    step_separator: "\n\n"
    max_completion_length:
    train_on_last_step_only: false
    split: train
output_dir: ./outputs/out
remove_unused_columns: false
sequence_len: 4096
sample_packing: false
eval_sample_packing: false
pad_to_sequence_len: true
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 8
eval_batch_size: 64
num_epochs: 1
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 5e-5
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
torch_compile: true
warmup_ratio: 0.1
evals_per_epoch: 10
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 10
debug:
deepspeed: deepspeed_configs/zero1.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens: