Megatron-LM Backend
=====================
We support the Megatron backend by implementing various workers for the actor,
critic, reference, rollout and reward models. We also implement the
``3DHybridEngine`` using Megatron-LM and vLLM in `megatron_vllm.py <https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/megatron_vllm.py>`_.
**Pros**

- Supports 3D parallelism and sequence parallelism for the best scalability
  and throughput.
- The 3D HybridEngine can significantly reduce peak memory usage and the
  weight synchronization overhead between actor and rollout.

**Cons**

- Users should implement their own models for Megatron-LM.
- Users should implement the corresponding ``weight_loader`` to

  - synchronize the model weights between the actor (in Megatron) and the
    rollout (in vLLM), and
  - load weights from checkpoints to the corresponding model in Megatron-LM.

Megatron Workers
----------------
MegatronWorker
^^^^^^^^^^^^^^
``MegatronWorker`` is the base class of the different Megatron worker
classes. In this class, the ``get_megatron_global_info`` and
``get_megatron_rank_info`` functions retrieve the 3D parallel world
size and rank of each ``Worker`` running on a specific GPU. This information
will be used in the transfer protocols of the Megatron backend.

The following ``Worker`` classes for the different models will be used to
construct the ``WorkerGroup``.

We implement various APIs for each ``Worker`` class, decorated with
``@register(dispatch_mode=...)``. These APIs can be called by the Ray
driver process. The data is correctly collected and dispatched following
the ``dispatch_mode`` of each function. The supported dispatch modes
(i.e., transfer protocols) can be found in `decorator.py <https://github.com/volcengine/verl/blob/main/verl/single_controller/base/decorator.py>`_.
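
As an illustration, the sketch below declares a hypothetical worker with two
registered APIs. It is not taken from the verl codebase; only ``register``,
``Dispatch`` and ``DataProto`` are real names.

.. code:: python

   from verl.single_controller.base.decorator import register, Dispatch

   class SimpleWorker(MegatronWorker):  # hypothetical worker for illustration

       @register(dispatch_mode=Dispatch.ONE_TO_ALL)
       def init_model(self):
           # a single driver call is broadcast to every worker
           ...

       @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
       def compute_something(self, data: DataProto):
           # the driver passes DP-partitioned data; dispatch and collection
           # are handled by the decorator using this worker's 3D rank info
           ...
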
ActorRolloutRefWorker
^^^^^^^^^^^^^^^^^^^^^
This class is implemented for the Actor/Rollout HybridEngine and for the
reference model, to initialize their models and perform computation.
Actor/Rollout HybridEngine
''''''''''''''''''''''''''
1. HybridEngine, Actor and Rollout initialization API.

.. code:: python

   @register(dispatch_mode=Dispatch.ONE_TO_ALL)
   def init_model(self):

``ONE_TO_ALL``: when calling the ``init_model`` function from the driver
process, each worker (on a GPU) will execute the following model
initialization process.
The initialization details of the HybridEngine, Actor and Rollout are
highlighted below:

1. ``AllGatherPPModel`` holds the memory buffers for both the Actor and the
   Rollout and supports weight resharding between them.
2. ``MegatronPPOActor`` implements the simple PPO computation logic used
   when the model is built with Megatron, including computing log probs
   and updating the model.
3. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM
   Engine to make it execute under SPMD, so that it fits into our
   ``WorkerGroup`` design.
4. ``MegatronVLLMShardingManager`` is a context manager that performs the
   actual resharding between actor and rollout.
See `source code <https://github.com/volcengine/verl/blob/main/verl/workers/megatron_workers.py#L63>`_ for more information.
.. code:: python

   # Initialize the 3D HybridEngine
   hybrid_engine = AllGatherPPModel(model_provider=megatron_actor_model_provider)
   # Fetch the model at current rank
   actor_module = hybrid_engine.this_rank_models
   ...

   # build actor model
   self.actor = MegatronPPOActor(config=self.config.actor,
                                 model_config=self.actor_model_config,
                                 megatron_config=megatron_config,
                                 actor_module=self.actor_module,
                                 actor_optimizer=self.actor_optimizer,
                                 actor_optimizer_config=self.actor_optim_config)

   # build rollout
   # rollout initialization
   rollout = vLLMRollout(actor_module=params,
                         config=self.config.rollout,
                         tokenizer=self.tokenizer,
                         model_hf_config=self.actor_model_config,
                         train_tp=mpu.get_tensor_model_parallel_world_size())
   # perform weight resharding between actor and rollout
   sharding_manager = MegatronVLLMShardingManager(module=self.hybrid_engine,
                                                  inference_engine=rollout.inference_engine,
                                                  model_config=self.actor_model_config,
                                                  layer_name_mapping=layer_name_mapping)
   ...
2. Generate sequences and recompute log probs

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_PP_AS_DP_PROTO)
   def generate_sequences(self, prompts: DataProto):
- ``Dispatch.MEGATRON_PP_AS_DP_PROTO``: The PP dimension of the actor
  model will be regarded as the DP dimension. The driver process will then
  dispatch and collect the data according to this reorganization. This
  is because, in the HybridEngine, the actor weights, which are usually
  sharded with larger 3D parallel sizes, will be gathered along the PP and
  TP dimensions. Therefore, the corresponding data should be dispatched
  and collected through the 3D parallel group of the rollout model,
  rather than that of the actor model. However, the world_size and rank
  information can only be retrieved from ``get_megatron_global_info`` and
  ``get_megatron_rank_info``, which record the 3D information of the
  actor model. Moreover, the data resharding inside the TP dimension will be
  processed within the HybridEngine.
- In this function, the rollout model performs auto-regressive
  generation and the actor model recomputes the old log probs for the
  generated responses (see the sketch below).
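
As referenced above, a minimal driver-side sketch of this call, assuming a
``DataProto`` of prompts and an already constructed ``worker_group``:

.. code:: python

   # driver process: a single logical call; the decorator splits `prompts`
   # across the rollout's (pp x dp) ranks and concatenates the outputs
   output = worker_group.generate_sequences(prompts)
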
3. Update actor model

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def update_actor(self, data: DataProto):
- ``Dispatch.MEGATRON_COMPUTE_PROTO``: The user passes in data partitioned
  along the DP dimension. The data is dispatched to all tp/pp ranks within
  the same dp group, and output data is ultimately collected only from
  tp=0 and the last pp stage (see the conceptual sketch below).
- Update the actor model weights using the PPO loss and entropy loss.
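
The pseudocode below sketches what this protocol does conceptually. It is
illustrative only; the rank-info helpers (``rank_info``, ``dp_rank``,
``tp_rank``, ``is_last_pp``) are made-up names, not the actual attributes
used in ``decorator.py``.

.. code:: python

   # illustrative pseudocode for MEGATRON_COMPUTE_PROTO, not verl's real code
   def dispatch_megatron_compute(worker_group, data: DataProto):
       # one chunk per DP group; every tp/pp rank in a group gets the same chunk
       chunks = data.chunk(worker_group.dp_size)
       return [chunks[worker_group.rank_info(rank).dp_rank]
               for rank in range(worker_group.world_size)]

   def collect_megatron_compute(worker_group, outputs):
       # keep only the outputs of tp=0 on the last pp stage of each DP group
       kept = [out for rank, out in enumerate(outputs)
               if worker_group.rank_info(rank).tp_rank == 0
               and worker_group.rank_info(rank).is_last_pp]
       return DataProto.concat(kept)
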
ReferenceModel
''''''''''''''
1. Reference model initialization

The reference model is initialized using the same function as the actor
model, but without initializing the HybridEngine and optimizer. The
reference model is also wrapped by ``MegatronPPOActor``.

2. Compute reference log prob

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def compute_ref_log_prob(self, data: DataProto):

- In this function, the reference model calls the compute-log-prob
  function in ``MegatronPPOActor`` to compute the reference log probs.
CriticWorker and RewardWorker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Model initialization

Quite similar to the reference model. The ``CriticWorker`` will perform
additional initialization for its optimizer.

2. Compute values for ``CriticWorker``

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def compute_values(self, data: DataProto):

3. Update critic

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def update_critic(self, data: DataProto):

4. Compute reward

.. code:: python

   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
   def compute_rm_score(self, data: DataProto):
Context Parallel
----------------
Currently, only the Llama and Qwen models implemented in verl can be used, and context parallelism is not supported yet.

We are working on supporting the Megatron implementation of ``GPTModel`` with TransformerEngine support, so if the integration goes well, we will be able to support Ulysses, Ring and AllGather context parallelism in the future.

We now support the Megatron checkpoint save/load functions with the original models. Please check the :ref:`config-explain-page` page to see how to use these APIs.
PPO Ray Trainer
===============
We implement ``RayPPOTrainer``, a trainer that runs on the driver
process on a single CPU/GPU node (CPU by default).

``RayPPOTrainer`` includes 3 core functions for data preparation,
WorkerGroup initialization and the PPO training loop.
Data Preparation
----------------
``RayPPOTrainer``, as a single process, is responsible for loading a
complete batch of samples (prompts) from the dataset and then dispatching
them to the different worker groups running on different GPUs.

To generalize the data loading, we implement the ``RLHFDataset`` class
to load the preprocessed parquet files, apply chat templates to the
prompts, add padding, truncate prompts that exceed the max prompt length,
and then tokenize.
.. code:: python

   self.train_dataset = RLHFDataset(data_files=self.config.data.train_files,
                                    tokenizer=self.tokenizer,
                                    config=self.config.data)
Then, the dataloader will iterate over the dataset using the PPO mini-batch size.
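
For reference, a sketch of how such a dataloader can be constructed; the
trainer builds this internally, and the ``collate_fn`` import below is an
assumption:

.. code:: python

   from torch.utils.data import DataLoader
   from verl.utils.dataset.rl_dataset import collate_fn  # assumed helper

   self.train_dataloader = DataLoader(dataset=self.train_dataset,
                                      batch_size=self.config.data.train_batch_size,
                                      shuffle=True,
                                      drop_last=True,
                                      collate_fn=collate_fn)
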
WorkerGroup Initialization
--------------------------
We first introduce a basic implementation of initializing the
``WorkerGroup`` of the actor model on a given set of GPUs.
.. code:: python

   # max_colocate_count means the number of WorkerGroups (i.e. processes) in each RayResourcePool
   # For the FSDP backend, we recommend using max_colocate_count=1, which merges all WorkerGroups into one.
   # For the Megatron backend, we recommend using max_colocate_count>1, which can utilize different WorkerGroups for different models
   resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes,
                                   use_gpu=True,
                                   max_colocate_count=1)

   # define actor rollout cls to be init on remote
   actor_rollout_cls = RayClassWithInitArgs(cls=ActorRolloutWorker)
   # define actor_rollout worker group
   actor_rollout_worker_group = MegatronRayWorkerGroup(resource_pool=resource_pool,
                                                       ray_cls_with_init=actor_rollout_cls,
                                                       default_megatron_kwargs=config.actor_rollout.megatron)
In the above implementation, different WorkerGroups, like
``actor_rollout_worker_group``, ``critic_worker_group`` and
``ref_worker_group``, lie in separate processes.

The driver process can then call the distributed compute functions within
``actor_rollout_worker_group`` and the other roles to construct the RL
training loop.
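
For illustration, a driver-side call under this setup may look like the
following sketch, where ``batch`` is an assumed ``DataProto`` of prompts:

.. code:: python

   # each call is a single RPC from the driver; dispatch/collect are handled
   # by the transfer protocol registered on the worker method
   gen_output = actor_rollout_worker_group.generate_sequences(batch)
   batch = batch.union(gen_output)
   values = critic_worker_group.compute_values(batch)
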
For models colocated on the same set of GPUs, we further provide a
fine-grained optimization that merges the ``worker_group`` of different
roles into the same process. This optimization can save the redundant
CUDA/distributed contexts of separate processes.
.. code:: python

   # initialize WorkerGroup
   # NOTE: if you want to use a different resource pool for each role, which can support different parallel sizes,
   # you should not use `create_colocated_worker_cls`. Instead, directly pass different resource pools to different worker groups.
   # See TODO(url) for more information.
   all_wg = {}
   for resource_pool, class_dict in self.resource_pool_to_cls.items():
       worker_dict_cls = create_colocated_worker_cls(class_dict=class_dict)
       wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
       spawn_wg = wg_dict.spawn(prefix_set=class_dict.keys())
       all_wg.update(spawn_wg)

   if self.use_critic:
       self.critic_wg = all_wg['critic']
       self.critic_wg.init_model()

   if self.use_reference_policy:
       self.ref_policy_wg = all_wg['ref']
       self.ref_policy_wg.init_model()

   if self.use_rm:
       self.rm_wg = all_wg['rm']
       self.rm_wg.init_model()

   # we should create rollout at the end so that vllm can have a better estimation of kv cache memory
   self.actor_rollout_wg = all_wg['actor_rollout']
   self.actor_rollout_wg.init_model()
.. note:: For the Megatron backend, if we merge the ``worker_groups`` into the same processes, all the roles will use the same 3D parallel size. To optimize this, we may need to maintain several 3D process groups for each role within the same distributed context. If you want to use different 3D parallel sizes for different roles, please follow the architecture of the first code block and initialize each role's ``worker_group`` separately (see the sketch below).
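
A sketch of that per-role setup is below, reusing the names from the first
code block; the ``CriticWorker`` class and the pool sizes are placeholders:

.. code:: python

   # separate resource pools allow different 3D parallel sizes per role
   actor_pool = RayResourcePool(process_on_nodes=[8], use_gpu=True, max_colocate_count=3)
   critic_pool = RayResourcePool(process_on_nodes=[8], use_gpu=True, max_colocate_count=3)

   actor_rollout_wg = MegatronRayWorkerGroup(
       resource_pool=actor_pool,
       ray_cls_with_init=RayClassWithInitArgs(cls=ActorRolloutWorker),
       default_megatron_kwargs=config.actor_rollout.megatron)
   critic_wg = MegatronRayWorkerGroup(
       resource_pool=critic_pool,
       ray_cls_with_init=RayClassWithInitArgs(cls=CriticWorker),  # placeholder class
       default_megatron_kwargs=config.critic.megatron)
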
PPO Training Loop
-----------------
We implement the PPO training loop by calling the functions in the
``worker_group`` of each role. The input and output data of each function is
a ``DataProto`` object implemented in `protocol.py <https://github.com/volcengine/verl/blob/main/verl/protocol.py>`_. In the training
loop, the trainer will dispatch/collect data to/from different GPUs
following the transfer protocols wrapped in the workers' functions. The
computation of PPO micro batches is processed in the ``update_actor`` and
``update_critic`` functions.
To extend to other RLHF algorithms, such as DPO, GRPO, please refer to
:doc:`../advance/dpo_extension`.
.. code:: python

   def fit(self):
       """
       The training loop of PPO.
       The driver process only needs to call the compute functions of the worker group through RPC to construct the PPO dataflow.
       The light-weight advantage computation is done on the driver process.
       """
       from verl.utils.tracking import Tracking
       from omegaconf import OmegaConf

       logger = Tracking(project_name=self.config.trainer.project_name,
                         experiment_name=self.config.trainer.experiment_name,
                         default_backend=self.config.trainer.logger,
                         config=OmegaConf.to_container(self.config, resolve=True))

       global_steps = 0

       # perform validation before training
       # currently, we only support validation using the reward_function.
       if self.val_reward_fn is not None:
           val_metrics = self._validate()
           pprint(f'Initial validation metrics: {val_metrics}')

       for epoch in range(self.config.trainer.total_epochs):
           for batch_dict in self.train_dataloader:
               metrics = {}

               batch: DataProto = DataProto.from_single_dict(batch_dict)
               # batch = batch.to('cuda')

               # pop those keys for generation
               gen_batch = batch.pop(batch_keys=['input_ids', 'attention_mask', 'position_ids'])

               # generate a batch
               with Timer(name='gen', logger=None) as timer:
                   gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
               metrics['timing/gen'] = timer.last

               batch = batch.union(gen_batch_output)

               if self.use_reference_policy:
                   # compute reference log_prob
                   with Timer(name='ref', logger=None) as timer:
                       ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
                       batch = batch.union(ref_log_prob)
                   metrics['timing/ref'] = timer.last

               # compute values
               with Timer(name='values', logger=None) as timer:
                   values = self.critic_wg.compute_values(batch)
                   batch = batch.union(values)
               metrics['timing/values'] = timer.last

               with Timer(name='adv', logger=None) as timer:
                   # compute scores. Support both model and function-based.
                   # We first compute the scores using reward model. Then, we call reward_fn to combine
                   # the results from reward model and rule-based results.
                   if self.use_rm:
                       # we first compute reward model score
                       reward_tensor = self.rm_wg.compute_rm_score(batch)
                       batch = batch.union(reward_tensor)

                   # we combine with rule-based rm
                   reward_tensor = self.reward_fn(batch)
                   batch.batch['token_level_scores'] = reward_tensor

                   # compute rewards. apply_kl_penalty if available
                   batch, kl_metrics = apply_kl_penalty(batch,
                                                        kl_ctrl=self.kl_ctrl_in_reward,
                                                        kl_penalty=self.config.algorithm.kl_penalty)
                   metrics.update(kl_metrics)

                   # compute advantages, executed on the driver process
                   batch = compute_advantage(batch,
                                             self.config.algorithm.gamma,
                                             self.config.algorithm.lam,
                                             adv_estimator=self.config.algorithm.adv_estimator)
               metrics['timing/adv'] = timer.last

               # update critic
               if self.use_critic:
                   with Timer(name='update_critic', logger=None) as timer:
                       critic_output = self.critic_wg.update_critic(batch)
                   metrics['timing/update_critic'] = timer.last
                   critic_output_metrics = reduce_metrics(critic_output.meta_info['metrics'])
                   metrics.update(critic_output_metrics)

               # implement critic warmup
               if self.config.trainer.critic_warmup <= global_steps:
                   # update actor
                   with Timer(name='update_actor', logger=None) as timer:
                       actor_output = self.actor_rollout_wg.update_actor(batch)
                   metrics['timing/update_actor'] = timer.last
                   actor_output_metrics = reduce_metrics(actor_output.meta_info['metrics'])
                   metrics.update(actor_output_metrics)

               # validate
               if self.val_reward_fn is not None and (global_steps + 1) % self.config.trainer.test_freq == 0:
                   with Timer(name='testing', logger=None) as timer:
                       val_metrics: dict = self._validate()
                       val_metrics = {f'val/{key}': val for key, val in val_metrics.items()}
                   metrics['timing/testing'] = timer.last
                   metrics.update(val_metrics)

               # collect metrics
               data_metrics = compute_data_metrics(batch=batch)
               metrics.update(data_metrics)

               # TODO: make a canonical logger that supports various backends
               logger.log(data=metrics, step=global_steps)

               if self.config.trainer.save_freq > 0 and (global_steps + 1) % self.config.trainer.save_freq == 0:
                   actor_local_path = os.path.join(self.config.trainer.default_local_dir, 'actor',
                                                   f'global_step_{global_steps}')
                   actor_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'actor')
                   self.actor_rollout_wg.save_checkpoint(actor_local_path, actor_remote_path)

                   if self.use_critic:
                       critic_local_path = os.path.join(self.config.trainer.default_local_dir, 'critic',
                                                        f'global_step_{global_steps}')
                       critic_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'critic')
                       self.critic_wg.save_checkpoint(critic_local_path, critic_remote_path)

               global_steps += 1

       # perform validation after training
       if self.val_reward_fn is not None:
           val_metrics = self._validate()
           pprint(f'Final validation metrics: {val_metrics}')
SGLang Backend
==============
**Authored By SGLang RL Team and listed alphabetically by last name**
`Jingyi Chen <https://github.com/fzyzcjy>`_, `Yitong Guan <https://github.com/minleminzui>`_, `Zhuobin Huang <https://zobinhuang.github.io/sec_about/>`_, `Jiajun Li <https://github.com/guapisolo>`_, `Ji Li <https://github.com/GeLee-Q>`_, `Shenggui Li <https://franklee.xyz/about>`_, `Junrong Lin <https://github.com/ocss884>`_, `Xiang Long <https://github.com/SwordFaith>`_, `Rui Lu <https://scholar.google.com/citations?user=-MGuqDcAAAAJ>`_, `Jin Pan <https://jhinpan.github.io/>`_, `Shuai Shi <https://github.com/shuaills>`_, `Yushen Su <https://yushengsu-thu.github.io/>`_, `Xinyuan Tong <https://github.com/JustinTong0323>`_, `Chendong Wang <https://github.com/cedricbeta>`_, `Hanchen Zhang <https://scholar.google.com/citations?user=pGcJcagAAAAJ>`_, `Haoran Wang <https://ubecc.github.io/about/>`_, `Yongan Xiang <https://github.com/BearBiscuit05>`_, `Chengxing Xie <https://yitianlian.github.io/>`_, `Yuhao Yang <https://github.com/yhyang201>`_, `Jinwei Yao <https://kivi-yao.github.io/>`_, `Qiaolin Yu <https://github.com/Qiaolin-Yu>`_, `Yuzhen Zhou <https://github.com/zyzshishui>`_, `Chenyang Zhao <https://github.com/zhaochenyang20>`_
Introduction
------------
`SGLang <https://github.com/sgl-project/sglang>`_ is an open-source, state-of-the-art inference engine, fully adopted by xAI to support all of Grok's inference needs during research and serving.

Currently, verl fully supports using SGLang as the inference engine during the rollout phase. As a rollout engine, SGLang provides the same feature coverage as vLLM, including memory saving and multi-node rollout features. After installing verl and SGLang, simply add ``actor_rollout_ref.rollout.name=sglang`` at startup to seamlessly switch between the two inference frameworks.
In addition, the SGLang team is actively working on supporting features such as Multi-Turn Agentic RL, VLM RLHF, Server-Based RLHF, and Partial Rollout. You can track the related development progress in the `Tracking Roadmap <https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/74>`_.
Installation
------------
First, follow the requirements outlined in `Install SGLang as rollout backend <https://verl.readthedocs.io/en/latest/start/install.html#install-sglang-as-rollout-backend>`_ for installation, and ensure that the version requirements are met. Generally, using the latest `SGLang <https://github.com/sgl-project/sglang>`_ from the main branch will allow stable training startup without needing to target a specific version.
.. code-block:: bash

   # Currently 0.4.5, subject to updates at any time, please refer to the latest version
   pip install "sglang[all]>=0.4.5" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
Using SGLang as the Inference Backend for PPO Training on a Single Machine
---------------------------------------------------------------------------
We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.
1. Run the following command to prepare the gsm8k dataset:
.. code-block:: bash

   python3 examples/data_preprocess/gsm8k.py
2. Run the following script to conduct a PPO experiment on a single machine with 4 GPUs:
.. code-block:: bash

   PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=4096 \
    data.max_prompt_length=4096 \
    data.max_response_length=4096 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=Qwen/Qwen2-7B-Instruct \
    critic.ppo_micro_batch_size_per_gpu=4 \
    critic.model.fsdp_config.param_offload=True \
    critic.model.fsdp_config.optimizer_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=['console'] \
    trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
Using SGLang as the Inference Backend for PPO Training Across Multiple Machines
--------------------------------------------------------------------------------
SGLang also supports running verl's Ray-based cross-machine inference in IPv4 and IPv6 scenarios. In the script below, we use TP=16 for cross-machine inference. Suppose we have two interconnected machines: node0 with IP 10.94.16.4 and node1 with IP 10.94.16.5.
1. Start Ray on node0:
.. code-block:: bash

   ray start --head --dashboard-host=0.0.0.0
You will see the following prompt:
.. code-block:: bash

   Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

   Local node IP: 10.94.16.4

   --------------------
   Ray runtime started.
   --------------------

   Next steps
     To add another node to this Ray cluster, run
       ray start --address='10.94.16.4:6379'
2. Have node1 join the Ray cluster:
Run the following command on node1:
.. code-block:: bash

   ray start --address='10.94.16.4:6379'

Run the following command to confirm that the Ray cluster now has two nodes:

.. code-block:: bash

   ray status
You can see that the cluster has two nodes with 16 GPUs:
.. code-block:: bash

   ======== Autoscaler status: 2025-04-09 09:25:37.694016 ========
   Node status
   ---------------------------------------------------------------
   Active:
    1 node_ef382ffd687d8f6b060c1b68e63ada7341b936fe5b1901dd04de1027
    1 node_1eb4d7d07e793114c23a89d1a41f1f76acf6ef5b35af844a4ee8e4ba
   Pending:
    (no pending nodes)
   Recent failures:
    (no failures)

   Resources
   ---------------------------------------------------------------
   Usage:
    0.0/360.0 CPU
    0.0/16.0 GPU
    0B/3.39TiB memory
    0B/372.53GiB object_store_memory
3. Run the following script to train meta-llama/Llama-3.1-8B-Instruct with TP=16 across 2 machines using 16 GPUs:
.. code-block:: bash

   DATA_DIR=$HOME/data/gsm8k

   python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    data.train_files=$DATA_DIR/train.parquet \
    data.val_files=$DATA_DIR/test.parquet \
    data.train_batch_size=4096 \
    data.max_prompt_length=4096 \
    data.max_response_length=4096 \
    actor_rollout_ref.model.path=meta-llama/Llama-3.1-8B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=16 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.free_cache_engine=True \
    actor_rollout_ref.ref.log_prob_micro_batch_size=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=meta-llama/Llama-3.1-8B-Instruct \
    critic.model.enable_gradient_checkpointing=True \
    critic.ppo_micro_batch_size=16 \
    critic.model.fsdp_config.param_offload=True \
    critic.model.fsdp_config.optimizer_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console'] \
    trainer.val_before_train=True \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_freq=-1 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
set -x
# the config file used: verl/trainer/main_ppo/config/ppo_megatron_trainer.yaml
huggingface-cli download deepseek-ai/deepseek-llm-7b-chat
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator=gae \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.pipeline_model_parallel_size=2 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.use_kl_in_reward=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_megatron_checkpoint' \
trainer.experiment_name='deepseek_megatron_checkpoint_saveload' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.save_freq=50 \
trainer.test_freq=1 \
trainer.total_epochs=15 \
trainer.total_training_steps=50 $@
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator=gae \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.pipeline_model_parallel_size=2 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.use_kl_in_reward=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_megatron_checkpoint' \
trainer.experiment_name='deepseek_megatron_checkpoint_saveload' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.resume_mode=auto \
trainer.save_freq=-1 \
trainer.test_freq=1 \
trainer.total_epochs=15 \
trainer.total_training_steps=150 $@
set -x
# the config file used: verl/trainer/main_ppo/config/ppo_megatron_trainer.yaml
huggingface-cli download Qwen/Qwen2-7B-Instruct
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator=gae \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=Qwen/Qwen2-7B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.pipeline_model_parallel_size=2 \
critic.megatron.virtual_pipeline_model_parallel_size=2 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.use_kl_in_reward=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_megatron_checkpoint' \
trainer.experiment_name='qwen2_7b_megatron_saveload' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.save_freq=100 \
trainer.test_freq=1 \
trainer.total_epochs=15 \
trainer.total_training_steps=100 $@
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator=gae \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=Qwen/Qwen2-7B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.pipeline_model_parallel_size=2 \
critic.megatron.virtual_pipeline_model_parallel_size=2 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.use_kl_in_reward=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_megatron_checkpoint' \
trainer.experiment_name='qwen2_7b_megatron_saveload' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.resume_mode=auto \
trainer.save_freq=-1 \
trainer.test_freq=1 \
trainer.total_epochs=15 \
trainer.total_training_steps=150 $@
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
- Preprocess data and split the training set into 75% for training RM and 25% for validting RM.
- All the training data is used to train SFT and RL.
- Both chosen and rejected is used to train SFT
"""
import argparse
import os
import pandas as pd
from datasets import load_dataset
from tqdm.auto import tqdm
from verl.utils.fs import copy, makedirs
def generate_sft_dataset(target_hdfs_path_dir, local_dir='~/data/full_hh_rlh/sft'):
dataset = load_dataset('Dahoas/full-hh-rlhf')
output = {'prompt': [], 'response': []}
for data in tqdm(dataset['train']):
# add chosen
output['prompt'].append(data['prompt'])
output['response'].append(data['chosen'])
# add rejection
output['prompt'].append(data['prompt'])
output['response'].append(data['rejected'])
df = pd.DataFrame(output)
local_dir = os.path.expanduser(local_dir)
os.makedirs(local_dir, exist_ok=True)
local_path = os.path.join(local_dir, 'train.parquet')
df.to_parquet(path=local_path)
if target_hdfs_path_dir is not None:
hdfs_dir = target_hdfs_path_dir + '/' + 'train.parquet'
makedirs(hdfs_dir)
copy(local_path, hdfs_dir)
def generate_rm_dataset(target_hdfs_path_dir, local_dir='~/data/full_hh_rlh/rm'):
train_dataset = load_dataset('Dahoas/full-hh-rlhf', split='train[:75%]')
test_dataset = load_dataset('Dahoas/full-hh-rlhf', split='train[-25%:]')
local_dir = os.path.expanduser(local_dir)
os.makedirs(local_dir, exist_ok=True)
for dataset, name in zip([train_dataset, test_dataset], ['train', 'test']):
output = {'prompt': [], 'chosen': [], 'rejected': []}
for data in tqdm(dataset):
# add chosen
output['prompt'].append(data['prompt'])
output['chosen'].append(data['chosen'])
output['rejected'].append(data['rejected'])
df = pd.DataFrame(output)
local_path = os.path.join(local_dir, name + '.parquet')
df.to_parquet(path=local_path)
if target_hdfs_path_dir is not None:
hdfs_dir = target_hdfs_path_dir + '/' + name + '.parquet'
makedirs(hdfs_dir)
copy(local_path, hdfs_dir)
def generate_rl_dataset(target_hdfs_path_dir, local_dir='~/data/full_hh_rlhf/rl'):
dataset = load_dataset('Dahoas/full-hh-rlhf')
train_dataset = dataset['train']
data_source = 'Dahoas/full-hh-rlhf'
# add a row to each data item that represents a unique id
def make_map_fn(split):
def process_fn(example, idx):
prompt = example.pop('prompt')
response = example.pop('response')
data = {
"data_source": data_source,
"prompt": [{
"role": "user",
"content": prompt
}],
"ability": "alignment",
"reward_model": {
"style": "model",
"ground_truth": response # should not be used
},
"extra_info": {
'split': split,
'index': idx
}
}
return data
return process_fn
train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
local_dir = os.path.expanduser(local_dir)
local_path = os.path.join(local_dir, 'train.parquet')
train_dataset.to_parquet(local_path)
if target_hdfs_path_dir is not None:
hdfs_dir = target_hdfs_path_dir + '/' + 'train.parquet'
makedirs(hdfs_dir)
copy(local_path, hdfs_dir)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--split', type=str, choices=['sft', 'rm', 'rl'], required=True)
parser.add_argument('--local_dir', type=str, default='~/data/full_hh_rlhf')
parser.add_argument('--hdfs_dir', type=str, required=False, default=None)
args = parser.parse_args()
if args.split == 'sft':
generate_sft_dataset(args.hdfs_dir, os.path.join(args.local_dir, args.split))
elif args.split == 'rm':
generate_rm_dataset(args.hdfs_dir, os.path.join(args.local_dir, args.split))
elif args.split == 'rl':
generate_rl_dataset(args.hdfs_dir, os.path.join(args.local_dir, args.split))
else:
raise NotImplementedError
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the Geometry3k dataset to parquet format
"""
import os

import datasets

from verl.utils.hdfs_io import copy, makedirs
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_dir', default='~/data/geo3k')
    parser.add_argument('--hdfs_dir', default=None)
    args = parser.parse_args()

    data_source = 'hiyouga/geometry3k'
    dataset = datasets.load_dataset(data_source)

    train_dataset = dataset['train']
    test_dataset = dataset['test']

    instruction_following = (
        r'You FIRST think about the reasoning process as an internal monologue and then provide the final answer. '
        r'The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}.'
    )

    # add a row to each data item that represents a unique id
    def make_map_fn(split):

        def process_fn(example, idx):
            problem = example.pop('problem')
            prompt = problem + ' ' + instruction_following
            answer = example.pop('answer')
            images = example.pop('images')

            data = {
                "data_source": data_source,
                "prompt": [{
                    "role": "user",
                    "content": prompt,
                }],
                "images": images,
                "ability": "math",
                "reward_model": {
                    "style": "rule",
                    "ground_truth": answer
                },
                "extra_info": {
                    'split': split,
                    'index': idx,
                    'answer': answer,
                    "question": problem,
                }
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True, num_proc=8)
    test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True, num_proc=8)

    local_dir = args.local_dir
    hdfs_dir = args.hdfs_dir

    train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
    test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)
        copy(src=local_dir, dst=hdfs_dir)
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the GSM8k dataset to parquet format
"""
import re
import os

import datasets

from verl.utils.hdfs_io import copy, makedirs
import argparse


def extract_solution(solution_str):
    # GSM8k reference answers end with '#### <final answer>'; extract the
    # final numeric answer and strip thousands separators
    solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    final_solution = final_solution.split('#### ')[1].replace(',', '')
    return final_solution


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_dir', default='~/data/gsm8k')
    parser.add_argument('--hdfs_dir', default=None)
    args = parser.parse_args()

    data_source = 'openai/gsm8k'
    dataset = datasets.load_dataset(data_source, 'main')

    train_dataset = dataset['train']
    test_dataset = dataset['test']

    instruction_following = "Let's think step by step and output the final answer after \"####\"."

    # add a row to each data item that represents a unique id
    def make_map_fn(split):

        def process_fn(example, idx):
            question_raw = example.pop('question')
            question = question_raw + ' ' + instruction_following
            answer_raw = example.pop('answer')
            solution = extract_solution(answer_raw)
            data = {
                "data_source": data_source,
                "prompt": [{
                    "role": "user",
                    "content": question,
                }],
                "ability": "math",
                "reward_model": {
                    "style": "rule",
                    "ground_truth": solution
                },
                "extra_info": {
                    'split': split,
                    'index': idx,
                    'answer': answer_raw,
                    "question": question_raw,
                }
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
    test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)

    local_dir = args.local_dir
    hdfs_dir = args.hdfs_dir

    train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
    test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)
        copy(src=local_dir, dst=hdfs_dir)
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess Hellaswag dataset.
"""
import re
import os

import datasets

from verl.utils.hdfs_io import copy, makedirs
import argparse


def preprocess(text):
    text = text.strip()
    # NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    # collapse double spaces left over from the substitutions above
    text = text.replace("  ", " ")
    return text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_dir', default='/opt/tiger/hellaswag')
    parser.add_argument('--hdfs_dir', default=None)
    args = parser.parse_args()

    data_source = 'Rowan/hellaswag'
    dataset = datasets.load_dataset(data_source, trust_remote_code=True)

    train_dataset = dataset['train']
    val_dataset = dataset['validation']
    test_dataset = dataset['test']

    instruction = 'Please complete the following sentence.\n'

    def make_map_fn(split):

        def process_fn(doc, idx):
            ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
            query = preprocess(doc["activity_label"] + ": " + ctx)
            choices = [preprocess(ending) for ending in doc["endings"]]
            gold = int(doc["label"])
            data = {
                "data_source": data_source,
                "prompt": [{
                    "role": "user",
                    "content": query
                }],
                "ability": "nlp",
                "reward_model": {
                    "style": "model",
                    "eval": "multiple_choice",  # using loglikelihood
                    "ground_truth": gold,
                    "choices": choices
                },
                "extra_info": {
                    'split': split,
                    'index': idx
                }
            }
            return data

        return process_fn

    # filter data that doesn't have a label
    train_dataset = train_dataset.filter(lambda x: len(x['label']) > 0)
    val_dataset = val_dataset.filter(lambda x: len(x['label']) > 0)
    test_dataset = test_dataset.filter(lambda x: len(x['label']) > 0)

    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
    val_dataset = val_dataset.map(function=make_map_fn('validation'), with_indices=True)
    test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)

    local_dir = args.local_dir
    hdfs_dir = args.hdfs_dir

    train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
    val_dataset.to_parquet(os.path.join(local_dir, 'validation.parquet'))
    test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)
        copy(src=local_dir, dst=hdfs_dir)
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the MATH-lighteval dataset to parquet format
"""
import os

import datasets

from verl.utils.hdfs_io import copy, makedirs
import argparse

from verl.utils.reward_score.math import remove_boxed, last_boxed_only_string


def extract_solution(solution_str):
    # the final answer in MATH solutions is wrapped in \boxed{...}
    return remove_boxed(last_boxed_only_string(solution_str))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_dir', default='~/data/math')
    parser.add_argument('--hdfs_dir', default=None)
    args = parser.parse_args()

    # 'lighteval/MATH' is no longer available on huggingface.
    # Use mirror repo: DigitalLearningGmbH/MATH-lighteval
    data_source = 'DigitalLearningGmbH/MATH-lighteval'
    print(f"Loading the {data_source} dataset from huggingface...", flush=True)
    dataset = datasets.load_dataset(data_source, trust_remote_code=True)

    train_dataset = dataset['train']
    test_dataset = dataset['test']

    instruction_following = "Let's think step by step and output the final answer within \\boxed{}."

    # add a row to each data item that represents a unique id
    def make_map_fn(split):

        def process_fn(example, idx):
            question = example.pop('problem')
            question = question + ' ' + instruction_following
            answer = example.pop('solution')
            solution = extract_solution(answer)
            data = {
                "data_source": data_source,
                "prompt": [{
                    "role": "user",
                    "content": question
                }],
                "ability": "math",
                "reward_model": {
                    "style": "rule",
                    "ground_truth": solution
                },
                "extra_info": {
                    'split': split,
                    'index': idx
                }
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
    test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)

    local_dir = args.local_dir
    hdfs_dir = args.hdfs_dir

    train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
    test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)
        copy(src=local_dir, dst=hdfs_dir)
# Copyright 2024 Bytedance Ltd. and/or its affiliates
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Create a simple multi-turn dataset for testing
"""
import os
import pandas as pd
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_dir', default='~/data/multiturn')
    parser.add_argument('--hdfs_dir', default=None)
    args = parser.parse_args()

    # Create example conversations
    conversations = []

    # Conversation 1
    conversations.append({
        "messages": [{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": "What is the capital of France?"
        }, {
            "role": "assistant",
            "content": "The capital of France is Paris."
        }, {
            "role": "user",
            "content": "And what about Germany?"
        }, {
            "role": "assistant",
            "content": "The capital of Germany is Berlin."
        }]
    })

    # Conversation 2
    conversations.append({
        "messages": [{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": "Can you explain quantum computing?"
        }, {
            "role": "assistant",
            "content": "Quantum computing is a type of computing that uses quantum-mechanical phenomena, such as superposition and entanglement, to perform operations on data."
        }, {
            "role": "user",
            "content": "How is it different from classical computing?"
        }, {
            "role": "assistant",
            "content": "Classical computing uses bits that are either 0 or 1, while quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously due to superposition."
        }]
    })

    # Conversation 3
    conversations.append({
        "messages": [{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": "Write a simple Python function to calculate factorial."
        }, {
            "role": "assistant",
            "content": "```python\ndef factorial(n):\n    if n == 0 or n == 1:\n        return 1\n    else:\n        return n * factorial(n-1)\n```\n\nThis is a recursive function to calculate the factorial of a number."
        }, {
            "role": "user",
            "content": "Can you make it iterative instead?"
        }, {
            "role": "assistant",
            "content": "```python\ndef factorial(n):\n    result = 1\n    for i in range(1, n+1):\n        result *= i\n    return result\n```\n\nThis is an iterative version of the factorial function."
        }]
    })

    # Create train and test datasets
    train_data = conversations[:2]  # First 2 conversations for training
    test_data = conversations[2:]  # Last conversation for testing

    # Create output directory
    local_dir = os.path.expanduser(args.local_dir)
    os.makedirs(local_dir, exist_ok=True)

    # Save to parquet files
    train_df = pd.DataFrame(train_data)
    test_df = pd.DataFrame(test_data)
    train_df.to_parquet(os.path.join(local_dir, 'train.parquet'))
    test_df.to_parquet(os.path.join(local_dir, 'test.parquet'))

    # Handle HDFS if specified
    if args.hdfs_dir is not None:
        try:
            from verl.utils.hdfs_io import copy, makedirs
            makedirs(args.hdfs_dir)
            copy(src=local_dir, dst=args.hdfs_dir)
        except ImportError:
            print("Warning: HDFS support not available. Skipping HDFS copy.")

    # Print statistics
    print(f"Train dataset size: {len(train_df)}")
    print(f"Test dataset size: {len(test_df)}")
    print(f"Data saved to {local_dir}")


if __name__ == '__main__':
    main()
set -x
data_path=$HOME/data/rlhf/gsm8k/test.parquet
save_path=$HOME/data/rlhf/math/deepseek_v2_lite_gen_test.parquet
model_path=deepseek-ai/deepseek-llm-7b-chat
python3 -m verl.trainer.main_generation \
trainer.nnodes=2 \
trainer.n_gpus_per_node=8 \
data.path=$data_path \
data.prompt_key=prompt \
data.n_samples=1 \
data.output_path=$save_path \
model.path=$model_path \
+model.trust_remote_code=True \
rollout.temperature=1.0 \
rollout.top_k=50 \
rollout.top_p=0.7 \
rollout.prompt_length=2048 \
rollout.response_length=1024 \
rollout.tensor_model_parallel_size=16 \
rollout.gpu_memory_utilization=0.8
set -x
data_path=$HOME/data/rlhf/gsm8k/test.parquet
save_path=$HOME/data/rlhf/math/deepseek_v2_lite_gen_test.parquet
model_path=deepseek-ai/deepseek-llm-7b-chat
python3 -m verl.trainer.main_generation \
trainer.nnodes=1 \
trainer.n_gpus_per_node=8 \
data.path=$data_path \
data.prompt_key=prompt \
data.n_samples=1 \
data.output_path=$save_path \
model.path=$model_path \
+model.trust_remote_code=True \
rollout.temperature=1.0 \
rollout.top_k=50 \
rollout.top_p=0.7 \
rollout.prompt_length=2048 \
rollout.response_length=1024 \
rollout.tensor_model_parallel_size=2 \
rollout.gpu_memory_utilization=0.8
set -x
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=80 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=160 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=160 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
math_train_path=$HOME/data/math/train.parquet
math_test_path=$HOME/data/math/test.parquet
train_files="['$gsm8k_train_path', '$math_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path']"
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm_math' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
math_train_path=$HOME/data/math/train.parquet
math_test_path=$HOME/data/math/test.parquet
train_files="['$gsm8k_train_path', '$math_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path']"
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml'\
algorithm.adv_estimator=grpo \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm_math_megatron' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 "$@"

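# GRPO on GSM8K only with deepseek-llm-7b-chat on the Megatron backend:
# actor PP=2 x TP=4, vLLM rollout TP=2, 16 GPUs on a single node.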
set -x
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml' \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm_megatron' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 "$@"

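# GRPO on GSM8K with deepseek-llm-7b-chat: FSDP backend with dynamic
# (token-count based) micro-batching instead of a fixed micro batch size,
# 8 GPUs on a single node.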
set -x
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
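# Dynamic batch sizing: sequences are packed up to 24k tokens per GPU per
# micro-batch instead of using a fixed ppo_micro_batch_size_per_gpu.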
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm_seq_packing' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 "$@"

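# GRPO on GSM8K with Qwen2-7B-Instruct: FSDP backend, vLLM rollout (TP=2),
# 8 GPUs on a single node.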
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 "$@"

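# GRPO on GSM8K + MATH with Qwen2-7B-Instruct: FSDP backend, vLLM rollout
# (TP=2), 16 GPUs on a single node.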
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
math_train_path=$HOME/data/math/train.parquet
math_test_path=$HOME/data/math/test.parquet
train_files="['$gsm8k_train_path', '$math_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path']"
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 "$@"