.. _checkpoint-page:
Using Checkpoints to Support Fault Tolerance Training
=====================================================
Last updated: 06/25/2025.
Training errors or machine failures can occur during the RLHF training process,
so it is recommended to enable checkpointing to minimize your losses.
The API interface has already been listed in :ref:`config-explain-page`,
so we will not repeat it here. But there are still some technical details
we hope to clarify.

.. note::

    The ``checkpoint.contents`` field has no effect on FSDP checkpoints except for ``hf_model``;
    the other three fields are bound together for saving and loading. We recommend including ``model``, ``optimizer`` and ``extra`` all together.
Checkpoint Saving Directory Structure
-------------------------------------
Commonly, we use the ``default_local_dir`` declared in ``ppo_trainer.yaml`` or ``ppo_megatron_trainer.yaml``
as the prefix when saving checkpoints, which defaults to ``checkpoints/${trainer.project_name}/${trainer.experiment_name}``.
The inner checkpoint structure of **FSDP** looks like:
.. code::

    checkpoints/${trainer.project_name}/${trainer.experiment_name}
    ├── global_step_${i}
    │   ├── actor
    │   │   ├── huggingface       # saves config and tokenizer by default; also saves the HuggingFace model if ``hf_model`` is included in checkpoint.contents
    │   │   ├── fsdp_config.json  # FSDP config file, including world_size and FSDP version
    │   │   ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
    │   │   ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
    │   │   └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
    │   └── critic
    │       ├── huggingface
    │       ├── fsdp_config.json
    │       ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
    │       ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
    │       └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
    └── latest_checkpointed_iteration.txt
All model shards, optimizers and extra states are stored together, in a sharded and distributed way.
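If you ever need to inspect a single shard offline, each ``.pt`` file can be opened directly with ``torch.load``. The snippet below is a minimal sketch assuming the layout above; the project name, experiment name, step, world size and rank are placeholders.

.. code:: python

    # Minimal sketch: peek at one FSDP shard offline (all names below are placeholders).
    import torch

    ckpt_dir = "checkpoints/my_project/my_experiment/global_step_100/actor"
    world_size, rank = 8, 0
    shard = torch.load(
        f"{ckpt_dir}/model_world_size_{world_size}_rank_{rank}.pt",
        map_location="cpu",
    )
    # the file holds this rank's portion of the sharded model state_dict
    print(type(shard))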
The current **Megatron** checkpoint structure is:
.. code::

    checkpoints/${trainer.project_name}/${trainer.experiment_name}
    ├── global_step_${i}
    │   ├── actor
    │   │   ├── huggingface   # saves config and tokenizer by default; also saves the HuggingFace model if ``hf_model`` is included in checkpoint.contents
    │   │   └── dist_ckpt     # sharded model/optimizer/rng_states, with the same naming scheme as Megatron
    │   └── critic
    │       ├── huggingface
    │       └── dist_ckpt
    └── latest_checkpointed_iteration.txt
Convert FSDP and Megatron Checkpoints to HuggingFace Format Model
-----------------------------------------------------------------
We provide a tool to convert the FSDP and Megatron checkpoints to HuggingFace format model.
The tool is located in ``verl/model_merger``. For older versions of verl that don't include fsdp_config.json in checkpoints, you can use the legacy model merger located at ``verl/scripts/legacy_model_merger.py``.
The script supports two main sub-commands: `merge` (to convert and save checkpoints) and `test` (to validate merged checkpoints against a reference model).
The arguments for the `merge` sub-command are as follows:
.. code:: bash

    usage: python -m verl.model_merger merge [-h] --backend {fsdp,megatron} [--local_dir LOCAL_DIR] [--tie-word-embedding] [--is-value-model] [--use_cpu_initialization]
                                             [--target_dir TARGET_DIR] [--hf_upload_path HF_UPLOAD_PATH] [--private]

    options:
      -h, --help            show this help message and exit
      --backend {fsdp,megatron}
                            The backend of the model
      --local_dir LOCAL_DIR
                            Path to the saved model checkpoints
      --tie-word-embedding  Whether to tie word embedding weights (currently only Megatron supported)
      --is-value-model      Whether the model is a value model (currently only Megatron supported)
      --use_cpu_initialization
                            Whether to use CPU initialization for the model. This is useful for large models that cannot fit into GPU memory during initialization.
      --target_dir TARGET_DIR
                            Directory to save the merged huggingface model
      --hf_upload_path HF_UPLOAD_PATH
                            Hugging Face repository ID to upload the model
      --private             Whether to upload the model to a private Hugging Face repository
Example usage for merging Megatron checkpoints:
.. code:: bash

    python -m verl.model_merger merge \
        --backend megatron \
        --tie-word-embedding \
        --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor \
        --target_dir /path/to/merged_hf_model
Example usage for merging Megatron checkpoints in a distributed manner:
.. code:: bash

    torchrun --nproc_per_node 1 --nnodes 8 --node_rank ${RANK} -m verl.model_merger merge \
        --backend megatron \
        --tie-word-embedding \
        --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor \
        --target_dir /path/to/merged_hf_model
Example usage for merging FSDP checkpoints:
.. code:: bash

    python -m verl.model_merger merge \
        --backend fsdp \
        --local_dir checkpoints/verl_fsdp_gsm8k_examples/qwen2_5_0b5_fsdp_saveload/global_step_1/actor \
        --target_dir /path/to/merged_hf_model
Megatron Merger details
-----------------------
The current implementation of decoder layers uses ``nn.ModuleList`` to store the layers,
and thus the model layers on every PP rank and VPP rank start their indices from 0.
There are three ways to correct this behavior:
1. Modify the decoder layer's ``state_dict`` and add the ``offset`` to each layer's index, which means rewriting the ``nn.ModuleList`` implementation.
2. Modify the layer indices when saving the checkpoint and recover them when loading it.
3. Let the checkpoint merger do this work by calculating the actual ``offset`` from the ``state_dict`` alone, which is a little more complex.
The current implementation uses solution 2, illustrated below.
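As an illustration of solution 2, the fix is essentially a rename of the layer indices inside the ``state_dict``. The sketch below is not verl's actual code; the key pattern and the ``offset`` argument are assumptions for illustration only.

.. code:: python

    # Illustrative sketch of solution 2 (not verl's actual code): shift the local layer
    # indices of one PP/VPP rank by `offset` before saving, and shift them back on load.
    import re

    def shift_layer_indices(state_dict, offset):
        """Return a new state_dict whose '...layers.<i>...' keys are shifted by `offset`."""
        pattern = re.compile(r"(\.layers\.)(\d+)(\.)")
        shifted = {}
        for name, tensor in state_dict.items():
            new_name = pattern.sub(
                lambda m: f"{m.group(1)}{int(m.group(2)) + offset}{m.group(3)}", name)
            shifted[new_name] = tensor
        return shifted

    # saving:  global_sd = shift_layer_indices(local_sd, offset=+first_layer_index_of_this_rank)
    # loading: local_sd  = shift_layer_indices(global_sd, offset=-first_layer_index_of_this_rank)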
HuggingFace to Megatron DistCheckpoint details
----------------------------------------------
If your model is quite large, we recommend using the Megatron dist-checkpoint to load it.
The Megatron dist-checkpoint supports loading under different kinds of model parallelism,
and it is much faster than the original checkpoint loading.
To convert an original HuggingFace model to a Megatron dist-checkpoint,
you can use the ``scripts/converter_hf_to_mcore.py`` script. Large MoE models are temporarily supported via CPU initialization,
which is a little slower; we are working on a better solution for large models.
An example command to convert a model is as follows:
.. code:: bash

    python scripts/converter_hf_to_mcore.py \
        --hf_model_path Qwen/Qwen1.5-MoE-A2.7B-Chat \
        --output_path /mnt/disk/Qwen/Qwen1.5-MoE-A2.7B-Chat \
        --use_cpu_initialization    # only works for MoE models
An example command for distributed conversion of a huge model like DeepSeek-V3 671B is as follows:
.. code:: bash

    torchrun --nproc_per_node 1 --nnodes 8 --node_rank ${RANK} scripts/converter_hf_to_mcore.py \
        --hf_model_path deepseek-ai/DeepSeek-V3 \
        --output_path /mnt/disk/deepseek-ai/DeepSeek-V3 \
        --use_cpu_initialization    # only works for MoE models
Original Checkpoint Utils
-------------------------
The original checkpoint utils refer to the original checkpoint implementation in ``verl/models/[model]/megatron/checkpoint_utils``.
We now only need ``[model]_loader.py`` from the original checkpoint utils, since we no longer store ``hf_model`` every time (which is not recommended for large-model training; try to save only sharded models if you can).

.. note::

    Note that ``[model]_loader`` only supports environments where **the storage cluster is reachable from every compute node**,
    because it relies on **sharded loading to minimize the checkpoint-loading overhead**:
    every rank loads its own data from a ``state_dict`` that is accessible to all of them.
    There is also no need to broadcast among DP ranks, since the saved ``state_dict`` is produced only by DP rank 0.

For users who can **only place the HuggingFace model on one device**, we keep the original, costly implementation in ``[model]_loader_deprecated``. In this implementation, rank 0 broadcasts all weights to each TP and PP rank, and then DP rank 0 broadcasts them to all DP ranks. This may risk OOM.
To use the deprecated loader, change the import of ``load_state_dict_to_megatron_llama``.
Extend to other RL(HF) algorithms
=================================
Last updated: 02/25/2025.
We have already implemented the complete training pipeline of the PPO
algorithm. To extend it to other algorithms, we analyze the high-level
principles of using verl and provide a tutorial for implementing the DPO
algorithm. Users can follow a similar paradigm to extend verl to other RL algorithms.
.. note:: **Key ideas**: Single process drives multi-process computation and data communication.
Overall Approach
----------------
Step 1: Consider what multi-machine, multi-GPU computations are needed
for each model, such as ``generate_sequence``, ``compute_log_prob`` and
``update_policy`` in the actor_rollout model. Implement distributed
single-program-multiple-data (SPMD) computation and encapsulate it
into APIs.
Step 2: Based on different distributed scenarios, including FSDP and 3D
parallelism in Megatron-LM, implement single-process control of data
interaction among multi-process computations.
Step 3: Utilize the encapsulated APIs to implement the control flow
Example: Online DPO
-------------------
We use verl to implement a simple online DPO algorithm. The algorithm
flow of Online DPO is as follows:
1. There is a prompt (rollout) generator which has the same weights as the actor model. After a batch of prompts is fed into the generator, it generates N responses for each prompt.
2. Send all the prompts + responses to a verifier for scoring, which can be a reward model or a rule-based function. Then sort them into pairs to form a training batch.
3. Use this training batch to train the actor model using DPO. During
the process, a reference policy is needed.
Step 1: What are the multi-machine multi-GPU computations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Sample Generator**
Implementation details:
.. code:: python

    from verl.single_controller.base import Worker
    from verl.single_controller.ray import RayWorkerGroup, RayClassWithInitArgs, RayResourcePool
    import ray

    @ray.remote
    class SampleGenerator(Worker):

        def __init__(self, config):
            super().__init__()
            self.config = config

        def generate_sequences(self, data):
            pass
Here, ``SampleGenerator`` can be viewed as a group of processes launched by
``torchrun``, with each process running the same code (SPMD).
``SampleGenerator`` needs to implement a ``generate_sequences`` API for
the control flow to call. The implementation details inside can use any
inference engine including vllm, sglang and huggingface. Users can
largely reuse the code in
verl/verl/workers/rollout/vllm_rollout/vllm_rollout.py and we won't
go into details here.
**ReferencePolicy inference**
API: compute reference log probability
.. code:: python

    from verl.single_controller.base import Worker
    import ray

    @ray.remote
    class ReferencePolicy(Worker):

        def __init__(self):
            super().__init__()
            self.model = Model()

        def infer(self, data):
            return self.model(data)
**Actor update**
API: Update actor model parameters
.. code:: python

    import torch.optim as optim
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from verl.single_controller.base import Worker
    import ray

    @ray.remote
    class DPOActor(Worker):

        def __init__(self):
            super().__init__()
            self.model = Model()
            self.model = FSDP(self.model)  # or other distributed strategy
            self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3)
            self.loss_fn = xxx

        def update(self, data):
            self.optimizer.zero_grad()
            logits = self.model(data)
            loss = self.loss_fn(logits)
            loss.backward()
            self.optimizer.step()
**Notes: How to distinguish between control processes and distributed computation processes**
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Control processes are generally functions directly decorated with
``@ray.remote``
- Computation processes are all wrapped into a ``RayWorkerGroup``.
Users can reuse most of the distributed computation logic implemented
in the PPO algorithm, including the FSDP and Megatron-LM backends, in
verl/verl/trainer/ppo.
Step 2: Based on different distributed scenarios, implement single-process control of multi-process data interaction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**The core problem to solve here is how a single process sends data to
multiple processes, drives multi-process computation, and how the
control process obtains the results of multi-process computation.**
First, we initialize the multi-process ``WorkerGroup`` in the control
process.
.. code:: python

    @ray.remote(num_cpus=1)
    def main_task(config):
        # construct SampleGenerator
        resource_pool = RayResourcePool(process_on_nodes=[8] * 2)  # 16 GPUs
        ray_cls = RayClassWithInitArgs(SampleGenerator, config=config)
        # put SampleGenerator onto resource pool
        worker_group = RayWorkerGroup(resource_pool, ray_cls)

        # construct reference policy
As we can see, in the control process, multiple processes are wrapped
into a ``RayWorkerGroup``. Inside this ``WorkerGroup``, there is a
``self._workers`` member, where each worker is a RayActor
(https://docs.ray.io/en/latest/ray-core/actors.html) of SampleGenerator.
ray_trainer.md also provides an implementation of
``MegatronRayWorkerGroup``.
Assuming the model is distributed using FSDP, and there is a batch of
data on the control process, for data parallelism, the underlying
calling process is:
.. code:: python

    data = xxx
    data_list = data.chunk(dp_size)

    output = []
    for i, d in enumerate(data_list):
        # worker_group._workers[i] is a SampleGenerator
        output.append(worker_group._workers[i].generate_sequences.remote(d))

    output = ray.get(output)
    output = torch.cat(output)
Single process calling multiple processes involves the following 3
steps:
1. Split the data into DP parts on the control process.
2. Send the data to remote, call the remote computation through RPC, and
utilize multi-process computation.
3. Obtain the computation results of each worker on the control process
and merge them.
Frequently calling these 3 steps on the controller process greatly hurts
code readability. **In verl, we have abstracted and encapsulated these 3
steps, so that the worker's method + dispatch + collect can be
registered into the worker_group**
.. code:: python

    from verl.single_controller.base.decorator import register

    def dispatch_data(worker_group, data):
        return data.chunk(worker_group.world_size)

    def collect_data(worker_group, data):
        return torch.cat(data)

    dispatch_mode = {
        'dispatch_fn': dispatch_data,
        'collect_fn': collect_data
    }

    @register(dispatch_mode=dispatch_mode)
    def generate_sequences(self, data):
        pass
In this way, we can directly call the method inside the worker through
the ``worker_group`` on the control (driver) process (which is a single
process):
.. code:: python

    output = worker_group.generate_sequences(data)
This single line includes data splitting, data distribution and
computation, and data collection.
Furthermore, the model parallelism size of each model is usually fixed,
including dp, tp and pp. For these common distributed scenarios, we have
pre-implemented specific dispatch and collect methods in `decorator.py <https://github.com/volcengine/verl/blob/main/verl/single_controller/base/decorator.py>`_, which can be used directly to wrap the computations.
.. code:: python

    from verl.single_controller.base.decorator import register, Dispatch

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def generate_sequences(self, data: DataProto) -> DataProto:
        pass
Here it requires the data interface to be ``DataProto``. Definition of
``DataProto`` is in `protocol.py <https://github.com/volcengine/verl/blob/main/verl/protocol.py>`_.
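As a quick illustration, a ``DataProto`` can be built from a plain dict of tensors and split along the batch dimension, which is what the DP dispatch modes do under the hood. The snippet below is only a sketch; the tensor keys are examples, and it assumes ``DataProto`` is importable from the top-level ``verl`` package.

.. code:: python

    # Sketch: build a DataProto from a dict of tensors, split it for DP workers,
    # and merge shards back, mirroring what dispatch/collect do under the hood.
    import torch
    from verl import DataProto

    batch = DataProto.from_single_dict({
        "input_ids": torch.randint(0, 1000, (16, 128)),
        "attention_mask": torch.ones(16, 128, dtype=torch.long),
    })
    shards = batch.chunk(4)            # one shard per DP worker
    merged = DataProto.concat(shards)  # the inverse, used when collecting results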
Step 3: Main training loop
~~~~~~~~~~~~~~~~~~~~~~~~~~
With the above training flows, we can implement the algorithm's control
flow. It is recommended that ``main_task`` is also a ray remote process.
.. code:: python

    @ray.remote(num_cpus=1)
    def main_task(config):
        # construct SampleGenerator
        resource_pool = RayResourcePool(process_on_nodes=[8] * 2)  # 16 GPUs
        ray_cls = RayClassWithInitArgs(SampleGenerator, config=config)
        # put SampleGenerator onto resource pool
        sample_gen = RayWorkerGroup(resource_pool, ray_cls)

        # construct reference policy
        ray_cls = RayClassWithInitArgs(ReferencePolicy)
        ref_policy = RayWorkerGroup(resource_pool, ray_cls)

        # construct actor
        ray_cls = RayClassWithInitArgs(DPOActor)
        dpo_policy = RayWorkerGroup(resource_pool, ray_cls)

        dataloader = DataLoader()

        for data in dataloader:
            # generate data
            data = sample_gen.generate_sequences(data)
            # generate scores for each data
            data = generate_scores(data)
            # generate pairwise data using scores
            data = generate_pairwise_data(data)
            # generate ref_log_prob
            data.batch['ref_log_prob'] = ref_policy.infer(data)
            # update using dpo
            dpo_policy.update(data)
            # logging
Here, different ``WorkerGroups`` can be placed in the same resource pool or
in different resource pools using ``create_colocated_worker_cls``,
similar to what is done in `ray_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py>`_.
Add models with the FSDP backend
==================================
Last updated: 02/09/2025.
Model
--------------------------
In principle, our FSDP backend can support any HF model, and we can
synchronize the actor model weights with vLLM using `hf_weight_loader.py` under `third_party/vllm`.
However, ``hf_weight_loader`` gathers the full state_dict of a
model during synchronization, which may cause OOM. We suggest using
``dtensor_weight_loader``, which gathers the full model parameters layer by
layer to reduce the peak memory usage. We already support the dtensor weight
loader for the models below in `dtensor_weight_loader.py` under `third_party/vllm`:
- ``GPT2LMHeadModel``
- ``LlamaForCausalLM``
- ``LLaMAForCausalLM``
- ``MistralForCausalLM``
- ``InternLMForCausalLM``
- ``AquilaModel``
- ``AquilaForCausalLM``
- ``Phi3ForCausalLM``
- ``GemmaForCausalLM``
- ``Gemma2ForCausalLM``
- ``GPTBigCodeForCausalLM``
- ``Starcoder2ForCausalLM``
- ``Qwen2ForCausalLM``
- ``DeepseekV2ForCausalLM``
To implement ``dtensor_weight_loader`` for a model that is already supported in
vLLM, follow the guide for the gemma model below:
1. Copy the ``load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]])`` method from the vLLM model class to ``dtensor_weight_loaders.py``.
2. Modify the arguments to ``(actor_weights: Dict, vllm_model: nn.Module)``.
3. Replace ``self`` with ``vllm_model``.
4. Add ``local_loaded_weight = redistribute_dtensor(param_name=name, loaded_weights=loaded_weight)`` before each ``param = params_dict[name]`` and make the subsequent weight loading use ``local_loaded_weight``.
5. Register the implemented dtensor weight loader in ``__MODEL_DTENSOR_WEIGHT_LOADER_REGISTRY__``.
.. code-block:: diff

    - def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
    + def gemma_dtensor_weight_loader(actor_weights: Dict, vllm_model: nn.Module) -> nn.Module:
          stacked_params_mapping = [
              # (param_name, shard_name, shard_id)
              ("qkv_proj", "q_proj", "q"),
              ("qkv_proj", "k_proj", "k"),
              ("qkv_proj", "v_proj", "v"),
              ("gate_up_proj", "gate_proj", 0),
              ("gate_up_proj", "up_proj", 1),
          ]
    -     params_dict = dict(self.named_parameters())
    +     params_dict = dict(vllm_model.named_parameters())
          loaded_params = set()
    -     for name, loaded_weight in weights:
    +     for name, loaded_weight in actor_weights.items():
              for (param_name, shard_name, shard_id) in stacked_params_mapping:
                  if shard_name not in name:
                      continue
                  name = name.replace(shard_name, param_name)
                  # Skip loading extra bias for GPTQ models.
                  if name.endswith(".bias") and name not in params_dict:
                      continue
    +             local_loaded_weight = redistribute_dtensor(param_name=name, loaded_weights=loaded_weight)
                  param = params_dict[name]
                  weight_loader = param.weight_loader
    -             weight_loader(param, loaded_weight, shard_id)
    +             weight_loader(param, local_loaded_weight.to(dtype=param.dtype), shard_id)
                  break
              else:
                  # lm_head is not used in vllm as it is tied with embed_token.
                  # To prevent errors, skip loading lm_head.weight.
                  if "lm_head.weight" in name:
                      continue
                  # Skip loading extra bias for GPTQ models.
                  if name.endswith(".bias") and name not in params_dict:
                      continue
    +             local_loaded_weight = redistribute_dtensor(param_name=name, loaded_weights=loaded_weight)
                  param = params_dict[name]
                  weight_loader = getattr(param, "weight_loader",
                                          default_weight_loader)
    -             weight_loader(param, loaded_weight)
    +             weight_loader(param, local_loaded_weight.to(dtype=param.dtype))
              loaded_params.add(name)
          unloaded_params = params_dict.keys() - loaded_params
          if unloaded_params:
              raise RuntimeError(
                  "Some weights are not initialized from checkpoints: "
                  f"{unloaded_params}")
Add models with the Megatron-LM backend
=========================================
Last updated: 04/25/2025.
Model
-----------
If you use the latest verl, ``GPTModel`` is directly supported for the Megatron backend.
You can follow a similar approach to how Megatron pretrains custom models.
We list the steps here:
1. Find `model_initializer.py <https://github.com/volcengine/verl/blob/main/verl/models/mcore/model_initializer.py>`_
2. If your model is configurable by ``TransformerLayerSpec``, you can directly use ``GPTModel``. Otherwise, please implement a new ``ModelLayerSpec`` and ``ModelLayer`` here.
3. Use the right ``LayerSpec``, ``TransformerConfig`` and ``HuggingfaceConfig`` as arguments to initialize the GPTModel.
4. Return the model at the end (a rough sketch follows).
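A rough sketch of steps 2-4 is shown below. It is not verl's exact ``model_initializer`` code: the Megatron-core helpers used are standard public APIs, while ``hf_config`` stands in for the HuggingFace config mentioned above.

.. code:: python

    # Rough sketch (not verl's exact code): build a Megatron-core GPTModel from a layer
    # spec, a TransformerConfig and the original model's HuggingFace config.
    from megatron.core.models.gpt.gpt_model import GPTModel
    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_with_transformer_engine_spec

    def build_gpt_model(tf_config, hf_config, pre_process=True, post_process=True):
        layer_spec = get_gpt_layer_with_transformer_engine_spec()    # step 2: the LayerSpec
        model = GPTModel(                                            # step 3: initialize GPTModel
            config=tf_config,
            transformer_layer_spec=layer_spec,
            vocab_size=hf_config.vocab_size,
            max_sequence_length=hf_config.max_position_embeddings,
            pre_process=pre_process,
            post_process=post_process,
            share_embeddings_and_output_weights=getattr(hf_config, "tie_word_embeddings", False),
        )
        return model                                                 # step 4: return the model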
# Recipe: One Step Off Policy Async Trainer
**Author:** `https://github.com/meituan-search`
Last updated: 07/17/2025.
## Introduction
### Background
The reinforcement learning training process currently implemented by verl is synchronous, adhering to the algorithmic
workflows of established methods like PPO, GRPO, and DAPO. In each step, training samples are generated by the latest
model, and the model is updated after training completes. While this approach keeps training essentially on-policy
and stabilizes RL training, it suffers from severe efficiency issues:
model updates must wait for the longest output in the generation phase to complete.
During the generation of long-tail samples, GPUs remain idle, resulting in significant underutilization.
The more severe the long-tail problem in sample generation, the lower the overall training efficiency.
For example, in DAPO 32B training, the Rollout phase accounts for approximately 70% of the total time,
and increasing resources does not reduce the Rollout duration.
![DAPO 32B Math Performance](
https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/docs/dapo_32b_math.png)
> source data: https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=nwusertongyuxuan361
### Solution
We have implemented the **One Step Off Async Trainer** to help alleviate this issue. This approach parallelizes the
generation and training processes, utilizing samples generated in the previous step for current training.
It also involves appropriately partitioning resources, allocating dedicated resources for generation while automatically
assigning the remainder to training. By reducing resources allocated to the generation phase, we mitigate GPU idle time
during long-tail sample generation. Throughout this process, the generation parameters stay one step behind the
training parameters, i.e., one-step off-policy.
![One Step Off Policy Diagram](
https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/docs/one_step_off_policy.png)
> reference: [AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning](
> https://arxiv.org/abs/2505.24298)
Our core contributions include:
1. **Parallel Generation and Training**:
Samples for the next batch are asynchronously generated while the current batch is being trained.
2. **Resource Isolation**:
Unlike `hybrid_engine`, this method requires explicit resource allocation for rollout, with remaining resources
automatically assigned to training.
3. **NCCL Parameter Synchronization**:
Employs NCCL communication primitives for seamless parameter transfer between generation and training modules.
### Experimental Results
- **Machine Configuration**: 2 nodes with 16 H20 GPUs each
- Generation: 4 GPUs
- Training: 12 GPUs
- **Model**: Qwen2.5-Math-7B
- **Rollout Configuration**:
- **Max Response Length**: FSDP2: 20,480 tokens; Megatron: 8,192 tokens
- **Algorithm**: DAPO
- **Rollout Engine**: vLLM
| training mode | engine | step | gen | wait_prev_gen | generate_sequences | old_log_prob | update_actor | total time | acc/best@32/mean | acc/maj@32/mean |
|------------------------|---------------|------|-----|---------------|--------------------|--------------|--------------|---------------|------------------|-----------------|
| colocate sync | VLLM+FSDP2 | 749 | 321 | - | 247 | 88 | 286 | 19h18m | 0.5948 | 0.417 |
| one-step-overlap async | VLLM+FSDP2 | 520 | - | 45 | 458 | 108 | 337 | 15h34m(+23%) | 0.6165 | 0.494 |
| colocate sync | VLLM+Megatron | 699 | 207 | - | 162 | 119 | 344 | 18h21m | 0.605 | 0.4217 |
| one-step-overlap async | VLLM+Megatron | 566 | - | 59 | 501 | 120 | 347 | 13h06m (+40%) | 0.6569 | 0.4038 |
* colocate sync: step ≈ gen + old_log_prob + update_actor
* one-step-overlap async: step ≈ wait_prev_gen + old_log_prob + update_actor
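As a quick consistency check against the table, the step decomposition above can be verified from the reported per-phase times; the small residual is per-step overhead not broken out in the table.

```python
# Rough consistency check of the step decomposition, using the VLLM+FSDP2 rows above.
colocate_sync = 321 + 88 + 286      # gen + old_log_prob + update_actor
one_step_off  = 45 + 108 + 337      # wait_prev_gen + old_log_prob + update_actor
print(colocate_sync, one_step_off)  # ~695 vs reported step 749, ~490 vs reported step 520
```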
![One Step Off Megatron Performance](
https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/docs/one_step_off_megatron.png)
> source data: https://wandb.ai/hou-zg-meituan/one-step-off-policy?nw=nwuserhouzg
## Implementation
### One Step Off Policy Async Pipeline
Our implemented **One Step Off Policy Async Pipeline** integrates seamlessly into existing training logic at minimal
cost,
eliminating the need for additional sample storage management. The core mechanism uses `async_gen_next_batch`
for asynchronous rollout generation while maintaining continuous operation during epoch transitions
via `create_continuous_iterator`.
```python
# iterator generator that simplifies one-step-off integration into the training loop
def _create_continuous_iterator(self):
    for epoch in range(self.config.trainer.total_epochs):
        iterator = iter(self.train_dataloader)
        for batch_dict in iterator:
            yield epoch, batch_dict

# read the next batch of samples, sync parameters and launch the async gen_seq call
def _async_gen_next_batch(self, continuous_iterator):
    # read train data
    try:
        epoch, batch_dict = next(continuous_iterator)
    except StopIteration:
        return None
    batch = DataProto.from_single_dict(batch_dict)
    gen_batch = batch_process(batch)  # pre-process the batch into generation inputs
    # sync weights from actor to rollout
    self.sync_rollout_weights()
    # async generation
    gen_batch_output = self.rollout_wg.async_generate_sequences(gen_batch)
    # encapsulate the result as a future
    return GenerationBatchFuture(epoch, batch, gen_batch_output)


continuous_iterator = self._create_continuous_iterator()
# run rollout first to achieve one-step-off
batch_data_future = self._async_gen_next_batch(continuous_iterator)
while batch_data_future is not None:
    # wait for the gen_seq result from the previous step
    batch = batch_data_future.get()
    # launch the next async call to generate sequences
    batch_data_future = self._async_gen_next_batch(continuous_iterator)

    # compute advantages
    batch = critic.compute_values(batch)
    batch = reference.compute_log_prob(batch)
    batch = reward.compute_reward(batch)
    batch = compute_advantages(batch)

    # model update
    critic_metrics = critic.update_critic(batch)
    actor_metrics = actor.update_actor(batch)
```
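The `GenerationBatchFuture` used above is essentially a thin wrapper around the in-flight rollout call. Below is a minimal sketch; the real class in the recipe may carry extra bookkeeping, and merging the generation output back via `DataProto.union` is an assumption made for illustration.

```python
# Minimal sketch of the future wrapper (the recipe's real class may differ).
import ray

class GenerationBatchFuture:
    def __init__(self, epoch, batch, gen_batch_output):
        self.epoch = epoch
        self.batch = batch                        # the original training batch
        self.gen_batch_output = gen_batch_output  # future(s) of the async rollout call

    def get(self):
        # block until the previous step's generation finishes
        gen_batch_output = ray.get(self.gen_batch_output)
        # merge generated responses back into the original batch (assumed DataProto.union)
        return self.batch.union(gen_batch_output)
```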
### Parameter Synchronization
A nice property is that our NCCL-based weight update for the rollout model performs very well:
most of the time, the latency is under 300 ms, which is negligible for RLHF.
> **sync_rollout_weights**: The time for synchronizing parameters from actor to rollout is extremely short and can
> almost be ignored, because the synchronization is implemented with NCCL.
```python
class ActorRolloutRefWorker:
    # actor acquires the meta-info of model parameters for parameter sync
    @register(dispatch_mode=Dispatch.ONE_TO_ALL)
    def get_actor_weights_info(self):
        params = self._get_actor_params()
        ret = []
        for key, tensor in params.items():
            ret.append((key, tensor.size(), tensor.dtype))
        self._weights_info = ret
        return ret

    # rollout sets the meta-info of model parameters for parameter sync
    @register(dispatch_mode=Dispatch.ONE_TO_ALL)
    def set_actor_weights_info(self, weights_info):
        self._weights_info = weights_info


class AsyncRayPPOTrainer(RayPPOTrainer):
    def init_workers(self):
        ...
        # rollout obtains the meta-info of model parameters from the actor for parameter sync
        weights_info = self.actor_wg.get_actor_weights_info()[0]
        self.rollout_wg.set_actor_weights_info(weights_info)

        # create an actor-rollout communication group for parameter sync
        actor_rollout_workers = self.actor_wg.workers + self.rollout_wg.workers
        collective.create_collective_group(
            actor_rollout_workers,
            len(actor_rollout_workers),
            list(range(0, len(actor_rollout_workers))),
            backend="nccl",
            group_name="actor_rollout",
        )
```
```python
# the driver process calls the actor and rollout respectively to sync parameters via NCCL
def sync_rollout_weights(self):
    self.actor_wg.sync_rollout_weights()
    ray.get(self.rollout_wg.sync_rollout_weights())


# FSDP model parameter sync (worker side)
@register(dispatch_mode=Dispatch.ONE_TO_ALL, blocking=False)
def sync_rollout_weights(self):
    params = self._get_actor_params() if self._is_actor else None
    if self._is_rollout:
        inference_model = (
            self.rollout.inference_engine.llm_engine.model_executor.driver_worker.worker.model_runner.model
        )
        patch_vllm_moe_model_weight_loader(inference_model)
    # model parameters are broadcast tensor-by-tensor from actor to rollout
    for key, shape, dtype in self._weights_info:
        tensor = torch.empty(shape, dtype=dtype, device=get_torch_device().current_device())
        if self._is_actor:
            assert key in params
            origin_data = params[key]
            if hasattr(origin_data, "full_tensor"):
                origin_data = origin_data.full_tensor()
            if torch.distributed.get_rank() == 0:
                tensor.copy_(origin_data)
        from ray.util.collective import collective
        collective.broadcast(tensor, src_rank=0, group_name="actor_rollout")
        if self._is_rollout:
            inference_model.load_weights([(key, tensor)])
```
## Usage
### FSDP2 Configuration Example
```shell
# actor and rollout are placed separately: actor_rollout_ref.hybrid_engine=False
# actor and rollout resources: trainer.* for training, rollout.* for generation
python3 -m recipe.one_step_off_policy.async_main_ppo \
    --config-path=config \
    --config-name='one_step_off_ppo_trainer.yaml' \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.hybrid_engine=False \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=6 \
    rollout.nnodes=1 \
    rollout.n_gpus_per_node=2
```
### Megatron Configuration Example
```shell
# actor and rollout are placed separately: actor_rollout_ref.hybrid_engine=False
# actor and rollout resources: trainer.* for training, rollout.* for generation
python3 -m recipe.one_step_off_policy.async_main_ppo \
    --config-path=config \
    --config-name='one_step_off_ppo_megatron_trainer.yaml' \
    actor_rollout_ref.actor.strategy=megatron \
    actor_rollout_ref.hybrid_engine=False \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=6 \
    rollout.nnodes=1 \
    rollout.n_gpus_per_node=2
```
### Configuration Guidelines
1. **GPU Count Relationships**
Maintain either of these relationships for optimal batch distribution:
- `actor_rollout_ref.rollout.n` should be an integer divisor of:
`trainer.n_gpus_per_node * trainer.nnodes`
- `actor_rollout_ref.rollout.n * data.train_batch_size` should be evenly divisible by:
`trainer.n_gpus_per_node * trainer.nnodes`
> Rationale: this ensures training samples can be evenly distributed across training GPUs when only part of the
> resources is used for generation (see the sanity-check sketch after this section).
2. **Dynamic Resource Tuning**
Adjust `trainer.nnodes`, `trainer.n_gpus_per_node`, `rollout.nnodes` and `rollout.n_gpus_per_node` based on phase
durations:
- **Ideal state**: Rollout and training phases have comparable durations
- **Diagnostic metrics**:
- Monitor `wait_prev_gen` duration
- Analyze `sequence_length` distribution
- **Adjustment strategy**:
- High `wait_prev_gen` + uniform sequence lengths → Increase rollout resources
- High `wait_prev_gen` + long-tail sequences → Optimize stopping criteria (resource increase won't help)
> **wait_prev_gen**: The time spent waiting for the previous rollout to finish (the part that is not fully
> overlapped).
**Resource Configuration Strategies:**
- **Resource-constrained scenario**: Optimize resource utilization by adjusting GPU allocation ratios,
keeping the number of nodes equal to allow training and rollout to share nodes;
- Configure `trainer.nnodes = rollout.nnodes` with
`trainer.n_gpus_per_node + rollout.n_gpus_per_node = physical_gpus_per_node`. Control rollout resource
allocation by adjusting `n_gpus_per_node`.
- **Resource-abundant scenario**: Optimize performance by adjusting the number of nodes,
keeping the number of GPUs per node equal to enable independent scaling of training and rollout
parallelism.
- Configure `trainer.n_gpus_per_node = rollout.n_gpus_per_node` and control rollout resource allocation by
adjusting `trainer.nnodes` and `rollout.nnodes` to achieve optimal performance.
> **Note**: The total number of nodes required by the system is not simply `trainer.nnodes + rollout.nnodes`. The
> actual calculation depends on GPU capacity:
> - When `trainer.n_gpus_per_node + rollout.n_gpus_per_node <= physical_gpus_per_node`,
> the required node count is `max(trainer.nnodes, rollout.nnodes)`
> - When `trainer.n_gpus_per_node + rollout.n_gpus_per_node > physical_gpus_per_node`,
> the required node count is `trainer.nnodes + rollout.nnodes`
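The small helper below is a sketch for sanity-checking a configuration against both guidelines above (batch divisibility and required node count) before launching; the numbers in the usage line mirror the FSDP2 example and are illustrative only, not part of the recipe.

```python
# Sketch: sanity-check a one-step-off resource configuration before launching.
def check_one_step_off_config(train_nnodes, train_gpus_per_node,
                              rollout_nnodes, rollout_gpus_per_node,
                              rollout_n, train_batch_size,
                              physical_gpus_per_node=8):
    train_gpus = train_nnodes * train_gpus_per_node
    # guideline 1: generated samples must split evenly across training GPUs
    assert (rollout_n * train_batch_size) % train_gpus == 0, "uneven sample distribution"
    # node-count rule: nodes can be shared only if both roles fit on one physical machine
    if train_gpus_per_node + rollout_gpus_per_node <= physical_gpus_per_node:
        return max(train_nnodes, rollout_nnodes)
    return train_nnodes + rollout_nnodes

# e.g. 6 training GPUs + 2 rollout GPUs on 8-GPU nodes -> 1 node is enough
print(check_one_step_off_config(1, 6, 1, 2, rollout_n=4, train_batch_size=96))
```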
## Functional Support
| Category | Support Situation |
|--------------------|-----------------------------------------------------------------------------------------------------------------|
| train engine | FSDP2 <br/> Megatron |
| rollout engine | vLLM |
| AdvantageEstimator | GRPO <br/> GRPO_PASSK <br/> REINFORCE_PLUS_PLUS <br/> RLOO <br/> OPO <br/> REINFORCE_PLUS_PLUS_BASELINE<br/>GPG |
| Reward | all |
Ray API Design Tutorial
=======================================
Last updated: 10/30/2024.
We provide a tutorial for our Ray API design, including:
- Ray basic concepts
- Resource Pool and RayWorkerGroup
- Data Dispatch, Execution and Collection
- Initialize the RayWorkerGroup and execute the distributed computation in the given Resource Pool
See details in `tutorial.ipynb <https://github.com/volcengine/verl/blob/main/examples/ray/tutorial.ipynb>`_.
RL(HF) algorithms with LoRA Support
===========================================
Last updated: 06/05/2025.
We support LoRA (Low-Rank Adaptation) for reinforcement learning algorithms such as PPO, GRPO, and others.
LoRA is a parameter-efficient fine-tuning technique that injects trainable low-rank matrices into pre-trained weights (typically linear layers). This reduces memory footprint and compute cost, making it possible to fine-tune large models with limited hardware.
The benefits this brings include:
- reinforcement learning with very large models (e.g., 70B+) on modest hardware (e.g., 8x80GB GPUs),
- larger batch sizes thanks to the reduced memory usage,
- easier model transfer and deployment, as only the LoRA adapters need to be saved,
- the ability to combine with techniques like `SLoRA <https://arxiv.org/abs/2311.03285>`_ or `CCoE <https://arxiv.org/abs/2407.11686>`_ to serve multiple LoRA adapters efficiently.
This guide explains how to enable LoRA in RL training and configure related parameters.
Usage Guide
------------------------
1. LoRA is available in `verl.trainer.ppo.ray_trainer.RayPPOTrainer`. Examples are provided via the `verl.trainer.main_ppo` entry point.
2. Currently, LoRA is supported via HuggingFace peft, and only with the fsdp/fsdp2 and vllm backends (sglang support coming soon).
- `strategy=fsdp` or `strategy=fsdp2`
- `rollout.name=vllm`
3. Required configurations for LoRA:
- `actor_rollout_ref.model.lora_rank`: int, set to a reasonable value greater than 0 (e.g., 8, 16, 32, 64)
- `actor_rollout_ref.model.lora_alpha`: float, the alpha term in LoRA
- `actor_rollout_ref.rollout.load_format="safetensors"`: required. This enables vLLM to load the base model.
- `actor_rollout_ref.model.target_modules`: the target modules for LoRA. Typically set to "all-linear".
4. Recommended options:
- `actor_rollout_ref.model.use_shm=True`: preload the model into `/dev/shm` to improve model loading speed.
- `actor_rollout_ref.rollout.layered_summon=True`: this lets the actor model gather the FSDP shards layer by layer when synchronizing the LoRA adapter to vLLM, thereby reducing peak GPU memory. Recommended if the model is very large (70B+) or GPU memory is limited (< 48GB).
Best Practices and Notes
-------------------------
1. **Learning rate**: it is recommended to increase the learning rate by an order of magnitude.
2. **LoRA Rank**:
- Too small a rank can hurt convergence.
- LoRA rank recommendation from @thelongestusernameofall:
- A very small lora_rank can lead to slower convergence or worse training performance. It is recommended to set lora_rank to >= 32. Tests have shown that for a 0.5B model, with lora_rank=32, the training convergence speed and final performance are almost identical to non-LoRA training.
- For a 32B model, with lora_rank=128, the training convergence speed and final performance are also almost identical to non-LoRA training.
- More comprehensive reference results are coming soon.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/f2b80b8b26829124dd393b7a795a0640eff11644/docs/lora.jpg?raw=true
3. Reference configuration for RL training with the Qwen2.5-72B model using 8 x 80GB GPUs (increase lora_rank if needed):
.. code-block::

    data.train_batch_size=64 \
    actor_rollout_ref.model.use_shm=True \
    actor_rollout_ref.model.lora_rank=32 \
    actor_rollout_ref.model.lora_alpha=32 \
    actor_rollout_ref.model.target_modules=all-linear \
    actor_rollout_ref.actor.optim.lr=3e-5 \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=8 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.rollout.max_num_seqs=64 \
    actor_rollout_ref.rollout.max_model_len=1536 \
    actor_rollout_ref.rollout.max_num_batched_tokens=1536 \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
Example Script
-------------------
For an end-to-end example, refer to the script below:
examples/grpo_trainer/run_qwen2_5-3b_gsm8k_grpo_lora.sh
Trace Function Usage Instructions
========================================
Last updated: 07/10/2025.
Applicable Scenarios
--------------------
Agentic RL involves multiple turns of conversation, tool invocations, and user interactions during the rollout process. During model training, it is necessary to track function calls, their inputs, and their outputs to understand how data flows through the application. By recording the inputs, outputs, and timestamps of functions, the Trace feature lets you inspect, in complex multi-turn conversations, how the data is transformed at each interaction and how the final output is produced, which helps you understand how the model processes data and optimize training results.
The Trace feature integrates commonly used agent tracing tools; wandb Weave and MLflow are already supported. Users can choose the appropriate tool according to their needs and preferences. Below, we introduce the usage of each tool.
Trace Parameter Configuration
-----------------------------
- ``actor_rollout_ref.rollout.trace.backend=mlflow|weave`` # the trace backend type
- ``actor_rollout_ref.rollout.trace.token2text=True`` # To show decoded text in trace view
Glossary
--------
+----------------+------------------------------------------------------------------------------------------------------+
| Object | Explanation |
+================+======================================================================================================+
| trajectory | A complete multi-turn conversation includes: |
| | 1. LLM output at least once |
| | 2. Tool Call |
+----------------+------------------------------------------------------------------------------------------------------+
| step | The training step corresponds to the global_steps variable in the trainer |
+----------------+------------------------------------------------------------------------------------------------------+
| sample_index | The identifier of the sample, defined in the extra_info.index of the dataset. It is usually a number,|
| | but may also be a uuid in some cases. |
+----------------+------------------------------------------------------------------------------------------------------+
| rollout_n | In the GRPO algorithm, each sample is rolled out n times. rollout_n represents the serial number of |
| | the rollout. |
+----------------+------------------------------------------------------------------------------------------------------+
| validate | Whether the trajectory comes from evaluation on the test dataset. |
+----------------+------------------------------------------------------------------------------------------------------+
Rollout trace functions
-----------------------
There are 2 functions used for tracing:
1. ``rollout_trace_op``: a decorator used to mark functions for tracing. By default, only a few methods have it; you can add it to more functions to trace more information.
2. ``rollout_trace_attr``: marks the entry of a trajectory and attaches some metadata to the trace. If you add a new type of agent, you may need to add it to enable tracing.
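A hedged sketch of how the two helpers fit together is shown below. The import path and the ``rollout_trace_attr`` keyword arguments are assumptions based on the glossary above; check your verl version for the exact signatures.

.. code:: python

    # Hedged sketch: the import path and keyword names below are assumptions based on
    # this page's glossary; consult your verl version for the exact signatures.
    from verl.utils.rollout_trace import rollout_trace_op, rollout_trace_attr

    class MyToolAgentLoop:
        @rollout_trace_op                     # record this call's inputs, outputs and timing
        async def call_tool(self, tool_name, tool_args):
            ...

        async def run(self, sample_index, rollout_n, global_steps):
            # mark the entry of one trajectory so all spans inside are grouped together
            with rollout_trace_attr(step=global_steps, sample_index=sample_index,
                                    rollout_n=rollout_n, validate=False, name="agent_loop"):
                return await self.call_tool("search", {"query": "..."})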
Usage of wandb weave
--------------------
1.1 Basic Configuration
~~~~~~~~~~~~~~~~~~~~~~~
1. Set the ``WANDB_API_KEY`` environment variable
2. Configuration Parameters
1. ``actor_rollout_ref.rollout.trace.backend=weave``
2. ``trainer.logger=['console', 'wandb']``: This item is optional. Trace and logger are independent functions. When using Weave, it is recommended to also enable the wandb logger to implement both functions in one system.
3. ``trainer.project_name=$project_name``
4. ``trainer.experiment_name=$experiment_name``
5. ``actor_rollout_ref.rollout.mode=async``: Since trace is mainly used for agentic RL, you need to enable the agent loop by using async mode with either vllm or sglang.
Note:
The Weave Free Plan comes with a default monthly network traffic allowance of 1GB. During the training process, the amount of trace data generated is substantial, reaching dozens of gigabytes per day, so it is necessary to select an appropriate wandb plan.
1.2 View Trace Logs
~~~~~~~~~~~~~~~~~~~
After executing the training, on the project page, you can see the WEAVE sidebar. Click Traces to view it.
Each Trace project corresponds to a trajectory. You can filter and select the trajectories you need to view by step, sample_index, rollout_n, and experiment_name.
After enabling token2text, prompt_text and response_text will be automatically added to the output of ToolAgentLoop.run, making it convenient to view the input and output content.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/weave_trace_list.png?raw=true
1.3 Compare Trace Logs
~~~~~~~~~~~~~~~~~~~~~~
Weave can select multiple trace items and then compare the differences among them.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/weave_trace_compare.png?raw=true
Usage of mlflow
---------------
1. Basic Configuration
~~~~~~~~~~~~~~~~~~~~~~
1. Set the ``MLFLOW_TRACKING_URI`` environment variable, which can be:
1. Http and https URLs corresponding to online services
2. Local files or directories, such as ``sqlite:////tmp/mlruns.db``, indicate that data is stored in ``/tmp/mlruns.db``. When using local files, it is necessary to initialize the file first (e.g., start the UI: ``mlflow ui --backend-store-uri sqlite:////tmp/mlruns.db``) to avoid conflicts when multiple workers create files simultaneously.
2. Configuration Parameters
1. ``actor_rollout_ref.rollout.trace.backend=mlflow``
2. ``trainer.logger=['console', 'mlflow']``. This item is optional. Trace and logger are independent functions. When using mlflow, it is recommended to also enable the mlflow logger to implement both functions in one system.
3. ``trainer.project_name=$project_name``
4. ``trainer.experiment_name=$experiment_name``
2. View Log
~~~~~~~~~~~
Since ``trainer.project_name`` corresponds to Experiments in mlflow, in the mlflow view, you need to select the corresponding project name, then click the "Traces" tab to view traces. Among them, ``trainer.experiment_name`` corresponds to the experiment_name of tags, and tags corresponding to step, sample_index, rollout_n, etc., are used for filtering and viewing.
For example, searching for ``"tags.step = '1'"`` can display all trajectories of step 1.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/mlflow_trace_list.png?raw=true
Opening one of the trajectories allows you to view each function call process within it.
After enabling token2text, prompt_text and response_text will be automatically added to the output of ToolAgentLoop.run, making it convenient to view the content.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/mlflow_trace_view.png?raw=true
Note:
1. mlflow does not support comparing multiple traces
2. rollout_trace cannot associate the mlflow trace with the run, so the trace content cannot be seen in the mlflow run logs.
RoPE Scaling override
=======================================
Last updated: 05/14/2025.
Some models such as `Qwen/Qwen2.5-7B-Instruct <https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#processing-long-texts>`_ support RoPE Scaling but don't have it defined in their config.json file.
For example, this model supports this configuration:
.. code:: python

    {
        ...,
        "rope_scaling": {
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
            "type": "yarn"
        }
    }
In order to support a longer context for such models, you must override the model configs when starting the trainer.
PPO example:
.. code:: bash

    +actor_rollout_ref.model.override_config.rope_scaling.type=yarn \
    +actor_rollout_ref.model.override_config.rope_scaling.factor=4.0 \
    +actor_rollout_ref.model.override_config.rope_scaling.original_max_position_embeddings=32768 \
And for the critic model:
.. code:: bash

    +critic.model.override_config.rope_scaling.type=yarn \
    +critic.model.override_config.rope_scaling.factor=4.0 \
    +critic.model.override_config.rope_scaling.original_max_position_embeddings=32768 \
# Algorithm Baselines
Last updated: 06/18/2025.
## Math related datasets
### GSM8k
Assuming the GSM8k/MATH datasets are preprocessed via:
```bash
python3 examples/data_preprocess/*.py
```
Refer to the table below to reproduce RL training from different pre-trained checkpoints. Results are on the GSM8k dataset unless specified otherwise. More comprehensive benchmark results are available in the recipe folder.
| Hardware | Model | Method | Test score | Details |
|-------------|----------------------------------|-------------------|--------------|---------|
| NVIDIA GPU | google/gemma-2-2b-it | hf checkpoint | 23.9 | [Huggingface](https://huggingface.co/google/gemma-2-2b-it#benchmark-results) |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log) |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log), [wandb](https://api.wandb.ai/links/verl-team/h7ux8602) |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | hf checkpoint | 36.4 | [Qwen blog](https://qwenlm.github.io/blog/qwen2.5-llm/) |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | [command and log](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log) |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | [script](https://github.com/volcengine/verl/blob/main/recipe/prime/run_prime_qwen.sh), [wandb](https://api.wandb.ai/links/zefan-wang-thu-tsinghua-university/rxd1btvb) |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz64_2-prompt512-resp1024-lorarank32-score0.543.log)|
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-1.5B-bsz64_2-prompt512-resp1024-lorarank32-score0.779.log)|
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-3B-bsz64_2-prompt512-resp1024-lorarank32-score0.861.log)|
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | [log](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/deepseek-llm-7b-chat-megatron-bsz256_4-prompt512-resp512-0.695.log), [wandb](https://wandb.ai/verl-team/verl_megatron_gsm8k_examples/runs/10fetyr3) |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | [script](https://github.com/volcengine/verl/blob/a65c9157bc0b85b64cd753de19f94e80a11bd871/examples/grpo_trainer/run_qwen2-7b_seq_balance.sh) |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | [log](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/qwen2-7b-fsdp2.log) |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | [log](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/qwen2-7b_math_megatron.log) |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | [script](https://github.com/eric-haibin-lin/verl/blob/main/examples/remax_trainer/run_qwen2.5-3b_seq_balance.sh), [wandb](https://wandb.ai/liziniu1997/verl_remax_example_gsm8k/runs/vxl10pln) |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | [SPPO script](https://github.com/volcengine/verl/tree/main/recipe/sppo/README.md) |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-7B-bsz64_8-prompt512-resp1024-lorarank32-score0.934.log)|
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | [Qwen Blog](https://qwenlm.github.io/blog/qwen2.5-llm/) |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | [wandb](https://api.wandb.ai/links/ppo_dev/sbuiuf2d) |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | [script](https://github.com/volcengine/verl/tree/main/recipe/spin/README.md) |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88 | [log](https://github.com/diqiuzhuanzhuan/verldata/blob/main/run_logs/qwen2-7b_math.log), [wandb](https://wandb.ai/diqiuzhuanzhuan/verl_gpg_example_gsm8k_math/runs/ab86c4va) |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88 | [log](https://github.com/diqiuzhuanzhuan/verldata/blob/main/run_logs/qwen2-7b_math_megatron.log), [wandb](https://wandb.ai/diqiuzhuanzhuan/verl_gpg_example_gsm8k_math/runs/yy8bheu8) |
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | 65.4 (GEO3k) | [script](https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen2_5_vl-7b-megatron.sh), [wandb](https://api.wandb.ai/links/megatron-core-moe-dev/1yngvkek) |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | [log](https://github.com/yushengsu-thu/verl_training_log/blob/main/gsm8k/ppo_run_deepseek7b_llm.log) |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | [log](https://github.com/yushengsu-thu/verl_training_log/blob/main/gsm8k/grpo_run_deepseek7b_llm.log) |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-14B-bsz64_8-prompt512-resp1024-lorarank32-score0.946.log)|
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-32B-bsz64_8-prompt512-resp1024-lorarank32-score0.958.log)|
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | [command and logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-72B-bs64_8-prompt512-resp1024-lorarank32-score0.960.log)|
### DAPO math-17k
- Training DAPO math-17k dataset: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
- Testing: AIME'24: https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024
Note:
- For Qwen/Qwen2.5-Math-7B, we directly modify max_position_embeddings to 32768, without observing performance degradation, in order to train with longer response lengths.
| Hardware | Model | Method | Test score | Details |
|-------------|-----------------------------|-------------------------|------------|---------|
| NVIDIA GPU | Qwen/Qwen2.5-Math-7B (32k) | DAPO | 36.3 | [command](https://github.com/volcengine/verl/blob/main/recipe/dapo/test_dapo_7b_math.sh), [logs](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/runs/ow47vvon?nw=nwusertongyuxuan361)|
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | DAPO + Code Interpreter | 40.0 | [command](https://github.com/volcengine/verl/blob/main/recipe/retool/run_qwen2_7b_dapo.sh)|
## Coding related datasets
Below are the results on the LeetCode dataset unless specified otherwise.
| Hardware | Model | Method | Test score | Details |
|-------------|----------------------------------|-------------------|--------------|---------|
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | [script](https://github.com/volcengine/verl/blob/main/recipe/prime/run_prime_qwen_code.sh), [swanlab](https://swanlab.cn/@wangzefan/prime_example/runs/7f541qhspgmy8nmhdlx35/chart) |
### Notes
[1] During evaluation, we have only extracted answers following the format `"####"`. A more flexible answer extraction, longer response length, and better prompt engineering may lead to a higher score.
[2] The default value of `actor_rollout_ref.actor.entropy_coeff` is set to `0.0` since verl 0.3.x on 2025-05-30, which is different from previous versions.
# Recipe: Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
Last updated: 06/19/2025.
> Open-Source Algorithm Implementation & Experiment Running: [Yuxuan Tong](https://tongyx361.github.io/), [Guangming Sheng](https://hk.linkedin.com/in/guangming-sheng-b50640211)
🏠 [Homepage](https://dapo-sia.github.io/) | 📝 [Paper@arXiv](https://arxiv.org/abs/2503.14476) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/BytedTsinghua-SIA/dapo-67d7f1517ee33c8aed059da0) | 🐱 [Code@GitHub](https://github.com/volcengine/verl/tree/recipe/dapo/recipe/dapo) | 🐱 [Repo@GitHub](https://github.com/BytedTsinghua-SIA/DAPO)
> We propose the **D**ecoupled Clip and Dynamic s**A**mpling **P**olicy **O**ptimization (DAPO) algorithm. By making our work publicly available, we provide the broader research community and society with practical access to scalable reinforcement learning, enabling all to benefit from these advancements. Our system is based on the awesome [verl](https://github.com/volcengine/verl) framework. Thanks for their great work! Applying DAPO training to the Qwen2.5-32B base model outperforms the previous state-of-the-art DeepSeek-R1-Zero-Qwen-32B on AIME 2024, achieving **50%** accuracy with **50%** fewer training steps.
>
> ![dapo-main-result](https://dapo-sia.github.io/static/images/score.png)
## Quickstart
1. Prepare the datasets **on the Ray cluster**:
```bash
bash prepare_dapo_data.sh # This downloads the datasets to ${HOME}/verl/data by default
```
2. Submit the job to the Ray cluster **from any machine**:
```bash
cd verl # Repo root
export RAY_ADDRESS="http://${RAY_IP:-localhost}:8265" # The Ray cluster address to connect to
export WORKING_DIR="${PWD}" # The local directory to package to the Ray cluster
# Set the runtime environment like env vars and pip packages for the Ray cluster in yaml
export RUNTIME_ENV="./recipe/dapo/runtime_env.yaml" # This sets environment variables for the Ray cluster
bash recipe/dapo/run_dapo_qwen2.5_32b.sh # or other scripts
```
## Reproduction Runs
| Setup | AIME 2024 Acc. | Hardware | Image | Commit | Environment Variables | Training Script | Training Record |
| -------------------------------------------- | -------------- | --------- | -------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| DAPO | 52% | 16x8xH800 | `hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.3-flashinfer0.2.2-cxx11abi0` | [`4f80e4`](https://github.com/volcengine/verl/tree/4f80e465c2ec79ab9c3c30ec74b9745de61d0490) | [runtime_env.yaml](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/runtime_env.yaml) | [run_dapo_qwen2.5_32b.sh](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/run_dapo_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n) |
| DAPO w/o Dynamic Sampling | 50% | 16x8xH800 | `hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.3-flashinfer0.2.2-cxx11abi0` | [`4f80e4`](https://github.com/volcengine/verl/tree/4f80e465c2ec79ab9c3c30ec74b9745de61d0490) | [runtime_env.yaml](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/runtime_env.yaml) | [run_dapo_wo_ds_qwen2.5_32b.sh](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/run_dapo_wo_ds_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n) |
| DAPO w/o Token-level Loss & Dynamic Sampling | 44% | 16x8xH20 | `hiyouga/verl:ngc-th2.5.1-cu120-vllm0.7.4-hotfix` | [`4f80e4`](https://github.com/volcengine/verl/tree/4f80e465c2ec79ab9c3c30ec74b9745de61d0490) | [runtime_env.yaml](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/runtime_env.yaml) | [run_dapo_early_qwen2.5_32b.sh](https://github.com/volcengine/verl/blob/4f80e465c2ec79ab9c3c30ec74b9745de61d0490/recipe/dapo/run_dapo_early_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n) |
> [!IMPORTANT]
>
> **📢 Call for Contribution!**
>
> Welcome to submit your reproduction runs and setups!
## Configuration
### Separated Clip Epsilons (-> Clip-Higher)
An example configuration:
```yaml
actor_rollout_ref:
actor:
clip_ratio_low: 0.2
clip_ratio_high: 0.28
```
`clip_ratio_low` and `clip_ratio_high` specify the $\varepsilon_{\text{low}}$ and $\varepsilon_{\text{high}}$ in the DAPO objective.
Core relevant code:
```python
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
pg_losses = torch.maximum(pg_losses1, pg_losses2)
```
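To make the asymmetric clipping concrete, here is a small, self-contained toy example (not part of the recipe; the tensor values are made up) showing that positive-advantage tokens keep contributing loss up to a ratio of `1 + clip_ratio_high = 1.28`, while the lower clip stays at `1 - clip_ratio_low = 0.8`:
```python
import torch

# Toy values; in the recipe these come from the rollout and the advantage estimator.
ratio = torch.tensor([0.7, 1.0, 1.25, 1.5])
advantages = torch.tensor([1.0, 1.0, 1.0, 1.0])  # positive-advantage tokens
cliprange_low, cliprange_high = 0.2, 0.28

pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
pg_losses = torch.maximum(pg_losses1, pg_losses2)
print(pg_losses)  # tensor([-0.7000, -1.0000, -1.2500, -1.2800])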
### Dynamic Sampling (with Group Filtering)
An example configuration:
```yaml
data:
gen_batch_size: 1536
train_batch_size: 512
algorithm:
filter_groups:
enable: True
metric: acc # score / seq_reward / seq_final_reward / ...
max_num_gen_batches: 10 # Non-positive values mean no upper limit
```
Setting `filter_groups.enable` to `True` will filter out groups whose outputs' `metric` are all the same, e.g., for `acc`, groups whose outputs' accuracies are all 1 or 0.
The trainer will repeat sampling with `gen_batch_size` until there are enough qualified groups to fill `train_batch_size`, or until the upper limit specified by `max_num_gen_batches` is reached.
Core relevant code:
```python
prompt_bsz = self.config.data.train_batch_size
if num_prompt_in_batch < prompt_bsz:
print(f'{num_prompt_in_batch=} < {prompt_bsz=}')
num_gen_batches += 1
max_num_gen_batches = self.config.algorithm.filter_groups.max_num_gen_batches
if max_num_gen_batches <= 0 or num_gen_batches < max_num_gen_batches:
print(f'{num_gen_batches=} < {max_num_gen_batches=}. Keep generating...')
continue
else:
raise ValueError(
f'{num_gen_batches=} >= {max_num_gen_batches=}. Generated too many. Please check your data.'
)
else:
# Align the batch
traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
batch = batch[:traj_bsz]
```
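Conceptually, group filtering keeps only the prompts whose sampled responses disagree on the chosen `metric`. The following is a minimal, self-contained sketch of that idea (the variable names are illustrative, not verl's actual data structures):
```python
from collections import defaultdict

import numpy as np

# Illustrative inputs: one uid per prompt, repeated rollout.n times, and one
# metric value (e.g., accuracy) per sampled response.
uids =    ["p0", "p0", "p0", "p1", "p1", "p1", "p2", "p2", "p2"]
metrics = [ 1.0,  1.0,  1.0,  1.0,  0.0,  0.0,  0.0,  0.0,  0.0]

metric_by_uid = defaultdict(list)
for uid, m in zip(uids, metrics):
    metric_by_uid[uid].append(m)

# Keep a group only if its metric values are not all identical (std > 0),
# i.e., the group still carries a learning signal.
kept_uids = {uid for uid, ms in metric_by_uid.items() if np.std(ms) > 0}
kept_indices = [i for i, uid in enumerate(uids) if uid in kept_uids]
print(kept_uids)     # {'p1'}
print(kept_indices)  # [3, 4, 5]
```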
### Flexible Loss Aggregation Mode (-> Token-level Loss)
An example configuration:
```yaml
actor_rollout_ref:
actor:
loss_agg_mode: "token-mean" # / "seq-mean-token-sum" / "seq-mean-token-mean"
# NOTE: "token-mean" is the default behavior
```
Setting `loss_agg_mode` to `token-mean` will average the (policy gradient) loss across all the tokens in all the sequences in a mini-batch.
Core relevant code:
```python
if loss_agg_mode == "token-mean":
loss = verl_F.masked_mean(loss_mat, loss_mask)
elif loss_agg_mode == "seq-mean-token-sum":
seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) # token-sum
loss = torch.mean(seq_losses) # seq-mean
elif loss_agg_mode == "seq-mean-token-mean":
seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) / torch.sum(loss_mask, dim=-1) # token-mean
loss = torch.mean(seq_losses) # seq-mean
else:
raise ValueError(f"Invalid loss_agg_mode: {loss_agg_mode}")
```
### Overlong Reward Shaping
An example configuration:
```yaml
data:
max_response_length: 20480 # 16384 + 4096
reward_model:
overlong_buffer:
enable: True
len: 4096
penalty_factor: 1.0
```
Setting `overlong_buffer.enable` to `True` will penalize the outputs whose lengths are overlong but still within the hard context limit.
Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` as the output length exceeds `max_response_length - overlong_buffer.len` by `0` to `overlong_buffer.len` tokens, i.e., the full penalty is reached at the hard limit `max_response_length`.
Core relevant code:
```python
if self.overlong_buffer_cfg.enable:
overlong_buffer_len = self.overlong_buffer_cfg.len
expected_len = self.max_resp_len - overlong_buffer_len
exceed_len = valid_response_length - expected_len
overlong_penalty_factor = self.overlong_buffer_cfg.penalty_factor
overlong_reward = min(-exceed_len / overlong_buffer_len * overlong_penalty_factor, 0)
reward += overlong_reward
```
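For the example configuration above (`max_response_length: 20480`, `overlong_buffer.len: 4096`, `penalty_factor: 1.0`), the expected length is 20480 - 4096 = 16384 tokens and the penalty grows linearly to -1.0 at the hard limit. A small standalone illustration of the same formula:
```python
def overlong_penalty(valid_response_length: int,
                     max_resp_len: int = 20480,
                     overlong_buffer_len: int = 4096,
                     penalty_factor: float = 1.0) -> float:
    """Linear penalty once the response enters the overlong buffer zone."""
    expected_len = max_resp_len - overlong_buffer_len  # 16384
    exceed_len = valid_response_length - expected_len
    return min(-exceed_len / overlong_buffer_len * penalty_factor, 0)

for length in (16000, 16384, 18432, 20480):
    print(length, overlong_penalty(length))
# 16000 0       (within the expected length, no penalty)
# 16384 -0.0    (boundary of the buffer zone)
# 18432 -0.5    (half-way into the buffer)
# 20480 -1.0    (at the hard limit, full penalty)
```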
## FAQ
### Where is the "Overlong Filtering" in the paper?
Most experiments in the paper, including the best-performing one, are run without Overlong Filtering, because its effect overlaps with Overlong Reward Shaping in terms of properly learning from the longest outputs. So we don't implement it here.
### What's the difference between [the `recipe/dapo` directory in the `main` branch](https://github.com/volcengine/verl/tree/main/recipe/dapo) and the [`recipe/dapo` branch](https://github.com/volcengine/verl/tree/recipe/dapo/recipe/dapo)?
[The `recipe/dapo` branch](https://github.com/volcengine/verl/tree/recipe/dapo/recipe/dapo) is for **as-is reproduction** and thus won't be updated with new features.
[The `recipe/dapo` directory in the `main` branch](https://github.com/volcengine/verl/tree/main/recipe/dapo) works as an example of how to extend the latest `verl` to implement an algorithm recipe, which will be maintained with new features.
### Why can't I produce similar results after modifications?
RL infrastructure today still has inherent robustness issues, which we are working hard to improve.
We strongly recommend modifying only one thing at a time.
We also list some known problems here:
1. Enabling CUDA graph (`enforce_eager=False`) might cause model performance degradation; the cause is still under investigation.
# Recipe: Entropy Mechanism
Last updated: 06/27/2025.
<div align="center">
The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning.
[![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/pdf/2505.22617) [![Github](https://img.shields.io/badge/PRIME-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/PRIME-RL/Entropy-Mechanism-of-RL) [![alphaXiv](https://img.shields.io/badge/discussion-A42C25?style=for-the-badge&logo=arxiv&logoColor=white&color=blue)](https://www.alphaxiv.org/abs/2505.22617) [![Twitter](https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge&logo=twitter&logoColor=white)](https://x.com/stingning/status/1928088554166505667) [![Twitter](https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge&logo=twitter&logoColor=white)](https://x.com/charlesfornlp/status/1928089451080585283) [![Twitter-ak](https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge&logo=twitter&logoColor=white)](https://x.com/_akhaliq/status/1928077929105268861)
<div align="center" style="font-family: Arial, sans-serif;">
<p>
<a href="#🎉news" style="text-decoration: none; font-weight: bold;">🎉 News</a>
<a href="#✨getting-started" style="text-decoration: none; font-weight: bold;">✨ Getting Started</a>
<a href="#📖introduction" style="text-decoration: none; font-weight: bold;">📖 Introduction</a>
</p>
<p>
<a href="#🎈citation" style="text-decoration: none; font-weight: bold;">🎈 Citation</a>
<a href="#🌻acknowledgement" style="text-decoration: none; font-weight: bold;">🌻 Acknowledgement</a>
<a href="#📬Contact" style="text-decoration: none; font-weight: bold;">📬 Contact</a>
<a href="#📈star-history" style="text-decoration: none; font-weight: bold;">📈 Star History</a>
</p>
</div>
</div>
## 🎉News
- **[2025/05/29]** 🎉 Ranked **#1** of the day on [Huggingface Daily Papers](https://huggingface.co/papers?date=2025-05-29).
- **[2025/05/29]** Released our Paper on arXiv. See [here](https://arxiv.org/pdf/2505.22617). We provide insights into the entropy mechanism of RL for LLMs and propose two simple yet effective strategies to alleviate the entropy collapse.
## ✨Getting started
After preparing the training data, for training Qwen2.5-7B on a single node, taking the KL-Cov approach as an example, you can simply run:
```bash
cd verl
conda activate your_env
bash recipe/dapo/7b_kl_cov.sh
```
For training Qwen2.5-32B on multiple nodes, you can run the following commands:
```bash
cd verl
conda activate your_env
bash recipe/dapo/32b_kl_cov.sh
```
## 📖Introduction
<div align="left">
<img src="https://github.com/PRIME-RL/Entropy-Mechanism-of-RL/blob/main/figures/e2a.jpg?raw=true" alt="issue" style="width: 96%; height: auto;">
</div>
This paper addresses the entropy collapse issue in scaling reinforcement learning (RL) for large language models (LLMs), where policy entropy drops sharply during training, leading to overconfidence and performance saturation. We empirically establish a relationship between entropy ($H$) and performance ($R$): $R = -a\exp(H) + b$, showing performance is bottlenecked by entropy exhaustion.
<div align="left">
<img src="https://github.com/PRIME-RL/Entropy-Mechanism-of-RL/blob/main/figures/cov.jpg?raw=true" alt="issue" style="width: 96%; height: auto;">
</div>
Theoretically, we find entropy changes are driven by the covariance between action probability and logit updates, which correlates with advantage in Policy Gradient methods. High-probability, high-advantage actions reduce entropy, while rare, high-advantage actions increase it. Empirically, the covariance term remains positive, explaining entropy's monotonic decline. To mitigate this, we propose Clip-Cov and KL-Cov, which restrict updates for high-covariance tokens. These methods effectively prevent entropy collapse and improve performance.
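As a rough illustration of the idea (not the authors' exact implementation; the tensor names and the selection ratio below are made up for this sketch), KL-Cov can be thought of as penalizing only the small fraction of tokens whose covariance between log-probability and advantage is largest:
```python
import torch

def kl_cov_mask(log_probs, advantages, response_mask, k_ratio=0.002):
    """Select the top-k_ratio fraction of valid tokens by a covariance proxy.

    cov_i = (logp_i - mean(logp)) * (A_i - mean(A)), computed over valid tokens.
    This sketches only the token-selection step described in the paper.
    """
    valid = response_mask.bool()
    logp = log_probs[valid]
    adv = advantages[valid]
    cov = (logp - logp.mean()) * (adv - adv.mean())
    k = max(1, int(k_ratio * cov.numel()))
    topk_idx = torch.topk(cov, k).indices
    mask = torch.zeros_like(cov, dtype=torch.bool)
    mask[topk_idx] = True
    # A KL-style penalty between the policy and the (detached) old policy would
    # then be applied only on the selected tokens when forming the final loss.
    return mask
```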
## 📃Evaluation
<div align="left">
<img src="https://github.com/PRIME-RL/Entropy-Mechanism-of-RL/blob/main/figures/performance_fig.jpg?raw=true" alt="issue" style="width: 96%; height: auto;">
</div>
Our method is able to maintain a considerably higher level of entropy throughout training. For example, when the baseline's entropy reaches a plateau and can no longer be consumed, the KL-Cov method still sustains an entropy level over 10 times higher. Meanwhile, the response length of the policy model steadily increases, and its performance on the test set consistently surpasses that of the baseline. This indicates that our model is able to explore more freely during training, learning better policy through RL.
| **Method** | **AIME24** | **AIME25** | **AMC** | **MATH-500** | **OMNI-MATH** | **OlympiadBench** | **Minerva** | **Avg.** |
| ----------------- | ---------: | ---------: | -------: | -----------: | ------------: | ----------------: | ----------: | -------: |
| *Qwen2.5-7B* | | | | | | | | |
| GRPO | 21.2 | 9.6 | 58.7 | 78.8 | 27.9 | 40.7 | 36.7 | 38.6 |
| w. Clip-higher | 18.1 | 11.5 | 56.6 | 79.2 | 29.8 | 43.3 | 40.4 | 38.8 |
| w. **`CLIP-Cov`** | 22.1 | **15.8** | 58.2 | 80.4 | **30.5** | **44.1** | **41.1** | 40.4 |
| w. **`KL-Cov`** | **22.6** | 12.9 | **61.4** | **80.8** | 29.1 | 42.6 | 38.2 | **40.6** |
| *Qwen2.5-32B* | | | | | | | | |
| GRPO | 21.8 | 16.2 | 69.7 | 84.2 | 35.2 | 43.6 | 45.5 | 45.8 |
| w. Clip-higher | 35.6 | 22.3 | 69.5 | 77.2 | 35.1 | 42.5 | 43.0 | 47.2 |
| w. **`CLIP-Cov`** | 32.3 | 22.7 | 67.2 | **87.0** | **42.0** | **57.2** | 46.0 | 50.3 |
| w. **`KL-Cov`** | **36.8** | **30.8** | **74.5** | 84.6 | 39.1 | 49.0 | **46.3** | **52.2** |
Our two approaches both achieve non-trivial improvements across all benchmarks. Compared to GRPO, our method outperforms it by 2.0% on average for the 7B model and by 6.4% for the 32B model. Moreover, we observe that our method yields more substantial gains on the larger Qwen2.5-32B. Specifically, our method achieves improvements of 15.0% and 14.6% compared to GRPO on the most challenging benchmarks, AIME24 and AIME25, respectively.
## 🎈Citation
If you find this paper or repo helpful, please cite us.
```bibtex
@article{cui2025entropy,
title={The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models},
author={Cui, Ganqu and Zhang, Yuchen and Chen, Jiacheng and Yuan, Lifan and Wang, Zhi and Zuo, Yuxin and Li, Haozhan and Fan, Yuchen and Chen, Huayu and Chen, Weize and others},
journal={arXiv preprint arXiv:2505.22617},
year={2025}
}
```
## 🌻Acknowledgement
We implement our reinforcement learning algorithm extending from [verl](https://github.com/volcengine/verl). We utilize [vLLM](https://github.com/vllm-project/vllm) for inference. Our models are trained primarily on [Qwen2.5 family](https://github.com/QwenLM/Qwen2.5). Our training data is built from [DAPO-MATH](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k). Thanks for their great contributions!
## 📬 Contact
For questions, discussion, or collaboration opportunities, feel free to contact:
- Ganqu Cui: cuiganqu@pjlab.org.cn
- Yuchen Zhang: yuchen.zhang2003@gmail.com
- Jiacheng Chen: jackchan9345@gmail.com
- Ning Ding: ningding.cs@gmail.com
# GPG: Group Policy Gradient
Last updated: 07/03/2025.
Group Policy Gradient (GPG) is a minimalist reinforcement learning (RL) method that enhances the reasoning ability of large language models without relying on supervised fine-tuning or complex tricks. GPG revisits traditional policy gradients and directly optimizes the RL objective—no surrogate losses, no KL penalties, no critic, and no reference model. Compared to GRPO, GPG is simpler, more efficient, and achieves better results on many tasks. For more details, please refer to the original paper [GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning](https://arxiv.org/abs/2504.02546).
## Key Components
- Uses a corrected advantage function to improve policy gradient accuracy and training efficiency.
- Eliminates the critic and reference models and avoids KL divergence constraints, which significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO).
## Configuration
To configure GPG within the framework, use the following YAML settings.
```yaml
algorithm:
adv_estimator: gpg
actor_rollout_ref:
actor:
policy_loss:
loss_mode: "gpg"
```
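As a rough sketch (not verl's exact `loss_mode: "gpg"` code; names are illustrative), the GPG objective is a plain REINFORCE-style policy gradient with group-based advantages and no clipping:
```python
import torch

def gpg_policy_loss(log_probs, advantages, response_mask):
    """Vanilla policy-gradient loss: -A * log pi(a|s), token-mean over valid tokens.

    log_probs, advantages, response_mask: (batch, seq_len) tensors. This is only
    an illustrative sketch of the "no surrogate loss, no KL, no critic" objective.
    """
    pg_losses = -advantages * log_probs
    return (pg_losses * response_mask).sum() / response_mask.sum().clamp(min=1)
```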
## Advanced Extensions
GPG is a simple and strong baseline for model reasoning. Although it avoids using KL loss in its original form, you can still use KL loss to further improve the performance.
```yaml
algorithm:
adv_estimator: gpg
actor_rollout_ref:
actor:
use_kl_loss: True # enable kl regularization
kl_loss_coef: 0.01
policy_loss:
loss_mode: "gpg"
```
# Group Relative Policy Optimization (GRPO)
Last updated: 05/31/2025.
In reinforcement learning, classic algorithms like PPO rely on a "critic" model to estimate the value of actions, guiding the learning process. However, training this critic model can be resource-intensive.
GRPO simplifies this process by eliminating the need for a separate critic model. Instead, it operates as follows:
- Group Sampling: For a given problem, the model generates multiple possible solutions, forming a "group" of outputs.
- Reward Assignment: Each solution is evaluated and assigned a reward based on its correctness or quality.
- Baseline Calculation: The average reward of the group serves as a baseline.
- Policy Update: The model updates its parameters by comparing each solution's reward to the group baseline, reinforcing better-than-average solutions and discouraging worse-than-average ones.
This approach reduces computational overhead by avoiding the training of a separate value estimation model, making the learning process more efficient. For more details, refer to the original paper [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/pdf/2402.03300)
## Key Components
- No Value Function (Critic-less): unlike PPO, GRPO does not train a separate value network (critic)
- Group Sampling (Grouped Rollouts): instead of evaluating one rollout per input, GRPO generates multiple completions (responses) from the current policy for each prompt. This set of completions is referred to as a group.
- Relative Rewards: within each group, completions are scored (e.g., based on correctness), and rewards are normalized relative to the group.
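A minimal sketch of the group-relative advantage described above (illustrative only; verl's actual advantage estimator may differ in details):
```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one row per prompt group.

    Each response's advantage is its reward relative to the group mean,
    normalized by the group standard deviation.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 1.0]])
print(grpo_advantages(rewards))
# Row 1: better-than-average responses get positive advantages.
# Row 2: identical rewards give zero advantages for the whole group.
```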
## Configuration
Note that all configs containing `micro_batch_size` are used to configure the maximum sample or token count per forward or backward pass to avoid GPU OOMs, whose value should not change algorithmic/convergence behavior.
Although many configuration names start with the `ppo_` prefix, they work across different RL algorithms in verl, as the GRPO training loop is similar to that of PPO (without the critic).
![image](https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d)
- `actor_rollout_ref.rollout.n`: For each prompt, sample n times. Defaults to 1. For GRPO, please set it to a value larger than 1 for group sampling.
- `data.train_batch_size`: The global batch size of prompts used to generate a set of sampled trajectories/rollouts. The number of responses/trajectories is `data.train_batch_size * actor_rollout_ref.rollout.n`
- `actor_rollout_ref.actor.ppo_mini_batch_size`: The set of sampled trajectories is split into multiple mini-batches with batch_size=ppo_mini_batch_size for PPO actor updates. The ppo_mini_batch_size is a global size across all workers.
- `actor_rollout_ref.actor.ppo_epochs`: Number of epochs for GRPO updates on one set of sampled trajectories for the actor
- `actor_rollout_ref.actor.clip_ratio`: The GRPO clip range. Defaults to 0.2
- `algorithm.adv_estimator`: Defaults to gae. Please set it to grpo instead
- `actor_rollout_ref.actor.loss_agg_mode`: Defaults to "token-mean". Options include "token-mean", "seq-mean-token-sum", "seq-mean-token-mean". The original GRPO paper takes the sample-level loss (seq-mean-token-mean), which may be unstable in long-CoT scenarios. All GRPO example scripts provided in verl use the default configuration "token-mean" for loss aggregation instead.
Instead of adding KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss:
- `actor_rollout_ref.actor.use_kl_loss`: To use kl loss in the actor. When used, we are not applying KL in the reward function. Default is False. Please set it to True for GRPO.
- `actor_rollout_ref.actor.kl_loss_coef`: The coefficient of kl loss. Default is 0.001.
- `actor_rollout_ref.actor.kl_loss_type`: Support kl(k1), abs, mse(k2), low_var_kl(k3) and full. How to calculate the kl divergence between actor and reference policy. See this blog post for detailed analysis: http://joschu.net/blog/kl-approx.html
## Advanced Extensions
### DrGRPO
[Understanding R1-Zero-Like Training: A Critical Perspective](https://arxiv.org/pdf/2503.20783) claims there's optimization bias in GRPO, which leads to artificially longer responses, especially for incorrect outputs. This inefficiency stems from the way GRPO calculates advantages using group-based reward normalization. Instead, DrGRPO aggregates token-level losses by normalizing with a global constant to eliminate length bias.
Configure the following to enable DrGRPO, with all other parameters the same as GRPO's:
- `actor_rollout_ref.actor.loss_agg_mode`: "seq-mean-token-sum-norm", which turns off seq-dim averaging
- `actor_rollout_ref.actor.use_kl_loss`: Please set it to False for DrGRPO
- `algorithm.norm_adv_by_std_in_grpo`: False, which turns off standard deviation norm
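A minimal sketch of what the `seq-mean-token-sum-norm` aggregation does, under the assumption that it normalizes the summed token losses by a fixed, length-independent constant (here the padded sequence length) instead of each sequence's own length; see verl's loss-aggregation code for the authoritative version:
```python
import torch

def seq_mean_token_sum_norm(loss_mat, loss_mask):
    """Sum token losses per sequence, then divide by a length-independent constant.

    Using a fixed divisor removes the per-sequence length normalization that
    Dr.GRPO identifies as a source of length bias.
    """
    seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)  # token-sum per sequence
    return torch.sum(seq_losses) / loss_mask.shape[-1]    # normalize by a constant
```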
## Reference Example
Qwen2.5 GRPO training log and commands: [link](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/qwen2-7b-fsdp2.log)
```bash
bash examples/grpo_trainer/run_qwen3-8b.sh
```
For more reference performance, please see https://verl.readthedocs.io/en/latest/algo/baseline.html
# On-Policy RL with Optimal Reward Baseline (OPO)
Last updated: 06/02/2025.
Loose on-policy constraints and suboptimal baselines in reinforcement learning often lead to training instability such as large policy shifts and entropy collapse. OPO addresses these challenges by using exact on-policy training with the theoretically optimal reward baseline for advantage estimation. It achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses.
OPO uses group sampling to generate multiple outputs for each input like GRPO. Unlike group-based algorithms which typically use the mean reward of a group as its baseline, OPO employs a theoretically optimal baseline: the length-weighted reward of the group. It also omits the standard deviation normalization. By adopting these two key components, OPO enables the training of a single policy model with the objective of maximizing only the expected reward. For more details, refer to the original paper [On-Policy RL with Optimal Reward Baseline](https://arxiv.org/pdf/2505.23585).
## Key Components
- Exact On-Policy Training: always generates responses from the current policy, without using any pre-generated data or off-policy data.
- Optimal Reward Baseline: uses a length-weighted reward of the group as the baseline for normalizing the rewards.
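A minimal sketch of the length-weighted baseline (illustrative names; the actual estimator in verl may differ in details):
```python
import torch

def opo_advantages(rewards: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """rewards, lengths: (group_size,) tensors for one prompt's group of responses.

    The baseline is the length-weighted mean reward of the group; advantages are
    rewards minus this baseline, with no standard-deviation normalization.
    """
    baseline = (lengths * rewards).sum() / lengths.sum()
    return rewards - baseline

rewards = torch.tensor([1.0, 0.0, 1.0])
lengths = torch.tensor([100.0, 400.0, 300.0])
print(opo_advantages(rewards, lengths))  # baseline = 400/800 = 0.5 -> tensor([ 0.5, -0.5,  0.5])
```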
## Configuration
To configure OPO within the framework, use the following YAML settings. These parameters are crucial for enabling exact on-policy training and activating the optimal reward baseline.
```yaml
algorithm:
adv_estimator: opo # Use OPO for optimal reward baseline
data:
train_batch_size: 1024
actor_rollout_ref:
actor:
ppo_mini_batch_size: 1024 # ppo_mini_batch_size should equal to train_batch_size to enable exact on-policy training
entropy_coeff: 0 # disable entropy regularization
use_kl_loss: False # disable kl regularization
kl_loss_coef: 0
```
## Advanced Extensions
OPO can also be extended to other algorithms like RLOO and Reinforce++: simply adjust their configurations to enable exact on-policy training and incorporate the optimal length-weighted reward baseline, with minimal modifications to their advantage estimation functions.
# Proximal Policy Optimization (PPO)
Last updated: 06/19/2025.
Proximal Policy Optimization (PPO) is a family of policy gradient methods for reinforcement learning, proposed by OpenAI in 2017. PPO strikes a balance between simplicity, stability, and performance, making it one of the most widely used algorithms in modern RL applications, including large-scale language model fine-tuning.
Traditional policy gradient methods like REINFORCE or Vanilla Policy Gradient suffer from:
- High variance and sample inefficiency.
- Instability due to large policy updates.
PPO addresses this problem using a clipped surrogate objective that avoids overly large updates without requiring second-order derivatives.
For more technical details regarding PPO, we suggest reading the introduction in the [OpenAI spinning up tutorial](https://spinningup.openai.com/en/latest/algorithms/ppo.html), and the paper [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347).
## Key Components
- Actor-Critic Architecture: PPO requires both an actor model (policy) and a critic model (value function). This differs from other algorithms like GRPO and RLOO that don't require a critic model.
- Generalized Advantage Estimation (GAE): PPO uses GAE for computing advantage values, which helps reduce variance in policy gradient estimates while maintaining low bias.
- Clipped Surrogate Objective: The core of PPO is implemented through the clipped surrogate objective function that limits policy updates.
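For reference, the standard form of the clipped surrogate loss (a generic textbook sketch, not copied from verl's source):
```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_ratio=0.2):
    """Clipped surrogate objective from the PPO paper (loss = negative objective)."""
    ratio = torch.exp(log_probs - old_log_probs)
    pg_losses1 = -advantages * ratio
    pg_losses2 = -advantages * torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    return torch.maximum(pg_losses1, pg_losses2).mean()
```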
## Configuration
Note that all configs containing `micro_batch_size` are used to configure the maximum sample or token count per forward or backward pass to avoid GPU OOMs, whose value should not change algorithmic/convergence behavior.
Most critic configs are similar to those of actors. Note that the critic model is omitted from the figure below.
![image](https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d)
- `data.train_batch_size`: The global batch size of prompts used to generate a set of sampled trajectories/rollouts. The number of responses/trajectories is `data.train_batch_size * actor_rollout_ref.rollout.n`
- `actor_rollout_ref.actor.ppo_mini_batch_size`: The set of sampled trajectories is split into multiple mini-batches with batch_size=ppo_mini_batch_size for PPO actor updates. The ppo_mini_batch_size is a global size across all workers
- `critic.ppo_mini_batch_size`: The set of sampled trajectories is split into multiple mini-batches with batch_size=ppo_mini_batch_size for PPO critic updates. The ppo_mini_batch_size is a global size across all workers
- `actor_rollout_ref.actor.clip_ratio`: The PPO clip range. Default to 0.2
- `actor_rollout_ref.actor.ppo_epochs`: Number of epochs for PPO updates on one set of sampled trajectories for actor
- `critic.ppo_epochs`: Number of epochs for PPO updates on one set of sampled trajectories for critic. Defaults to `actor_rollout_ref.actor.ppo_epochs`
- `algorithm.gamma`: The discount factor
- `algorithm.lam`: The lambda term that trades off between bias and variance in the GAE estimator
- `algorithm.adv_estimator`: Support gae, grpo, reinforce_plus_plus, reinforce_plus_plus_baseline, rloo
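For intuition, here is a minimal sketch of how GAE combines `algorithm.gamma` and `algorithm.lam` (the generic single-trajectory recursion, not verl's batched implementation):
```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """rewards, values: (T,) tensors for one trajectory; the value after the last step is taken as 0."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        last_gae = delta + gamma * lam * last_gae             # exponentially weighted sum
        advantages[t] = last_gae
    returns = advantages + values                             # targets for the critic
    return advantages, returns
```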
## Advanced Extensions
### KL Divergence Control
Options to prevent the policy from diverging too far from a reference policy. Two mechanisms are available: KL reward penalty and KL loss. For more technical details, see [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
Options to use KL loss for KL divergence control:
- `actor_rollout_ref.actor.use_kl_loss`: to use kl loss in the actor. When used, we are not applying KL in the reward function. Default is False
- `actor_rollout_ref.actor.kl_loss_coef`: The coefficient of kl loss. Default is 0.001.
- `actor_rollout_ref.actor.kl_loss_type`: Support kl(k1), abs, mse(k2), low_var_kl(k3) and full. How to calculate the kl divergence between actor and reference policy. See this blog post for detailed analysis: http://joschu.net/blog/kl-approx.html
Options to use KL penalty in the reward:
- `algorithm.use_kl_in_reward`: Whether to enable in-reward kl penalty. Default is False.
- `algorithm.kl_penalty`: Support kl(k1), abs, mse(k2), low_var_kl(k3) and full. This defines the way to calculate the kl divergence between actor and reference policy. For specific options, refer to `kl_penalty` in core_algos.py. See this blog post for detailed analysis: http://joschu.net/blog/kl-approx.html
- `algorithm.kl_ctrl.kl_coef`: The (initial) coefficient of in-reward kl_penalty. Default is 0.001.
- `algorithm.kl_ctrl.type`: 'fixed' for FixedKLController and 'adaptive' for AdaptiveKLController.
- `algorithm.kl_ctrl.horizon`: See source code of AdaptiveKLController for details.
- `algorithm.kl_ctrl.target_kl`: See source code of AdaptiveKLController for details.
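The adaptive controller typically follows the proportional scheme from Ziegler et al. (2019), adjusting the coefficient so the measured KL tracks a target. The following is a generic sketch with illustrative names, not verl's exact `AdaptiveKLController`:
```python
class AdaptiveKLCoef:
    """Adjust the KL coefficient so the measured KL stays near a target value."""

    def __init__(self, init_kl_coef=0.001, target_kl=0.1, horizon=10000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to keep the multiplicative update gentle.
        error = min(max(current_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef
```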
### Dual-clip PPO
Dual-Clip PPO introduces an additional bound that takes effect when the advantage is less than zero: when a very large ratio is multiplied by the negative advantage, the objective is clipped so that it does not fall below a specified lower bound.
![image](https://github.com/user-attachments/assets/fc232181-d8b0-4307-8dd2-4dc0a4c1c139)
- `actor_rollout_ref.actor.clip_ratio_c`: lower bound of the value for Dual-clip PPO, defaults to 3.0
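A sketch of the dual-clip rule in loss form (generic form from the Dual-Clip PPO paper; verl's implementation may differ in details):
```python
import torch

def dual_clip_ppo_loss(ratio, advantages, clip_ratio=0.2, clip_ratio_c=3.0):
    pg_losses1 = -advantages * ratio
    pg_losses2 = -advantages * torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    clip_pg_losses = torch.maximum(pg_losses1, pg_losses2)  # standard PPO clipping
    pg_losses3 = -advantages * clip_ratio_c                  # extra bound, relevant when A < 0
    dual_clipped = torch.minimum(clip_pg_losses, pg_losses3)
    return torch.where(advantages < 0, dual_clipped, clip_pg_losses).mean()
```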
## Reference Example
Qwen2.5 training log and commands: [link](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log)
```bash
bash run_gemma.sh \
trainer.n_gpus_per_node=1 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
trainer.logger=console \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
data.train_batch_size=256 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size=2 \
critic.ppo_micro_batch_size=2
```
Reference performance with verl v0.2:
| Model | Method | Score | Link |
|-------------------------------|------------------|-------|------------------------------------------------------------------------------------------------|
| Qwen/Qwen2.5-0.5B-Instruct | pretrained model | 36.4 | [Qwen Blog](https://qwenlm.github.io/blog/qwen2.5-llm/) |
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | [PPO Command and Logs](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log) |
# Recipe: Self-Play Fine-Tuning (SPIN)
Last updated: 05/31/2025.
`verl` provides a recipe inspired by the paper **"Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models"** (SPIN). SPIN is a language model finetuning algorithm that enables iterative self-improvement through a self-play mechanism inspired by game theory.
**Core Idea:** Models learn by playing against themselves, reducing reliance on external preference datasets or stronger teacher models:
1. **Synthetic Data Generation:** The current model generates responses, creating its own training data from previous iterations.
2. **Two-Player Game Setup:** A game involving two players, both roles played by a single LLM.
3. **Iterative Training:** The model progressively improves by refining its policy, with each iteration's model becoming the opponent for the next iteration.
Paper Authors: [Zixiang Chen](https://github.com/uclaml/SPIN)\*, [Yihe Deng](https://github.com/uclaml/SPIN)\*, [Huizhuo Yuan](https://scholar.google.com/citations?user=8foZzX4AAAAJ)\*, [Kaixuan Ji](https://scholar.google.com/citations?user=FOoKDukAAAAJ), [Quanquan Gu](https://web.cs.ucla.edu/~qgu/)
[[Webpage](https://uclaml.github.io/SPIN/)] [[Huggingface](https://huggingface.co/papers/2401.01335)] [[Paper](https://arxiv.org/abs/2401.01335)] [[Original Implementation](https://github.com/uclaml/SPIN)]
verl Implementation Authors: [Chendong Wang](https://cdwang96.github.io/), [Chenyang Zhao](https://github.com/zhaochenyang20)
---
## Key Function (compute_online_dpo_loss) and Related works
SPIN (Chen et al., 2024) proposes an iterative self-play mechanism to fine-tune language models. In each iteration, SPIN's training objective, when using a logistic loss function, is equivalent to Direct Preference Optimization (DPO) loss (Rafailov et al., 2023).
This `verl` recipe realizes SPIN's core concept by using DPO loss iteratively (Xu et al., 2023; Xiong et al., 2023; Snorkel AI, 2024). This means that in each iteration, we fine-tune the LLM using DPO loss for preference optimization. Notably, Xu et al. (2023) explored iterative preference optimization with pairwise cringe loss, while Xiong et al. (2023) discussed how to bridge theory and practice for RLHF under KL constraints using iterative training. The concept of iterative preference learning was also explored in online DPO (Guo et al., 2024), which focuses on direct alignment from online AI feedback. In online DPO, preference data is dynamically updated during training, allowing the model to learn from its own generated data.
Specifically, we developed the **`compute_online_dpo_loss`** function and built this SPIN recipe on top of it. By incorporating online preference generation, this approach enables continuously refining language models without relying on fixed external preference datasets.
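For orientation, the DPO loss that `compute_online_dpo_loss` is built around has the following standard form (a generic sketch with illustrative names; see `core_algos.py` for the recipe's actual implementation):
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (Rafailov et al., 2023) on one batch of preference pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```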
**Reference Papers:**
* [Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models](https://arxiv.org/abs/2401.01335) (Chen et al., 2024)
* [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290) (Rafailov et al., 2023)
* [Somethings are more cringe than others: Preference optimization with the pairwise cringe loss](https://arxiv.org/abs/2312.16682) (Xu et al., 2023)
* [Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint](https://arxiv.org/abs/2312.11456) (Xiong et al., 2023)
* [Snorkel-Mistral-PairRM-DPO](https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO) (Snorkel AI, 2024)
* [Direct language model alignment from online ai feedback](https://arxiv.org/abs/2402.04792) (Guo et al., 2024)
## Our Online DPO Implementation
Our `compute_online_dpo_loss` function adapts `verl`'s existing PPO infrastructure (based on `verl` v0.3.0.post1) for this iterative online DPO. Key aspects of our implementation include:
* **No Critic:** Unlike PPO, we omit the value function critic.
* **Dynamic Reference Model:** An explicit reference policy (`ref_policy_wg`) is used for DPO loss. This reference model's weights can be periodically updated from the actor (`ref_update_freq`), providing a dynamic baseline.
* **Online Preference Generation:** The `compute_onlineDPO_pref` function (in `core_algos.py`) dynamically creates chosen/rejected pairs based on a reward source (e.g., rule-based ranking for math problems).
* **DPO Loss Integration:** We replace PPO's policy loss with our `compute_online_dpo_loss` (in `core_algos.py`) within the actor update (`dp_actor.py`), directly optimizing the policy using the generated preferences.
* **Iterative Training Orchestration:** The `SpinTrainer` (in `spin_trainer.py`) manages the entire self-play loop: generation, preference labeling, optional reference model updates, and policy updates, enabling continuous self-improvement aligned with SPIN's principles.
---
## Algorithm
This recipe implements an online DPO algorithm adapted to the `verl` reinforcement learning framework, providing an alternative to PPO for fine-tuning language models.
**Online Loop:** Instead of maximizing a scalar reward signal in PPO, this approach directly optimizes the policy model to align with preference data generated *online* during training:
1. **Generation:** The current model generates multiple responses for each prompt in a batch.
2. **Preference Labeling:** A function evaluates these generated responses to determine which one is preferred (chosen) and which is dispreferred (rejected). This can be done using a reward function or implicit ranking based on specific rules. (In this recipe, we use rule-based ranking on the math problem).
3. **Update:** This preference tuple (`prompt`, `chosen_response`, `rejected_response`) is used to update the actor model using `compute_online_dpo_loss`, comparing against a reference model.
**Connection with SPIN:**
Instead of using only a fixed target data distribution, the online loop dynamically changes the target data distribution through the preference-labeling method in step 2 (in this recipe, rule-based ranking on math problems that selects the better response). This explores the direction mentioned in Section 7 of the SPIN paper about "dynamically changing target data distribution" to potentially elevate LLM performance beyond the ceiling of fixed human-annotated data.
---
## Reproduce the Experiment (Example Setup)
The following steps outline how to set up the environment and run the SPIN recipe, based on the provided test log using GSM8K and Qwen2.5-3B-Instruct.
1. **Setup Environment (Example using Docker):**
```bash
# Start a container with GPU access and shared memory
docker run -it --name spin_test --gpus all \
--shm-size=32g \
--ipc=host \
-v /path/to/host/.cache:/root/.cache \
-e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
lmsysorg/sglang:latest \
/bin/bash
# Inside the container or on your host machine:
# Ensure /tmp is writable
mkdir -p /tmp
chmod 1777 /tmp
# Install Python 3.10 (if not present) and venv
sudo apt update
sudo apt install -y python3.10 python3.10-venv tmux
python3 -m ensurepip --upgrade
# Create and activate a virtual environment
python3 -m venv ~/.python/spin_env
source ~/.python/spin_env/bin/activate
# Install uv (fast package installer)
python3 -m pip install uv
```
2. **Install verl and Dependencies:**
```bash
# Clone the verl repository and checkout the spin branch
cd ~
git clone git@github.com:volcengine/verl.git && cd verl
# Install flash-attn (handle potential build issues)
python3 -m uv pip install wheel packaging
python3 -m uv pip install flash-attn --no-build-isolation --no-deps
# Install verl with sglang extras
python3 -m uv pip install -e ".[sglang]"
```
*Note: If `flash-attn` installation fails, try the manual steps again or consult its documentation.*
3. **Login & Download Data/Model:**
```bash
# Login to Weights & Biases (optional, for logging)
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
# wandb login
# Download the GSM8K dataset
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k # Adjusted path
# Download the base model (Example: Qwen2.5-3B-Instruct)
huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct
```
4. **Configure:**
* Modify the configuration file (e.g., `config/spin_trainer.yaml` or the one specified in the run script) with correct paths to your downloaded model, data, desired hyperparameters (`dpo_beta`, learning rate, etc.), and distributed training settings (nodes, GPUs per node).
* Pay attention to `actor_rollout_ref.model_path`, `data` paths, `reward_model` config (if using one), and `trainer.ref_update_freq`.
5. **Run Training:**
```bash
# Set CUDA visible devices (adjust based on your hardware and config)
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Launch the training script (e.g., test.sh or a custom script)
# Ensure test.sh points to the correct config and main script
bash recipe/spin/run_spin.sh
```
---
## Configuration
* The primary configuration is typically managed through a YAML file specified in the launch script (e.g., `config/spin_trainer.yaml`).
* Key configuration sections:
* `data`: Paths to training/validation prompt files, batch sizes, sequence lengths.
* `actor_rollout_ref`: Paths to the base model (used for actor and initial reference), FSDP settings, optimization parameters (learning rate, scheduler).
* `reward_model`: Configuration for the reward model used for online preference labeling (path, batch size, etc.). Can be omitted if using a simpler reward function.
* `algorithm`: DPO-specific hyperparameters like `dpo_beta`, `dpo_loss_type`.
* `trainer`: Distributed training settings (nodes, GPUs per node), logging (WandB), checkpointing frequency, and `ref_update_freq` (set > 0 to enable periodic reference model updates from the actor).
---
## Key Files
* `main_spin.py`: Main entry point using Hydra to load the config and launch the `SpinTrainer`.
* `spin_trainer.py`: Defines the `SpinTrainer` class, orchestrating the Online DPO training loop.
* `fsdp_workers.py`: Implements Ray workers (Actor, Reference) potentially using FSDP.
* `dp_actor.py`: Contains the actor class, including the DPO policy update logic.
* `core_algos.py`: Includes helper functions for `compute_online_dpo_loss` and `compute_onlineDPO_pref`.
* `config/spin_trainer.yaml` (or similar): Main Hydra configuration file for the recipe.
* `run_spin.sh` (or similar): Example bash script for launching a training run.
* `README.md`: This file.
---
## Acknowledgement
We sincerely thank the contribution and guidance from the `verl` community and advisors, including (adapted from SPPO):
* [Zixiang Chen](https://sites.google.com/view/zxchen)
* [Yuhao Yang](https://github.com/yhyang201)
* [Yifan Zhang](https://github.com/yifanzhang-pro)
* [Yongan Xiang](https://github.com/BearBiscuit05)
* [Junrong Lin](https://github.com/ocss884)
* [Yuxuan Tong](https://github.com/tongyx361)
* [Guangming Shen](https://github.com/PeterSH6)
* [Biao He](https://www.linkedin.com/in/biao-he/)
* [Qingquan Song](https://qingquansong.github.io/)
* [Chenyang Zhao](https://zhaochenyang20.github.io/Chayenne/)
* [Quanquan Gu](https://web.cs.ucla.edu/~qgu/)
# Recipe: Self-Play Preference Optimization (SPPO)
Last updated: 05/28/2025.
verl provides a community recipe implementation for the paper [Self-Play Preference Optimization for Language Model Alignment](https://arxiv.org/abs/2405.00675). SPPO can significantly enhance the performance of an LLM without strong external signals such as responses or preferences from GPT-4. It can outperform the model trained with iterative direct preference optimization (DPO), among other methods. SPPO is theoretically grounded, ensuring that the LLM can converge to the von Neumann winner (i.e., Nash equilibrium) under general, potentially intransitive preference, and empirically validated through extensive evaluations on multiple datasets.
Paper Authors: [Yue Wu](https://yuewu.us/)\*, [Zhiqing Sun](https://www.cs.cmu.edu/~zhiqings/)\*, [Huizhuo Yuan](https://scholar.google.com/citations?user=8foZzX4AAAAJ)\*, [Kaixuan Ji](https://scholar.google.com/citations?user=FOoKDukAAAAJ), [Yiming Yang](https://www.cs.cmu.edu/~yiming/), [Quanquan Gu](https://web.cs.ucla.edu/~qgu/)
verl Implementation Authors: [Yuhao Yang](https://github.com/yhyang201), [Chenyang Zhao](https://github.com/zhaochenyang20)
[[Webpage](https://uclaml.github.io/SPPO/)] [[Huggingface](https://huggingface.co/papers/2405.00675)] [[Paper](https://arxiv.org/abs/2405.00675)][[Original Implementation](https://github.com/uclaml/SPPO)]
## Reproduce the Experiment
We evaluate the performance of SPPO on the MATH dataset. Starting from an initial score of 46.6 with Qwen2.5-7B-Instruct, we achieve a score of 65.6 after 20 epochs of training, placing our model approximately in the top 20 on the [MATH leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math). It's important to note that verl's internal evaluation metrics may not perfectly align with the official evaluation methodology for Qwen2.5-7B-Instruct. Therefore, for consistency and fair comparison, we report only the results based on verl's evaluation framework.
```bash
git clone git@github.com:volcengine/verl.git
cd verl
python3 -m uv pip install -e ".[sglang]"
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
python3 examples/data_preprocess/math_dataset.py --local_dir ~/data/math
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir $HOME/models/Qwen2.5-7B-Instruct
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash recipe/sppo/run_qwen2.5-7b_rm.sh
```
Note that the installation may occasionally fail to install flash-attn. If this happens, you can install it manually by running:
```bash
python3 -m uv pip install wheel
python3 -m uv pip install packaging
python3 -m uv pip install flash-attn --no-build-isolation --no-deps
```
## Acknowledgement
We sincerely thank the contribution and guidance from:
- [Yue Wu](https://yuewu.us/)
- [Chendong Wang](https://cdwang96.github.io/)
- [Yifan Zhang](https://github.com/yifanzhang-pro)
- [Yongan Xiang](https://github.com/BearBiscuit05)
- [Junrong Lin](https://github.com/ocss884)
- [Yuxuan Tong](https://github.com/tongyx361)
- [Guangming Shen](https://github.com/PeterSH6)
- [Biao He](https://www.linkedin.com/in/biao-he/)
- [Qingquan Song](https://qingquansong.github.io/)
- [Quanquan Gu](https://web.cs.ucla.edu/~qgu/)
Getting started with AMD (ROCm Kernel)
=====================================================
Last updated: 07/06/2025.
Author: `Yusheng Su <https://yushengsu-thu.github.io/>`_
Setup
-----
If you run on AMD GPUs (MI300) with the ROCm platform, you cannot use the standard quickstart to run verl. Follow the steps below to build a Docker image, and set ``RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES`` or ``RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`` when starting Ray for verl's RLHF training.
docker/Dockerfile.rocm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
FROM "rlfoundation.azurecr.io/rocm6.3.4:vllm-0.8.5-numa-patch-ubuntu-22.04"
SHELL ["/bin/bash", "-ceuxo", "pipefail"]
ENV MAX_JOBS=512
ENV PATH="/usr/local/python3.12/bin:$PATH"
RUN ln -sf /usr/bin/python3.12 /usr/bin/python && \
ln -sf /usr/bin/pip3.12 /usr/bin/pip
############################################
RUN apt-get update
RUN apt-get install -y pkg-config liblzma-dev
############################################
###########################################
##########Install TransformerEngine########
###########################################
WORKDIR /workspace/
# transformer-engine install
# https://github.com/ROCm/TransformerEngine
RUN rm -rf TransformerEngine
RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git
WORKDIR /workspace/TransformerEngine
RUN git checkout 236178e5
# git checkout bb061ade
# git checkout 864405c
ENV NVTE_FRAMEWORK=pytorch
ENV NVTE_ROCM_ARCH=gfx942
ENV NVTE_USE_HIPBLASLT=1
ENV NVTE_USE_ROCM=1
# export CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr:${CMAKE_PREFIX_PATH:-}"
ENV CMAKE_PREFIX_PATH="/opt/rocm:/opt/rocm/hip:/usr/local:/usr"
RUN MAX_JOBS=${MAX_JOBS} pip install . -vvv
WORKDIR /workspace/
###########################################
###########################################
###########################################
####################################################################################
################Install vllm - sglang requires vllm 0.6.7 dependency#################
####################################################################################
#### Require vllm 0.6.7 - checkout 113274a0
WORKDIR /workspace/
RUN rm -rf vllm
RUN pip uninstall -y vllm
# Refer to here (down-grade vllm to 0.6.3): https://docs.vllm.ai/en/v0.6.3/getting_started/amd-installation.html
RUN git clone https://github.com/ROCm/vllm.git
# git clone https://github.com/vllm-project/vllm.git
WORKDIR /workspace/vllm
RUN git checkout 113274a0
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
#ENV MAX_JOBS=512
ENV MAX_JOBS=${MAX_JOBS}
RUN pip install "boto3>=1.26.0"
RUN pip install setuptools_scm
# will add src into py. You can delete the repo
RUN python3 setup.py install
WORKDIR /workspace/
####################################################################################
####################################################################################
####################################################################################
###########################################
############For hack docker################
###########################################
RUN pip install setuptools==75.8.0
###########################################
###########################################
###########################################
###########################################
############build sglang###################
###########################################
# Set environment variables
ENV BASE_DIR=/sgl-workspace
ENV BUILD_TYPE=all
ENV SGL_REPO=https://github.com/sgl-project/sglang
ENV SGL_BRANCH=v0.4.6.post5
ENV TRITON_REPO=https://github.com/ROCm/triton.git
ENV TRITON_COMMIT=improve_fa_decode_3.0.0
ENV AITER_REPO=https://github.com/ROCm/aiter.git
ENV AITER_COMMIT=v0.1.2
# v0.1.2 version - commit id: 9d11f47
# ENV AITER_COMMIT=9d11f47
ENV HIP_FORCE_DEV_KERNARG=1
ENV HSA_NO_SCRATCH_RECLAIM=1
ENV SGLANG_SET_CPU_AFFINITY=1
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
ENV NCCL_MIN_NCHANNELS=112
ENV MOE_PADDING=1
ENV VLLM_FP8_PADDING=1
ENV VLLM_FP8_ACT_PADDING=1
ENV VLLM_FP8_WEIGHT_PADDING=1
ENV VLLM_FP8_REDUCE_CONV=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
ENV AMDGPU_TARGETS=gfx942
ENV ROCM_ARCH=gfx942
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
# Switch to working directory
WORKDIR /sgl-workspace
# Clean and create directory
RUN rm -rf /sgl-workspace && mkdir -p /sgl-workspace
# Clone and build sglang
RUN git clone ${SGL_REPO} \
&& cd sglang \
&& git checkout ${SGL_BRANCH} || echo "Using default branch" \
&& cd sgl-kernel \
&& rm -f pyproject.toml \
&& mv pyproject_rocm.toml pyproject.toml \
&& python setup_rocm.py install \
&& cd .. \
&& if [ "$BUILD_TYPE" = "srt" ]; then \
python -m pip --no-cache-dir install -e "python[srt_hip]"; \
else \
python -m pip --no-cache-dir install -e "python[all_hip]"; \
fi \
&& cd /sgl-workspace \
&& cp -r /sgl-workspace/sglang /sglang \
&& python -m pip cache purge
# Install common Python packages
RUN pip install IPython orjson python-multipart torchao pybind11
# Rebuild Triton
RUN pip uninstall -y triton || true \
&& git clone ${TRITON_REPO} \
&& cd triton \
&& git checkout ${TRITON_COMMIT} \
&& cd python \
&& python3 setup.py install \
&& cd /sgl-workspace
# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942 --amdgpu-lower-module-lds-strategy=1"
# ENV HIPCC_COMPILE_FLAGS_APPEND="--offload-arch=gfx942"
# Build aiter
#version: Commit 9d11f47
# && git checkout ${AITER_COMMIT} \
RUN pip uninstall -y aiter || true
RUN git clone ${AITER_REPO} \
&& cd aiter \
&& git checkout ${AITER_COMMIT} \
&& git submodule sync \
&& git submodule update --init --recursive \
&& PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py install \
&& cd /sgl-workspace
# Copy MI300X config
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
-type f -name '*MI300X*' | \
xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
# Environment setup complete.
RUN echo "Environment setup complete."
WORKDIR /workspace/
###########################################
###########################################
###########################################
###########################################
###############vllm v0.8.5#################
###########################################
WORKDIR /workspace/
ENV VLLM_TARGET_DEVICE=rocm
ENV ROCM_PATH=/opt/rocm
ENV SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev
# Find the repo path in: DockerFile/Dockerfile.rocm_yang
# RUN git clone https://github.com/RLFoundation/vllm-patch.git
RUN pip uninstall -y vllm || true
RUN rm -rf vllm-patch
RUN git clone https://github.com/RLFoundation/vllm-patch.git \
&& cd vllm-patch \
&& git checkout v0.8.5-sleep-numa \
&& rm -rf build/ dist/ *.egg-info \
&& ln -sf /opt/rocm/lib/libamdhip64.so /usr/lib/libamdhip64.so \
&& SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py install
# RUN SETUPTOOLS_SCM_PRETEND_VERSION=0.8.5.dev PYTORCH_ROCM_ARCH="gfx90a;gfx942" MAX_JOBS=${MAX_JOBS} python3 setup.py develop
WORKDIR /workspace/
###########################################
###########################################
###########################################
#########################################
#### Install megatron-core###############
#########################################
RUN pip uninstall -y megatron-core && \
git clone https://github.com/yushengsu-thu/Megatron-LM-amd_version.git && \
cd Megatron-LM-amd_version && \
pip install -vvv -e . && \
cd /workspace/
#########################################
#########################################
#########################################
#######################################
################apex###################
#######################################
WORKDIR /workspace/
RUN pip uninstall -y apex && \
git clone https://github.com/ROCm/apex.git && \
cd apex && \
python setup.py install && \
cd /workspace/
#######################################
#######################################
#######################################
################################################################################
###########################Add torch_memory_saver###############################
################################################################################
# Set environment variables
ENV HIPCC_COMPILE_FLAGS_APPEND="--amdgpu-target=gfx90a;gfx942 -D__HIP_PLATFORM_AMD__"
ENV CFLAGS="-D__HIP_PLATFORM_AMD__"
ENV CXXFLAGS="-D__HIP_PLATFORM_AMD__"
RUN pip install "git+https://github.com/YangWang92/torch_memory_saver_numa.git@numa"
################################################################################
################################################################################
################################################################################
########################################
######Install ray#######################
########################################
# need to add this patch: https://github.com/ray-project/ray/pull/53531/files
RUN pip uninstall ray -y
RUN pip install "ray[data,train,tune,serve]>=2.47.0"
########################################
########################################
########################################
##########################################
#######Install other dependencies#########
##########################################
RUN pip install "tensordict==0.6.2" --no-deps && \
pip install accelerate \
codetiming \
datasets \
dill \
hydra-core \
liger-kernel \
numpy \
pandas \
peft \
"pyarrow>=15.0.0" \
pylatexenc \
torchdata \
wandb \
orjson \
pybind11
WORKDIR /workspace/
RUN git clone https://github.com/volcengine/verl.git && \
cd verl && \
pip install -e .
##########################################
##########################################
##########################################
WORKDIR /workspace/
CMD ["/usr/bin/bash"]
Build the image:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
docker build -f docker/Dockerfile.rocm -t verl-rocm .
Note: Instead of building the image yourself, you can pull a pre-built Docker image from DockerHub: `RLSys Foundation <https://hub.docker.com/u/yushengsuthu>`_
Pull the image:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
docker pull yushengsuthu/verl:verl-0.4.1_ubuntu-22.04_rocm6.3.4-numa-patch_vllm0.8.5_sglang0.4.6.post4
docker tag yushengsuthu/verl:verl-0.4.1_ubuntu-22.04_rocm6.3.4-numa-patch_vllm0.8.5_sglang0.4.6.post4 verl-rocm:latest
Run the container
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Optional: Running without root and with user permissions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
docker run --rm -it \
--device /dev/dri \
--device /dev/kfd \
-p 8265:8265 \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME/.ssh:/root/.ssh \
-v $HOME:$HOME \
--shm-size 128G \
-w $PWD \
verl-rocm \
/bin/bash
(Optional): If you do not want to run the container as root and prefer to run as your own user,
please add ``-e HOST_UID=$(id -u)`` and ``-e HOST_GID=$(id -g)`` to the above docker launch command.
Example
-------
Due to a special setting in AMD (ROCm) torch:
1. If your ``ray>=2.45.0`` (default), you need to set ``RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`` when starting Ray in verl's RLHF training and add this `patch <https://github.com/ray-project/ray/pull/53531/files>`_.
2. If your ``ray<2.45.0``, you need to set ``RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES`` when starting ray in verl's RLHF training.
The inference engine ``$ENGINE`` can be ``vllm`` or ``sglang``. We use ``vllm`` as the default in the following examples.
PPO
~~~
.. code-block:: bash
YOUR_PROJECT_NAME=r1-verl-ppo-upstream
YOUR_RUN_NAME=r1-training_ppo-upstream
# export HYDRA_FULL_ERROR=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# [ray] < 2.45.0
#export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
# [ray] >= 2.45.0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1 # Patch with https://github.com/ray-project/ray/pull/52794
GPUS_PER_NODE=8
MODEL_PATH=Qwen/Qwen2.5-0.5B-Instruct
python3 examples/data_preprocess/gsm8k.py --local_dir data/gsm8k
python3 -c "import transformers; transformers.pipeline('text-generation', model='$MODEL_PATH')"
ENGINE=vllm #sglang
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=data/gsm8k/train.parquet \
data.val_files=data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=$ENGINE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=$MODEL_PATH \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=console \
trainer.project_name=$YOUR_PROJECT_NAME \
trainer.experiment_name=$YOUR_RUN_NAME \
trainer.val_before_train=False \
trainer.n_gpus_per_node=$GPUS_PER_NODE \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.total_epochs=15 #2>&1 | tee verl_demo.log
GRPO
~~~~
.. code-block:: bash
YOUR_PROJECT_NAME=r1-verl-grpo-upstream
YOUR_RUN_NAME=r1-training_grpo-upstream
# export HYDRA_FULL_ERROR=1
# export FSDP_VERBOSE=1
#export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# [ray] < 2.45.0
#export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
# [ray] >= 2.45.0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1 # Patch with https://github.com/ray-project/ray/pull/52794
GPUS_PER_NODE=8
MODEL_PATH=Qwen/Qwen2.5-0.5B-Instruct
# MODEL_PATH=Qwen/Qwen2-7B-Instruct
python3 examples/data_preprocess/gsm8k.py --local_dir data/gsm8k
python3 -c "import transformers; transformers.pipeline('text-generation', model='$MODEL_PATH')"
ENGINE=vllm #sglang
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=data/gsm8k/train.parquet \
data.val_files=data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=$ENGINE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.fsdp_config.param_offload=False \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.project_name=$YOUR_PROJECT_NAME \
trainer.experiment_name=$YOUR_RUN_NAME \
trainer.n_gpus_per_node=$GPUS_PER_NODE \
trainer.val_before_train=False \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15
Multi-node training: slurm with Docker/Podman container
---------------------------------------------------------------------------------------
If you want to run multi-node training with slurm, you can use the following script.
.. note::
1. You need to use ``podman`` or ``docker`` in the following script. We will release the apptainer script later.
2. If you want to use ``podman``, simply replace ``docker`` with ``podman`` in the script, as sketched after this note.
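For example, a minimal sketch (assuming the script is saved as ``slurm_script.sh``, the name used later in this section) is to rewrite the command names with ``sed`` before submitting:

.. code-block:: bash

    # Replace every standalone "docker" with "podman" in the script (a .bak backup is kept)
    sed -i.bak 's/\bdocker\b/podman/g' slurm_script.sh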
The script includes the following steps:
1. SLURM Configuration
2. Environment Setup
3. Docker/Podman Container Setup
4. Ray Cluster Initialization
5. Data Preprocessing
6. Model Setup
7. Training Launch
slurm_script.sh
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=verl-ray-on-slurm
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=200G
#SBATCH --time=30-00:00:00
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=28
#SBATCH --output=../verl_log/slurm-%j.out
#SBATCH --error=../verl_log/slurm-%j.err
#SBATCH --nodelist=gpu-[0,1]
# load necessary modules
### Run this setup
# [Cluster]: Use docker
# docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
##########################################################################
###The following setting should be set in different project and cluster###
##########################################################################
### Project
CONTAINER_NAME="multinode_verl_training"
IMG="verl.rocm"
DOCKERFILE="docker/Dockerfile.rocm"
# echo $PWD
verl_workdir="${HOME}/projects/verl_upstream"
export TRANSFORMERS_CACHE="${HOME}/.cache/huggingface"
export HF_HOME=$TRANSFORMERS_CACHE
### Cluster Network Setting
export NCCL_DEBUG=TRACE
export GPU_MAX_HW_QUEUES=2
export TORCH_NCCL_HIGH_PRIORITY=1
export NCCL_CHECKS_DISABLE=1
# export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9
export NCCL_IB_GID_INDEX=3
export NCCL_CROSS_NIC=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_PROTO=Simple
export RCCL_MSCCL_ENABLE=0
export TOKENIZERS_PARALLELISM=false
export HSA_NO_SCRATCH_RECLAIM=1
##########################################################################
## Assign using GPUs
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
### For rocm and training script
# [ray] < 2.45.0
#export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
# [ray] >= 2.45.0
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1 # Patch with https://github.com/ray-project/ray/pull/52794
# Build and launch the Docker container
srun bash -c "
# Exit on any error
set -e
# Clean up dangling images (images with <none> tag)
docker image prune -f
# Need to pull the docker first
docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
if ! docker images --format "{{.Repository}}:{{.Tag}}" | grep -q "${IMG}"; then
echo \"Building ${IMG} image...\"
docker build -f \"${DOCKERFILE}\" -t \"${IMG}\" .
else
echo \"${IMG} image already exists, skipping build\"
fi
# Removing old container if exists
docker rm \"${CONTAINER_NAME}\" 2>/dev/null || true
# Checking network devices
ibdev2netdev
# Launch the docker
docker run --rm -d \
-e HYDRA_FULL_ERROR=1 \
-e RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 \
-e RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1 \
-e NCCL_DEBUG=${NCCL_DEBUG} \
-e GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES} \
-e TORCH_NCCL_HIGH_PRIORITY=${TORCH_NCCL_HIGH_PRIORITY} \
-e NCCL_CHECKS_DISABLE=${NCCL_CHECKS_DISABLE} \
-e NCCL_IB_HCA=${NCCL_IB_HCA} \
-e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX} \
-e NCCL_CROSS_NIC=${NCCL_CROSS_NIC} \
-e CUDA_DEVICE_MAX_CONNECTIONS=${CUDA_DEVICE_MAX_CONNECTIONS} \
-e NCCL_PROTO=${NCCL_PROTO} \
-e RCCL_MSCCL_ENABLE=${RCCL_MSCCL_ENABLE} \
-e TOKENIZERS_PARALLELISM=${TOKENIZERS_PARALLELISM} \
-e HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM} \
-e TRANSFORMERS_CACHE=${TRANSFORMERS_CACHE} \
-e HF_HOME=${HF_HOME} \
--network host \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v \${HOME}:\${HOME} \
-v \${HOME}/.ssh:/root/.ssh \
-w "${verl_workdir}" \
--shm-size 128G \
--name \"${CONTAINER_NAME}\" \
\"${IMG}\" \
tail -f /dev/null
echo \"Container setup completed\"
"
# (Optional): If you do not want to run in root mode and want to run as your own user,
# please add `-e HOST_UID=$(id -u)` and `-e HOST_GID=$(id -g)` to the above docker launch command.
### Ray launch the nodes before training
# Getting the node names
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST" | tr '\n' ' '))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
# make sure we set environment variables before Ray initialization
# Print out all env variables
printenv
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--dashboard-port=8266 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10
# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Debug: Starting worker on node_i = ${node_i}"
if [ -z "$node_i" ]; then
echo "Error: Empty node name for worker $i"
continue
fi
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
docker exec "${CONTAINER_NAME}" \
ray start --address "$ip_head" --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
sleep 5
done
# Ray initialization test (check whether any error occurred in the above execution)
echo "Testing Ray initialization in the slurm nodes..."
docker exec "${CONTAINER_NAME}" python3 -c '
import ray
try:
ray.init(address="auto")
print("\n=== Ray Cluster Status ===")
print(f"Number of nodes: {len(ray.nodes())}")
for node in ray.nodes():
print("Node: {}, Status: {}".format(node["NodeManagerHostname"], node["Alive"]))
# print(f"Node: {node}")
ray.shutdown()
print("Ray initialization successful!")
except Exception as e:
print(f"Ray initialization failed: {str(e)}")
'
echo "=== Ray test completed ==="
######
# Run data preprocessing
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/gsm8k.py" "--local_dir" "../data/gsm8k"
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/math_dataset.py" "--local_dir" "../data/math"
train_files="../data/gsm8k/train.parquet"
val_files="../data/gsm8k/test.parquet"
# Download and test model
echo "Loading model..."
docker exec "${CONTAINER_NAME}" \
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
# Set the model path used for training (must match the model tested above)
MODEL_PATH="Qwen/Qwen2-7B-Instruct"
echo "== Data and model loading Done =="
echo "Start to train..."
PYTHONUNBUFFERED=1 srun --overlap --nodes=${SLURM_NNODES} --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
python3 -m verl.trainer.main_ppo \
data.train_files=$train_files \
data.val_files=$val_files \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=$MODEL_PATH \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.kl_ctrl.kl_coef=0.0001 \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2-7B-Instruct_function_rm' \
trainer.n_gpus_per_node=${SLURM_GPUS_PER_NODE} \
trainer.val_before_train=False \
trainer.nnodes=${SLURM_NNODES} \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15
Run slurm_script.sh
~~~~~~~~~~~~~~~~~~~~
Simply submit your ``slurm_script.sh`` with ``sbatch``:
.. code-block:: bash
sbatch slurm_script.sh
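Once submitted, you can check the queue and follow the logs. This is a minimal sketch; the log path assumes the ``#SBATCH --output`` setting used in the script above, and ``<jobid>`` is a placeholder for the ID printed by ``sbatch``:

.. code-block:: bash

    # Check that the job is queued or running
    squeue -u $USER

    # Follow the training log (replace <jobid> with the ID printed by sbatch)
    tail -f ../verl_log/slurm-<jobid>.out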
verl performance tuning for AMD (ROCm Kernel)
=====================================================
Last updated: 04/25/2025.
Author: `Yang Wang <https://github.com/YangWang92/>`_
Patch vLLM to Enable Sleep Mode for AMD GPUs
--------------------------------------------------------------
By default, verl requires vLLM to enable sleep mode, which allows vLLM to offload GPU memory to CPU memory after rollout. However, this feature is still under review by the vLLM community.
To enable vLLM's sleep mode, you can first build vLLM from the community-patched source code (from `this pull request <https://github.com/vllm-project/vllm/pull/12695>`_). Once the patch is merged into the vLLM main branch, you will be able to install vLLM directly from the latest release.
1. Clone the vLLM repository and build it with the following commands:
.. code-block:: bash
git clone -b sleep_amd https://github.com/HollowMan6/vllm.git
cd vllm
sudo ln -sf /opt/rocm/lib/libamdhip64.so /usr/lib/libamdhip64.so
VLLM_TARGET_DEVICE=rocm ROCM_PATH=/opt/rocm/ VLLM_GPU_LANG=HIP SETUPTOOLS_SCM_PRETEND_VERSION=0.8.4.dev python3 setup.py develop
2. Additionally, make sure the ROCm version in your Docker image is greater than or equal to ROCm 6.3.4; we recommend ROCm 6.4.0 for better performance (see `this comment <https://github.com/vllm-project/vllm/pull/12695#issuecomment-2637839574>`_).
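To quickly confirm which ROCm release is inside your container, the following checks work on typical ROCm images (the exact layout under ``/opt/rocm`` may vary between releases, so treat this as a sketch):

.. code-block:: bash

    # Print the ROCm release shipped in the image (e.g. 6.4.0)
    cat /opt/rocm/.info/version

    # hipcc also reports the HIP/clang toolchain version it was built with
    hipcc --version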
After the upgrade, you can verify whether sleep mode is enabled by running the following test code (from `this comment <https://github.com/vllm-project/vllm/pull/12695#issuecomment-2637839574>`_).
.. code-block:: python
import torch
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)
def run_inference(prompt):
outputs = llm.generate(prompt)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print("CUDA Memory Usage (after inference):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")
run_inference("San Francisco is")
llm.sleep()
print("CUDA Memory Usage (after sleep):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")
llm.wake_up()
print("CUDA Memory Usage (after wakeup):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")
run_inference("Paris is")
If sleep mode is enabled, you should see the memory usage reduce after sleep.
After applying the vLLM patch and completing the installation, you can enable sleep mode in verl to reduce memory overhead. This allows verl to offload unused GPU memory during rollout, significantly lowering the memory footprint during long-context training or multi-node reinforcement learning.
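As a minimal sketch of what this looks like on the training command line, assuming the standard rollout options from ``ppo_trainer.yaml``, the relevant override is ``actor_rollout_ref.rollout.free_cache_engine``, which lets verl release the rollout engine's GPU memory between generation phases; keep the rest of your command unchanged:

.. code-block:: bash

    # Sketch: rollout overrides that make use of vLLM sleep mode
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.free_cache_engine=True \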
Enable CUDA Graph and Bypass ROCm-related issues
--------------------------------------------------------------
Due to potential issues with CUDA graph capture in ROCm, we’ve found that vLLM’s CUDA graph feature cannot be enabled on multiple nodes in verl on AMD platforms with vLLM V1 mode. This leads to significantly slower rollout performance.
Our investigation shows that ROCm may trigger an unexpected crash when attempting to capture large batches with CUDA graph. One workaround is to patch the LLM configuration (from `this commit <https://github.com/volcengine/verl/blob/v0.3.0.rc0/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py#L100-L115>`_).
.. code-block:: python
self.inference_engine = LLM(
model=model_path,
enable_sleep_mode=True,
tensor_parallel_size=tensor_parallel_size,
distributed_executor_backend="external_launcher",
dtype=config.dtype,
enforce_eager=config.enforce_eager,
gpu_memory_utilization=config.gpu_memory_utilization,
disable_custom_all_reduce=True,
disable_mm_preprocessor_cache=True,
limit_mm_per_prompt=limit_mm_per_prompt,
skip_tokenizer_init=False,
max_model_len=max_model_len,
load_format=load_format,
disable_log_stats=config.disable_log_stats,
max_num_batched_tokens=max_num_batched_tokens,
enable_chunked_prefill=config.enable_chunked_prefill,
enable_prefix_caching=True,
trust_remote_code=trust_remote_code,
# enable compilation config to bypass oom on rocm
# change depends on your GPU memory size
compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]},
seed=config.get('seed', 0),
)
Then, you can choose to enable CUDA graph by setting the following configuration override (see `this page <https://github.com/volcengine/verl/blob/v0.3.0.rc0/docs/README_vllm0.8.md>`_):
.. code-block:: bash
actor_rollout_ref.rollout.enforce_eager=False \