Data interface
=========================
Last updated: 05/19/2025 (API docstrings are auto-generated).
DataProto is the interface for data exchange.
The :class:`verl.DataProto` class contains two key members:
- batch: a :class:`tensordict.TensorDict` object for the actual data
- meta_info: a :class:`Dict` with additional meta information
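For orientation, here is a minimal sketch of putting these two members together. It assumes ``DataProto`` can be constructed directly from its ``batch`` and ``meta_info`` fields; depending on your verl version you may prefer a factory method such as ``DataProto.from_dict`` instead.
.. code-block:: python
>>> import torch
>>> from tensordict import TensorDict
>>> from verl import DataProto
>>> # two tensors sharing the same batch dimension, plus free-form metadata
>>> batch = TensorDict({"input_ids": torch.randint(0, 100, (4, 16)), "attention_mask": torch.ones(4, 16, dtype=torch.long)}, batch_size=[4])
>>> data = DataProto(batch=batch, meta_info={"temperature": 1.0})
>>> data = data.to("cpu")  # the methods documented below act on both members together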
TensorDict
~~~~~~~~~~~~
:attr:`DataProto.batch` is built on top of :class:`tensordict`, a project in the PyTorch ecosystem.
A TensorDict is a dict-like container for tensors. To instantiate a TensorDict, you must specify key-value pairs as well as the batch size.
.. code-block:: python
>>> import torch
>>> from tensordict import TensorDict
>>> tensordict = TensorDict({"zeros": torch.zeros(2, 3, 4), "ones": torch.ones(2, 3, 5)}, batch_size=[2,])
>>> tensordict["twos"] = 2 * torch.ones(2, 5, 6)
>>> zeros = tensordict["zeros"]
>>> tensordict
TensorDict(
fields={
ones: Tensor(shape=torch.Size([2, 3, 5]), device=cpu, dtype=torch.float32, is_shared=False),
twos: Tensor(shape=torch.Size([2, 5, 6]), device=cpu, dtype=torch.float32, is_shared=False),
zeros: Tensor(shape=torch.Size([2, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([2]),
device=None,
is_shared=False)
One can also index a tensordict along its batch_size. The contents of the TensorDict can be manipulated collectively as well.
.. code-block:: python
>>> tensordict[..., :1]
TensorDict(
fields={
ones: Tensor(shape=torch.Size([1, 3, 5]), device=cpu, dtype=torch.float32, is_shared=False),
twos: Tensor(shape=torch.Size([1, 5, 6]), device=cpu, dtype=torch.float32, is_shared=False),
zeros: Tensor(shape=torch.Size([1, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([1]),
device=None,
is_shared=False)
>>> tensordict = tensordict.to("cuda:0")
>>> tensordict = tensordict.reshape(6)
For more about :class:`tensordict.TensorDict` usage, see the official tensordict_ documentation.
.. _tensordict: https://pytorch.org/tensordict/overview.html
Core APIs
~~~~~~~~~~~~~~~~~
.. autoclass:: verl.DataProto
:members: to, select, union, make_iterator, concat
Single Controller interface
============================
Last updated: 05/27/2025 (API docstrings are auto-generated).
The Single Controller provides a unified interface for managing distributed workers
using Ray or other backends and executing functions across them.
It simplifies the process of dispatching tasks and collecting results, particularly
when dealing with data parallelism or model parallelism.
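To make the pieces below concrete, here is a rough sketch of defining a worker and running a method on every member of a Ray worker group. The import paths for the Ray-specific classes, the ``register``/``Dispatch`` decorator, and the resource-pool arguments are assumptions to verify against your verl version; the worker class itself is purely illustrative.
.. code-block:: python
import ray
from verl.single_controller.base import Worker
from verl.single_controller.base.decorator import Dispatch, register
from verl.single_controller.ray import RayClassWithInitArgs, RayResourcePool, RayWorkerGroup
@ray.remote
class EchoWorker(Worker):
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
    @register(dispatch_mode=Dispatch.ONE_TO_ALL)
    def ping(self):
        # runs on every worker; the worker group collects the results
        return f"{self.tag}-rank{self.rank}"
ray.init()
pool = RayResourcePool([2], use_gpu=False)            # two workers on one node
worker_cls = RayClassWithInitArgs(cls=EchoWorker, tag="demo")
wg = RayWorkerGroup(pool, worker_cls)
print(wg.ping())                                      # one result per worker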
Core APIs
~~~~~~~~~~~~~~~~~
.. autoclass:: verl.single_controller.Worker
:members: __init__, __new__, get_master_addr_port, get_cuda_visible_devices, world_size, rank
.. autoclass:: verl.single_controller.WorkerGroup
:members: __init__, world_size
.. autoclass:: verl.single_controller.ClassWithInitArgs
:members: __init__, __call__
.. autoclass:: verl.single_controller.ResourcePool
:members: __init__, world_size, local_world_size_list, local_rank_list
.. autoclass:: verl.single_controller.ray.RayWorkerGroup
:members: __init__
.. autofunction:: verl.single_controller.ray.create_colocated_worker_cls
Trainer Interface
================================
Last updated: 06/08/2025 (API docstrings are auto-generated).
Trainers drive the training loop. Introducing new trainer classes for new training paradigms is encouraged.
.. autosummary::
:nosignatures:
verl.trainer.ppo.ray_trainer.RayPPOTrainer
Core APIs
~~~~~~~~~~~~~~~~~
.. autoclass:: verl.trainer.ppo.ray_trainer.RayPPOTrainer
:members: __init__, init_workers, fit
.. automodule:: verl.utils.tokenizer
:members: hf_tokenizer
.. automodule:: verl.trainer.ppo.core_algos
:members: agg_loss, kl_penalty, compute_policy_loss
.. automodule:: verl.trainer.ppo.reward
:members: load_reward_manager, compute_reward, compute_reward_async
.. autoclass:: verl.workers.reward_manager.NaiveRewardManager
.. autoclass:: verl.workers.reward_manager.DAPORewardManager
Utilities
============
Last updated: 05/19/2025 (API docstrings are auto-generated).
This section documents the utility functions and classes in the VERL library.
Python Functional Utilities
------------------------------
.. automodule:: verl.utils.py_functional
:members: append_to_dict
File System Utilities
------------------------
.. automodule:: verl.utils.fs
:members: copy_to_local
Tracking Utilities
---------------------
.. automodule:: verl.utils.tracking
:members: Tracking
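As a hypothetical usage sketch (the ``default_backend`` argument and the ``log`` call signature are assumptions; check the class documented above), the tracker is created once per experiment and then fed scalar metrics per step:
.. code-block:: python
from verl.utils.tracking import Tracking
# console-only logging; other backends such as wandb could be listed here as well
tracker = Tracking(project_name="verl_examples", experiment_name="gsm8k", default_backend=["console"])
tracker.log(data={"reward/mean": 0.42, "actor/lr": 1e-6}, step=1)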
Metrics Utilities
---------------------
.. automodule:: verl.utils.metric
:members: reduce_metrics
Checkpoint Management
------------------------
.. automodule:: verl.utils.checkpoint.checkpoint_manager
:members: find_latest_ckpt_path
.. automodule:: verl.utils.checkpoint.fsdp_checkpoint_manager
:members: FSDPCheckpointManager
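For example, a resuming script may locate the newest checkpoint before constructing the trainer. This is a sketch under the assumption that ``find_latest_ckpt_path`` takes the checkpoint root directory used by the trainer (``trainer.default_local_dir``):
.. code-block:: python
from verl.utils.checkpoint.checkpoint_manager import find_latest_ckpt_path
# hypothetical checkpoint root written by a previous run
ckpt_root = "checkpoints/verl_examples/gsm8k"
latest = find_latest_ckpt_path(ckpt_root)
print(latest)  # path of the most recent global_step_* folder, if any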
Dataset Utilities
---------------------
.. automodule:: verl.utils.dataset.rl_dataset
:members: RLHFDataset, collate_fn
Torch Functional Utilities
-----------------------------
.. automodule:: verl.utils.torch_functional
:members: get_constant_schedule_with_warmup, masked_whiten, masked_mean, logprobs_from_logits
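To illustrate the masked reductions listed above, here is a conceptual re-implementation of a masked mean. It is not verl's actual code; the real ``masked_mean``/``masked_whiten`` may differ in signature and numerical details.
.. code-block:: python
import torch
def masked_mean_reference(values: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # average `values` only over positions where mask == 1 (e.g. response tokens)
    return (values * mask).sum() / (mask.sum() + eps)
scores = torch.tensor([1.0, 2.0, 3.0, 4.0])
mask = torch.tensor([1.0, 1.0, 0.0, 0.0])  # ignore padded positions
print(masked_mean_reference(scores, mask))  # tensor(1.5000)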
Sequence Length Balancing
----------------------------
.. automodule:: verl.utils.seqlen_balancing
:members: get_reverse_idx, rearrange_micro_batches
Ulysses Utilities
--------------------
.. automodule:: verl.utils.ulysses
:members: gather_outputs_and_unpad, ulysses_pad_and_slice_inputs
FSDP Utilities
------------------
.. automodule:: verl.utils.fsdp_utils
:members: get_fsdp_wrap_policy, get_init_weight_context_manager, init_fn, load_fsdp_model_to_gpu, load_fsdp_optimizer, offload_fsdp_model_to_cpu, offload_fsdp_optimizer
Debug Utilities
-------------------
.. automodule:: verl.utils.profiler
:members: log_gpu_memory_usage, GPUMemoryLogger
Data collection with the FSDP backend on Ascend devices
========================================================
Last updated: 07/24/2025.
This is a tutorial on collecting profiling data with the GRPO or DAPO algorithm using the FSDP backend on Ascend devices.
Configuration
-------------
Reuse the configuration items in verl/trainer/config/ppo_trainer.yaml to control the collection mode and steps,
and use the items in verl/trainer/config/npu_profile/npu_profile.yaml to control parameters such as the collection level.
Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~
Use the parameters in ppo_trainer.yaml to control the collection steps and mode:
- trainer.profile_steps:
This parameter can be set to a list of steps to profile, e.g. [2, 4],
which means steps 2 and 4 will be collected. If set to null, no collection is performed
- actor_rollout_ref.profiler:
Controls the ranks and mode of collection
- all_ranks: if set to True, data is collected on all ranks
- ranks: when all_ranks is not True, the ranks parameter selects the ranks
to collect; set it to a list of ranks, e.g. [0, 1]
- discrete:
Controls the collection mode. If set to False, end-to-end data is collected; if set to True, data is collected per training phase in discrete mode
Use the parameters in npu_profile.yaml to control the concrete collection behavior:
- save_path: storage path for the collected data
- roles: roles to collect; the following options are available
- rollout_generate: collect the generate_sequences phase of the rollout
- actor_compute_log_prob: collect the compute_log_prob phase of the actor
- actor_update: collect the update_actor phase of the actor
- ref_compute_log_prob: collect the compute_ref_log_prob phase of the ref
- all: collect all of the above phases
- level: collection level; options are level_none, level0, level1 and level2
- level_none: do not collect any level-controlled data, i.e. turn off profiler_level
- level0: collect high-level application data, low-level NPU data, and information about operators executed on the NPU
- level1: on top of level0, additionally collect CANN-layer AscendCL data and AI Core performance metrics on the NPU
- level2: on top of level1, additionally collect CANN-layer Runtime data and AI CPU data
- record_shapes: whether to record tensor shapes
- with_memory: whether to enable memory profiling
- with_npu: whether to collect device-side performance data
- with_cpu: whether to collect host-side performance data
- with_module: whether to record framework-level Python call stack information
- with_stack: whether to record operator call stack information
- analysis: whether to parse the collected data automatically
Examples
--------
Disabling collection
~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: null # disable profile
End-to-end collection
~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: False
all_ranks: True
Discrete mode collection
~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
Collecting the actor in discrete mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
npu_profile:
options:
roles: ["actor_compute_log_prob", "actor_update"]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
Visualization
-------------
The collected data is stored under the user-defined save_path and can be visualized with the `MindStudio Insight <https://www.hiascend.com/document/detail/zh/mindstudio/80RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html>`_ tool.
If the analysis parameter is set to False, offline parsing is required after collection:
.. code:: python
import torch_npu
# Set profiler_path to the parent directory of the "localhost.localdomain_<PID>_<timestamp>_ascend_pt" folder
torch_npu.profiler.profiler.analyse(profiler_path=profiler_path)
Data collection based on FSDP (Fully Sharded Data Parallel) backend on Ascend devices (NPU)
============================================================================================
Last updated: 07/24/2025.
This is a tutorial for data collection using the GRPO or DAPO algorithm
based on FSDP on Ascend devices.
Configuration
-------------
Reuse the configuration items in
verl/trainer/config/ppo_trainer.yaml to control the collection mode
and steps. Collection behavior, such as the collection level, is
controlled via verl/trainer/config/npu_profile/npu_profile.yaml.
Global collection control
~~~~~~~~~~~~~~~~~~~~~~~~~
Use parameters in ppo_trainer.yaml to control the collection mode
and steps.
- trainer.profile_steps: This parameter can be set as a list that has
collection steps, such as [2, 4], which means it will collect steps 2
and 4. If set to null, no collection occurs.
- actor_rollout_ref.profiler: Control the ranks and mode of profiling
- all_ranks: Collects data from all ranks when set to true.
- ranks: This parameter specifies which ranks to collect (e.g., [0,
1]) when all_ranks is False.
- discrete: Controls the collection mode. If False, end-to-end data
is collected; if True, data is collected in discrete phases during
training.
Use parameters in npu_profile.yaml to control collection behavior:
- save_path: Storage path for collected data.
- roles: Roles to collect. The following options are available
- rollout_generate: Collect the `generate_sequences` phase
of rollout worker.
- actor_compute_log_prob: Collect the `compute_log_prob` phase
of the actor worker.
- actor_update: Collect the `update_actor` phase of the actor worker.
- ref_compute_log_prob: Collect the `compute_ref_log_prob` phase
of the ref worker.
- all: Collect all of the above phases.
- level: Collection level—options are level_none, level0, level1, and
level2
- level_none: Disables all level-based data collection (turns off
profiler_level).
- level0: Collect high-level application data, underlying NPU data,
and operator execution details on NPU.
- level1: Extends level0 by adding CANN-layer AscendCL data and AI
Core performance metrics on NPU.
- level2: Extends level1 by adding CANN-layer Runtime data and AI
CPU metrics.
- record_shapes: Whether to record tensor shapes.
- with_memory: Whether to enable memory analysis.
- with_npu: Whether to collect device-side performance data.
- with_cpu: Whether to collect host-side performance data.
- with_module: Whether to record framework-layer Python call stack
information.
- with_stack: Whether to record operator call stack information.
- analysis: Enables automatic data parsing.
Examples
--------
Disabling collection
~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: null # disable profile
End-to-End collection
~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: False
all_ranks: True
Discrete Mode Collection
~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
Enable actor collection in discrete mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
trainer:
profile_steps: [1, 2, 5]
npu_profile:
options:
roles: ["actor_compute_log_prob", "actor_update"]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
Visualization
-------------
Collected data is stored in the user-defined save_path and can be
visualized by using the `MindStudio Insight <https://www.hiascend.com/document/detail/zh/mindstudio/80RC1/GUI_baseddevelopmenttool/msascendinsightug/Insight_userguide_0002.html>`_ tool.
If the analysis parameter is set to False, offline parsing is required after data collection:
.. code:: python
import torch_npu
# Set profiler_path to the parent directory of the "localhost.localdomain_<PID>_<timestamp>_ascend_pt" folder
torch_npu.profiler.profiler.analyse(profiler_path=profiler_path)
verl x Ascend
===================================
Last updated: 06/17/2025.
We add support for Huawei Ascend devices in verl.
Hardware Support
-----------------------------------
Atlas 200T A2 Box16
Atlas 900 A2 PODc
Installation
-----------------------------------
Basic environment setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------+-------------+
| software | version |
+-----------+-------------+
| Python | == 3.10 |
+-----------+-------------+
| CANN | == 8.1.RC1 |
+-----------+-------------+
| torch | == 2.5.1 |
+-----------+-------------+
| torch_npu | == 2.5.1.RC1|
+-----------+-------------+
vllm & vllm-ascend
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To use vllm with verl, compile and install vllm and vllm-ascend with the following commands. Note that the installation steps differ by machine type.
.. code-block:: bash
# vllm
git clone -b v0.7.3 --depth 1 https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt
# for Atlas 200T A2 Box16
VLLM_TARGET_DEVICE=empty pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/
# for Atlas 900 A2 PODc
VLLM_TARGET_DEVICE=empty pip install -e .
.. code-block:: bash
# vllm-ascend
git clone -b v0.7.3.post1 --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export COMPILE_CUSTOM_KERNELS=1
python setup.py install
Install verl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
git clone https://github.com/volcengine/verl.git
cd verl
pip install -r requirements-npu.txt
pip install -e .
Notes on other third-party libraries
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+--------------+---------------+
| software | description |
+--------------+---------------+
| transformers | v4.52.4 |
+--------------+---------------+
| flash_attn | not supported |
+--------------+---------------+
| liger-kernel | not supported |
+--------------+---------------+
| tensordict | 0.8.3 (ARM) |
+--------------+---------------+
1. --flash_attention_2 can be enabled through transformers; transformers >= 4.52.0 is required.
2. Enabling flash attention acceleration through flash_attn is not supported.
3. liger-kernel is not supported.
4. For ARM servers, tensordict 0.8.3 is required; it can be installed manually after the other dependencies are installed.
5. For x86 servers, the CPU build of torchvision needs to be installed.
.. code-block:: bash
pip install torchvision==0.20.1+cpu --index-url https://download.pytorch.org/whl/cpu
Quickstart
-----------------------------------
Before formal use, we recommend validating the environment setup and installation by training Qwen2.5-0.5B with GRPO.
1. Download the dataset and preprocess it into parquet format so that it contains the fields required to compute RL rewards
.. code-block:: bash
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
2. Run the training
.. code-block:: bash
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=128 \
data.max_prompt_length=512 \
data.max_response_length=128 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=5e-7 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.entropy_coeff=0.001 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=20 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=1 \
trainer.device=npu $@
Support status
-----------------------------------
+-----------+-------------------------+-------------+-------------------+----------------------+
| algorithm | model | rewards mae | throughput ratio | hardware |
+-----------+-------------------------+-------------+-------------------+----------------------+
| GRPO | Qwen2.5-7B-instruct | 0.38% | 0.588 | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+----------------------+
| GRPO | Qwen2.5-32B-instruct | 0.30% | 0.685 | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+----------------------+
| GRPO | Qwen2.5-VL-3B-instruct | 3.14% | 0.470 | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+----------------------+
| GRPO | Qwen2.5-VL-7B-instruct | 3.30% | 0.380 | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+----------------------+
| GRPO | Qwen2.5-VL-32B-instruct | 0.79% | 0.568 | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+----------------------+
| DAPO | Qwen2.5-7B-instruct | 3.83% | pending | Atlas 200T A2 Box16 |
+-----------+-------------------------+-------------+-------------------+----------------------+
| SFT-PEFT | Qwen2.5-0.5B-instruct | 0.06% | 0.305 | Atlas 900 A2 PODc |
+-----------+-------------------------+-------------+-------------------+----------------------+
Accuracy comparison notes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For SFT-type algorithms, we expect the mean absolute error of the loss between Huawei Ascend devices and A100 under the same configuration to be <= 2%. The calculation method is shown in the figure below. For more information, see the `accuracy calculation guide <https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/LMaccuracy_0001.html>`_.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/loss_comparison.png?raw=true
:alt: loss_comparison
Empirically, for RL-type algorithms such as GRPO, we expect the mean absolute error of rewards between Huawei Ascend devices and A100 under the same configuration to be <= 4%, computed in the same way as shown above.
Throughput comparison notes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For the Ascend NPU and the A100, the "perf/throughput" values of the first 4 steps in the logs are averaged respectively, and throughput ratio = NPU average / A100 average.
Roadmap
-----------------------------------
See the `roadmap <https://github.com/volcengine/verl/discussions/900>`_ for the support progress of more features.
Disclaimer
-----------------------------------
The Ascend support code provided in verl is for reference only. For commercial use, please reach out through official channels. Thank you.
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- Project information -----------------------------------------------------
project = "verl"
copyright = "2024 ByteDance Seed Foundation MLSys Team"
author = "Guangming Sheng, Chi Zhang, Yanghua Peng, Haibin Lin"
# -- General configuration ---------------------------------------------------
# The master toctree document.
master_doc = "index"
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"myst_parser",
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.autosectionlabel",
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
]
# Use Google style docstrings instead of NumPy docstrings.
napoleon_google_docstring = True
napoleon_numpy_docstring = False
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
source_suffix = {
".rst": "restructuredtext",
".md": "markdown",
}
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = "en"
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
# Add the JavaScript file
html_js_files = [
"js/runllm-widget.js",
"js/resizable-sidebar.js",
]
# Add custom CSS file for full-width layout
html_css_files = [
"custom.css",
]
exclude_patterns += ["README.md", "README_vllm0.7.md"]
suppress_warnings = ["ref.duplicate", "ref.myst"]
.. _config-explain-page:
Config Explanation
===================
Last updated: 06/18/2025.
ppo_trainer.yaml for RL FSDP Backend
-------------------------------------
Data
~~~~
.. code:: yaml
data:
tokenizer: null
train_files: ~/data/rlhf/gsm8k/train.parquet
val_files: ~/data/rlhf/gsm8k/test.parquet
prompt_key: prompt
max_prompt_length: 512
max_response_length: 512
train_batch_size: 1024
return_raw_input_ids: False # This should be set to true when the tokenizer between policy and rm differs
return_raw_chat: False
return_full_prompt: False
shuffle: True
filter_overlong_prompts: False
filter_overlong_prompts_workers: 1
truncation: error
image_key: images
trust_remote_code: True
custom_cls:
path: null
name: null
- ``data.train_files``: Training set parquet. Can be a list or a single
file. The program will read all files into memory, so they can't be too
large (< 100GB). The path can be either a local path or an HDFS path. For
HDFS paths, we provide utils to download them to DRAM and convert the
HDFS path to a local path.
- ``data.val_files``: Validation parquet. Can be a list or a single
file.
- ``data.prompt_key``: The field in the dataset where the prompt is
located. Default is 'prompt'.
- ``data.max_prompt_length``: Maximum prompt length. All prompts will be
left-padded to this length. An error will be reported if the length is
too long
- ``data.max_response_length``: Maximum response length. Rollout in RL
algorithms (e.g. PPO) generates up to this length
- ``data.train_batch_size``: Batch size sampled for one training
iteration of different RL algorithms.
- ``data.return_raw_input_ids``: Whether to return the original
input_ids without adding the chat template. This is mainly used to
accommodate situations where the reward model's chat template differs
from the policy's: the data needs to be decoded first, then the RM's
chat template applied. If using a model-based RM and the policy and RM
chat_templates are different, this flag needs to be set
- ``data.return_raw_chat``: Whether to return the original chat (prompt)
without applying chat template.
- ``data.return_full_prompt``: Whether to return the full prompt with chat template
- ``data.shuffle``: Whether to shuffle the data in the dataloader.
- ``data.filter_overlong_prompts``: Whether to filter out overlong prompts. Disabled by default.
- ``data.filter_overlong_prompts_workers``: For large-scale datasets, filtering
overlong prompts can be time-consuming. You can set ``filter_overlong_prompts_workers``
to use multiprocessing and speed it up. Defaults to 1.
- ``data.truncation``: How to handle input_ids or prompts that exceed
max_prompt_length. Default is 'error', which does not allow exceeding
max_prompt_length; increase max_prompt_length if the error is thrown.
You can also set ``left``, ``right`` or ``middle``.
When ``middle`` is selected, the logic splits the allowed max length roughly in half
and keeps the head and tail of the sequence, effectively discarding the middle section.
- ``data.image_key``: The field in the multi-modal dataset where the image is
located. Default is 'images'.
- ``data.trust_remote_code``: If the remote tokenizer ships custom Python files, this field allows
using the remote tokenizer. For example: moonshotai/Moonlight-16B-A3B-Instruct
Customized Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~
Customized dataset extension is implemented for the SFT trainer and can be extended to other trainers with similar changes.
.. code:: yaml
custom_cls:
path: null
name: null
- ``data.custom_cls.path``: The path to the file containing your customized dataset class. If not specified, pre-implemented dataset will be used.
- ``data.custom_cls.name``: The name of the dataset class within the specified file.
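As an illustrative skeleton only (the constructor arguments below are assumptions; align them with how your trainer instantiates the dataset and see ``verl.utils.dataset.rl_dataset.RLHFDataset`` for the reference implementation), a customized dataset file could look like this:
.. code:: python
# my_datasets/chat_dataset.py (hypothetical module referenced by data.custom_cls.path)
import pandas as pd
from torch.utils.data import Dataset
class MyChatDataset(Dataset):
    def __init__(self, data_files, tokenizer, config=None):
        frames = [pd.read_parquet(f) for f in data_files]
        self.dataframe = pd.concat(frames).reset_index(drop=True)
        self.tokenizer = tokenizer
        self.config = config
    def __len__(self):
        return len(self.dataframe)
    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        # tokenize the raw prompt; the real RLHFDataset also builds attention masks, positions, etc.
        return {"input_ids": self.tokenizer(row["prompt"])["input_ids"]}
``data.custom_cls.path`` would then point at this file and ``data.custom_cls.name`` would be set to ``MyChatDataset``.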
Actor/Rollout/Reference Policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
actor_rollout_ref:
hybrid_engine: True
model:
path: ~/models/deepseek-llm-7b-chat
external_lib: null
override_config:
model_config: {}
moe_config: # Megatron only, can adjust moe configuration
freeze_moe_router: False # Megatron only, can freeze moe router (no grad)
enable_gradient_checkpointing: False
enable_activation_offload: False
trust_remote_code: False
use_remove_padding: False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 256
ppo_micro_batch_size: null # will be deprecated, use ppo_micro_batch_size_per_gpu
ppo_micro_batch_size_per_gpu: 8
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
grad_clip: 1.0
clip_ratio: 0.2
entropy_coeff: 0.0
use_kl_loss: False # True for GRPO
use_torch_compile: True # False to disable torch compile
kl_loss_coef: 0.001 # for grpo
kl_loss_type: low_var_kl # for grpo
ppo_epochs: 1
data_loader_seed: null
shuffle: False
ulysses_sequence_parallel_size: 1 # sp size
optim:
lr: 1e-6
lr_warmup_steps: -1 # Prioritized. Negative values mean delegating to lr_warmup_steps_ratio.
lr_warmup_steps_ratio: 0. # the total steps will be injected during runtime
min_lr_ratio: 0.0 # only used with cosine lr scheduler, default to 0.0
num_cycles: 0.5 # only used with cosine lr scheduler, default to 0.5
warmup_style: constant # select from constant/cosine
total_training_steps: -1 # must be overridden by the program
fsdp_config:
wrap_policy:
# transformer_layer_cls_to_wrap: None
min_num_params: 0
param_offload: False
optimizer_offload: False
fsdp_size: -1
checkpoint:
# What to include in saved checkpoints
# with 'hf_model' you can save the whole model in HF format; for now only the sharded model checkpoint is used, to save space
save_contents: ['model', 'optimizer', 'extra']
# For more flexibility, you can specify the contents to load from the checkpoint.
load_contents: ${actor_rollout_ref.actor.checkpoint.save_contents}
ref:
fsdp_config:
param_offload: False
wrap_policy:
# transformer_layer_cls_to_wrap: None
min_num_params: 0
log_prob_micro_batch_size: null # will be deprecated, use log_prob_micro_batch_size_per_gpu
log_prob_micro_batch_size_per_gpu: 16
log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size
rollout:
name: vllm
temperature: 1.0
top_k: -1 # 0 for hf rollout, -1 for vllm rollout
top_p: 1
prompt_length: ${data.max_prompt_length} # not use for opensource
response_length: ${data.max_response_length}
# for vllm rollout
dtype: bfloat16 # should align with FSDP
gpu_memory_utilization: 0.5
ignore_eos: False
enforce_eager: True
free_cache_engine: True
load_format: dummy_dtensor
tensor_model_parallel_size: 2
max_num_batched_tokens: 8192
max_num_seqs: 1024
log_prob_micro_batch_size: null # will be deprecated, use log_prob_micro_batch_size_per_gpu
log_prob_micro_batch_size_per_gpu: 16
log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
# for hf rollout
do_sample: True
engine_kwargs: # inference engine parameters
vllm:
swap_space: null # null means "use the engine default value" (usually 4 GB), setting it to, e.g., 32 means 32 GB
disable_mm_preprocessor_cache: False # disable preprocessor cache for multimodal models
sglang:
attention_backend: null # null means use the engine default value, available options: flashinfer, triton, flashmla
n: 1 # for each prompt, sample n responses (i.e. num sample times). set it to values > 1 for grpo, rloo
val_kwargs:
# sampling parameters for validation
top_k: -1 # 0 for hf rollout, -1 for vllm rollout
top_p: 1.0
temperature: 0
n: 1
do_sample: False # default eager for validation
agent:
custom_async_server: # Use custom async server implementation for rollout
path: null
name: null
**Common config for actor, rollout and reference model**
- ``actor_rollout_ref.hybrid_engine``: Whether to use the hybrid engine;
currently only the hybrid engine is supported
- ``actor_rollout_ref.model.path``: Huggingface model path. This can be
either local path or HDFS path. For HDFS path, we provide utils to
download it to DRAM and convert the HDFS path to local path.
- ``actor_rollout_ref.model.external_lib``: Additional Python packages
that need to be imported. Used to register models or tokenizers into
the Huggingface system.
- ``actor_rollout_ref.model.override_config``: Used to override some of
the model's original configurations, mainly dropout
- ``actor_rollout_ref.model.enable_gradient_checkpointing``: FSDP only; whether
to enable gradient checkpointing for the actor.
Megatron uses the recompute options in ``override_transformer_config`` to set this
- ``actor_rollout_ref.model.enable_activation_offload``: Whether to enable
activation offloading for the actor
- ``actor_rollout_ref.model.trust_remote_code``: Whether to enable loading
a remote code model
- ``actor_rollout_ref.model.use_fused_kernels``: Whether to use fused
kernels in the model. If set to True, the following parameters will be
used.
- ``actor_rollout_ref.model.fused_kernel_options.impl_backend``: The
implementation backend for fused kernels. Options: "triton" or
"torch". Default is "torch".
In Megatron, only "triton" is supported as the implementation
backend, so this option is not needed there.
- ``actor_rollout_ref.model.use_remove_padding``: Whether to use remove
padding in the model. If set to True, the model removes padding
tokens from the input_ids and response_ids, which significantly improves running efficiency.
**Actor model**
- ``actor_rollout_ref.actor.strategy``: fsdp or megatron. In this
example, we use fsdp backend.
- ``actor_rollout_ref.actor.ppo_mini_batch_size``: The sampled data is split
into multiple mini-batches with batch_size=ppo_mini_batch_size for PPO
updates. The ppo_mini_batch_size is a global size across all workers/GPUs
- ``actor_rollout_ref.actor.ppo_micro_batch_size``: [Will be deprecated, use ppo_micro_batch_size_per_gpu]
Similar to gradient accumulation, the micro_batch_size_per_gpu for one forward pass,
trading speed for GPU memory. The value represents the global size.
- ``actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu``: Similar to gradient
accumulation, the micro_batch_size_per_gpu for one forward pass, trading speed
for GPU memory. The value represents the local size per GPU.
- ``actor_rollout_ref.actor.grad_clip``: Gradient clipping for actor
updates
- ``actor_rollout_ref.actor.use_kl_loss``: Whether to use a KL loss in the actor. When used, KL is not applied in the reward function.
- ``actor_rollout_ref.actor.clip_ratio``: PPO clip ratio
- ``actor_rollout_ref.actor.use_torch_compile``: Whether to use torch compile in actor
- ``actor_rollout_ref.actor.entropy_coeff``: The weight of entropy when
calculating the PPO loss. The default value has been changed to 0.0 since v0.3.x
- ``actor_rollout_ref.actor.ppo_epochs``: Number of epochs for PPO
updates on one set of sampled data
- ``actor_rollout_ref.actor.data_loader_seed``: Since torch 2.6.0, the Megatron backend can get an inconsistent seed generated by PyTorch
across cp ranks, causing data misalignment between these ranks, so we manually set the seed to avoid hanging
issues. If ``actor_rollout_ref.actor.shuffle`` is not null, this must be set.
- ``actor_rollout_ref.actor.shuffle``: Whether to shuffle data when
there are multiple epochs
- ``actor_rollout_ref.actor.optim``: Actor's optimizer parameters
- ``actor_rollout_ref.actor.fsdp_config``: FSDP config for actor
training
- ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingface's
wrap policy, i.e., wrapping by DecoderLayer
- No need to set transformer_layer_cls_to_wrap, so we comment it.
- ``*_offload``: Whether to enable parameter, gradient and optimizer
offload
- Trading speed for GPU memory.
- ``actor_rollout_ref.actor.use_kl_loss``: Whether to enable kl loss. Default is False.
- ``actor_rollout_ref.actor.kl_loss_coef``: The coefficient of kl loss. Default is 0.001.
- ``actor_rollout_ref.actor.kl_loss_type``: Support ``kl`` (``k1``), ``abs``, ``mse`` (``k2``), ``low_var_kl`` (``k3``) and ``full``. How to calculate the kl divergence between actor and reference policy. For specific options, refer to `kl_penalty()` in `core_algos.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/core_algos.py>`_ . See this blog post for detailed analysis: http://joschu.net/blog/kl-approx.html
- ``actor_rollout_ref.actor.checkpoint``: The configuration of the checkpoint function in the actor
- ``save_contents``: The contents to save in the checkpoint. By default, we save the model, optimizer and extra information in the checkpoint.
The extra information currently includes RNG states; FSDP supports saving the lr_scheduler, and the Megatron opt_param_scheduler is coming soon.
We do not store hf_model in the checkpoint by default, but we provide a tool in ``scripts/model_merge.py`` to convert the checkpoint format to HF format.
- ``load_contents``: The contents to load from the checkpoint; you can specify different loading contents. By default, it is the same as ``save_contents``.
**Reference Model**
The reference model is enabled when ``actor.use_kl_loss`` and/or ``algorithm.use_kl_in_reward`` is True.
- ``actor_rollout_ref.ref``: FSDP config same as actor. **For models
larger than 7B, it's recommended to turn on offload for ref by
default**
- ``actor_rollout_ref.ref.log_prob_micro_batch_size``: [Will be deprecated, use log_prob_micro_batch_size_per_gpu]
The batch size for one forward pass in the computation of ``ref_log_prob``. The value represents the global size.
- ``actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu``: The batch size
for one forward pass in the computation of ``ref_log_prob``. The value represents the local size per GPU.
**Rollout Model**
- ``actor_rollout_ref.rollout.name``: hf/vllm/sglang.
- Rollout (Auto-regressive) parameters. The key should be equal to the
property name in vLLM's ``SamplingParams``.
- ``temperature``, ``top_k``, ``top_p`` and others: Sampling
parameters in ``SamplingParams``.
- ``actor_rollout_ref.rollout.dtype``: Rollout model parameter type. This should align with
the actor model parameter type in the FSDP/Megatron backend.
- ``actor_rollout_ref.rollout.gpu_memory_utilization``:
- For vLLM v0.7.0 and later: The fraction of **total** GPU memory to be used for the vLLM instance.
- For SGLang: Corresponding to ``mem_fraction_static``, the fraction of the free GPU memory used for **static** memory like model weights and KV cache.
- ``actor_rollout_ref.rollout.tensor_model_parallel_size``: TP size for rollout. Only effective
for vllm.
- ``actor_rollout_ref.rollout.log_prob_micro_batch_size``: [Will be deprecated, use log_prob_micro_batch_size_per_gpu]
The batch size for one forward pass in the computation of ``log_prob``. The value represents the global size.
- ``actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu``: Micro batch size per GPU (the batch size for
one forward pass) for recalculating ``log_prob``. The value represents the local size per GPU.
- ``actor_rollout_ref.rollout.do_sample``: Whether to sample during training rollout. If set to False, the rollout model
will perform greedy sampling.
- ``actor_rollout_ref.rollout.val_kwargs``: Sampling parameters used specifically during validation.
- ``top_k``: Top-k sampling parameter. Default to -1 for vLLM rollout or 0 for HF rollout.
- ``top_p``: Top-p sampling parameter. Default is 1.0 (disabled).
- ``temperature``: Sampling temperature. Default is 0 (deterministic greedy).
- ``n``: Number of responses to generate during validation. Default is 1.
- ``do_sample``: Whether to use sampling during validation. Default is False for
deterministic outputs. When set to True, the rollout will use the ``actor_rollout_ref.rollout.val_kwargs`` parameters
(top_k, top_p, temperature) to control the sampling behavior.
- ``actor_rollout_ref.rollout.engine_kwargs.vllm``: extra vllm engine args
- ``swap_space``: swap space in GB used by the inference engine. Positive integer, e.g., ``32`` means 32 GB. ``null``: means not setting and using the engine default value (usually, e.g., 4 GB for vLLM)
- ``disable_mm_preprocessor_cache``: Whether to disable the preprocessor cache for multimodal models.
- ``actor_rollout_ref.rollout.engine_kwargs.sglang``: extra sglang engine args
- ``attention_backend``: The attention backend to use for the inference engine.
- ``null``: means not setting and using the engine default value (usually, e.g., ``fa3`` for SGLang)
- ``flashinfer``: Use flashinfer attention backend.
- ``triton``: Use triton attention backend.
- ``flashmla``: Use flashmla attention backend.
- ``actor_rollout_ref.rollout.ignore_eos``: Whether to ignore the EOS
token and continue generating tokens after the EOS token is generated.
- ``actor_rollout_ref.rollout.free_cache_engine``: Offload the KVCache
after rollout generation stage. Default is True. When set to True,
for vllm v0.5.4 and v0.6.3, we need to disable the usage of CUDAGraph
(set ``enforce_eager`` to True.)
- ``actor_rollout_ref.rollout.enforce_eager``: Whether to use CUDAGraph
in vLLM generation. Default set to True to disable CUDAGraph.
- ``actor_rollout_ref.rollout.load_format``: Which weight loader to use
to load the actor model weights to the rollout model.
- ``auto``: Use Megatron weight loader.
- ``megatron``: Use Megatron weight loader. Deployed with Megatron
backend. The input model ``state_dict()`` is already partitioned
along TP dimension and already gathered along PP dimension. This
weight loader requires that the Rollout model and Actor model's
parameters shape and name should be identical.
- ``dtensor``: Default solution when using the Huggingface weight loader.
Deployed with the FSDP backend and the state_dict_type is
``StateDictType.SHARDED_STATE_DICT``. This is the recommended weight
loader
- ``hf``: Use Huggingface weight loader. Deployed with FSDP backend
and the state_dict_type is ``StateDictType.FULL_STATE_DICT``. This
solution doesn't need to rewrite the weight loader for each model
implemented in vLLM but it results in larger peak memory usage.
- ``dummy_hf``, ``dummy_megatron``, ``dummy_dtensor``: Random
initialization.
.. note:: In this config field, users only need to select from ``dummy_megatron``, ``dummy_dtensor``, ``dummy_hf`` for rollout initialization, and our hybrid engine will select the corresponding weight loader (i.e., ``megatron``, ``dtensor``, ``hf``) during actor/rollout weight synchronization.
Megatron Optimizer and Optimizer Parameter Scheduler
____________________________________________________
.. code:: yaml
optim:
optimizer: adam
lr: 1e-6
clip_grad: 1.0
total_training_steps: -1 # must be overridden by the program
lr_warmup_init: 0.0 # initial learning rate for warmup, default to 0.0
lr_warmup_steps: -1 # Prioritized. Negative values mean delegating to lr_warmup_steps_ratio.
lr_warmup_steps_ratio: 0. # the total steps will be injected during runtime
lr_decay_steps: null
lr_decay_style: constant # select from constant/linear/cosine/inverse_square_root
min_lr: 0.0 # minimum learning rate, default to 0.0
weight_decay: 0.01
weight_decay_incr_style: constant # select from constant/linear/cosine
lr_wsd_decay_style: exponential # select from constant/exponential/cosine
lr_wsd_decay_steps: null
use_checkpoint_opt_param_scheduler: False # use checkpoint optimizer parameter scheduler
Notice that there are some differences in the APIs between the Megatron optimizer and the FSDP optimizer.
- The Megatron optimizer scheduler names the period after lr_warmup ``lr_decay_steps``, so ``warmup_style`` actually means the style of lr decay after warmup.
- The Megatron optimizer also supports decaying the weight decay over training.
- ``use_checkpoint_opt_param_scheduler`` determines whether to use the checkpoint optimizer parameter scheduler. If set to True, the optimizer parameter scheduler is saved in the checkpoint and loaded from the checkpoint when resuming training.
For learning rate decay, the original Megatron pretraining default for ``lr_decay_style`` is ``linear``,
meaning that the learning rate is linearly decayed from the initial learning rate to ``min_lr`` within
``lr_decay_steps``. However, in verl, to align with FSDP's default behavior, we set the default
``lr_decay_style`` to ``constant``, meaning that the learning rate is kept constant after the warmup stage.
Critic Model
~~~~~~~~~~~~
Most parameters for Critic are similar to Actor Model.
Reward Model
~~~~~~~~~~~~
.. code:: yaml
reward_model:
enable: False
model:
input_tokenizer: ${actor_rollout_ref.model.path} # set this to null if the chat template is identical
path: ~/models/Anomy-RM-v0.1
external_lib: ${actor_rollout_ref.model.external_lib}
trust_remote_code: False
fsdp_config:
min_num_params: 0
param_offload: False
micro_batch_size_per_gpu: 16
max_length: null
reward_manager: naive
- ``reward_model.enable``: Whether to enable reward model. If False, we
compute the reward only with the user-defined reward functions. In
GSM8K and Math examples, we disable reward model. For RLHF alignment
example using full_hh_rlhf, we utilize reward model to assess the
responses. If False, the following parameters are not effective.
- ``reward_model.model``
- ``input_tokenizer``: Input tokenizer. If the reward model's chat
template is inconsistent with the policy, we need to first decode to
plaintext, then apply the rm's chat_template. Then score with RM. If
chat_templates are consistent, it can be set to null.
- ``path``: RM's HDFS path or local path. Note that RM only supports
AutoModelForSequenceClassification. Other model types need to define
their own RewardModelWorker and pass it from the code.
- ``trust_remote_code``: Whether to enable loading a remote code model,
default to False.
- ``reward_model.reward_manager``: Reward Manager. This defines the mechanism
of computing rule-based reward and handling different reward sources. Default
is ``naive``. If all verification functions are multiprocessing-safe, the reward
manager can be set to ``prime`` for parallel verification.
Customized Reward Function
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
custom_reward_function:
path: null
name: compute_score
- ``custom_reward_function.path``: The path to the file containing your customized reward function. If not specified, pre-implemented reward functions will be used.
- ``custom_reward_function.name`` (Optional) : The name of the reward function within the specified file. Default is 'compute_score'.
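A minimal sketch of such a file is shown below. The argument names mirror verl's built-in reward scores (``data_source``, ``solution_str``, ``ground_truth``, ``extra_info``); confirm the exact signature against the verl version you use, and note that the scoring rule here is purely illustrative.
.. code:: python
# my_reward.py (hypothetical file referenced by custom_reward_function.path)
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # toy rule: full reward for an exact string match, otherwise zero
    return 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
Point ``custom_reward_function.path`` at this file and keep ``name: compute_score``, or change ``name`` to match your function.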
Algorithm
~~~~~~~~~
.. code:: yaml
algorithm:
gamma: 1.0
lam: 1.0
adv_estimator: gae
use_kl_in_reward: False
kl_penalty: kl # how to estimate kl divergence
kl_ctrl:
type: fixed
kl_coef: 0.005
horizon: 10000
target_kl: 0.1
- ``gamma``: discount factor
- ``lam``: Trade-off between bias and variance in the GAE estimator
- ``adv_estimator``: Support ``gae``, ``grpo``, ``reinforce_plus_plus``, ``reinforce_plus_plus_baseline``, ``rloo``
- ``use_kl_in_reward``: Whether to enable in-reward kl penalty. Default is False.
- ``kl_penalty``: Support ``kl``, ``abs``, ``mse``, ``low_var_kl`` and ``full``. How to
calculate the kl divergence between actor and reference policy. For
specific options, refer to `kl_penalty()` in `core_algos.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/core_algos.py>`_ . A conceptual sketch of these estimators is given after this list.
- ``kl_ctrl``: Config for in-reward kl_penalty controller
- ``kl_coef``: The (initial) coefficient of in-reward kl_penalty. Default is 0.001.
- ``type``: 'fixed' for FixedKLController and 'adaptive' for AdaptiveKLController.
- ``horizon`` and ``target_kl``: See source code of AdaptiveKLController for details.
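For intuition, here is a conceptual sketch of the per-token estimators behind the ``kl``, ``mse`` and ``low_var_kl`` options, following the approximations discussed in the blog post linked from the actor's ``kl_loss_type`` description; the authoritative implementation is `kl_penalty()` in `core_algos.py`.
.. code:: python
import torch
def kl_estimates(logprob: torch.Tensor, ref_logprob: torch.Tensor):
    # log ratio of the actor policy vs. the reference policy at the sampled tokens
    log_ratio = logprob - ref_logprob
    k1 = log_ratio                               # 'kl': plain log-ratio
    k2 = 0.5 * log_ratio.pow(2)                  # 'mse': half squared log-ratio
    k3 = torch.exp(-log_ratio) + log_ratio - 1   # 'low_var_kl': non-negative, lower variance
    return k1, k2, k3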
Trainer
~~~~~~~
.. code:: yaml
trainer:
total_epochs: 30
project_name: verl_examples
experiment_name: gsm8k
logger: ['console', 'wandb']
log_val_generations: 0
nnodes: 1
n_gpus_per_node: 8
save_freq: -1
val_before_train: True
test_freq: 2
critic_warmup: 0
default_hdfs_dir: null # hdfs checkpoint path
default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name} # local checkpoint path
resume_mode: auto # or disable or resume_path if resume_from_path is set
resume_from_path: null
remove_previous_ckpt_in_save: False
del_local_ckpt_after_load: False
ray_wait_register_center_timeout: 300
- ``trainer.total_epochs``: Number of epochs in training.
- ``trainer.project_name``: For wandb, swanlab, mlflow
- ``trainer.experiment_name``: For wandb, swanlab, mlflow
- ``trainer.logger``: Support console and wandb, swanlab, mlflow, tensorboard
- ``trainer.log_val_generations``: The number of generations logged during validation (default ``0``)
- ``trainer.nnodes``: Number of nodes used in the training.
- ``trainer.n_gpus_per_node``: Number of GPUs per node.
- ``trainer.save_freq``: The frequency (by iteration) to save checkpoint
of the actor and critic model.
- ``trainer.val_before_train``: Whether to run validation before training.
- ``trainer.test_freq``: The validation frequency (by iteration).
- ``trainer.critic_warmup``: The number of iterations to train the critic
model before actual policy learning.
- ``trainer.resume_mode``: The mode of resuming training. Support
``disable``, ``auto`` and ``resume_path``. If set to ``auto`` as default, the
program will automatically resume from the latest checkpoint in the
``default_local_dir``. If set to ``resume_path``, the program will resume
from the path specified in ``resume_from_path``.
- ``trainer.resume_from_path``: The path to resume training from. Only
effective when ``resume_mode`` is set to ``resume_path``.
- ``trainer.remove_previous_ckpt_in_save``: Whether to remove previous
checkpoints in the save directory. Default is False.
- ``trainer.del_local_ckpt_after_load``: Whether to delete local
checkpoints after loading them. Default is False.
- ``trainer.ray_wait_register_center_timeout``: The timeout for waiting
for the ray register center to be ready. Default is 300 seconds.
This figure illustrates how the configurations affect the training.
https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA
.. image:: https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d
evaluation.yaml
---------------
Data
~~~~
.. code:: yaml
data:
path: /tmp/math_Qwen2-7B-Instruct.parquet
prompt_key: prompt
response_key: responses
data_source_key: data_source
reward_model_key: reward_model
- ``data.path``: Path to the dataset file (Parquet format).
- ``data.prompt_key``: The field in the dataset where the prompt is located. Default is 'prompt'.
- ``data.response_key``: The key that holds the generated responses. This should be a list of strings representing the responses. Default is 'responses'.
- ``data.data_source_key``: This is used to separate metric calculations for different data sources, ensuring that metrics are calculated independently for each source.
- ``data.reward_model_key``: The key that holds the reference answers. These reference answers typically serve as the ground truth or test cases for the task.
Customized Reward Function
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: yaml
custom_reward_function:
path: null
name: compute_score
- ``custom_reward_function.path``: The path to the file containing your customized reward function. If not specified, pre-implemented reward functions will be used.
- ``custom_reward_function.name`` (Optional) : The name of the reward function within the specified file. Default is 'compute_score'.
sft_trainer.yaml for SFT FSDP Backend
--------------------------------------
Optim
~~~~~~~
.. code:: yaml
optim:
lr: 1e-5
weight_decay: 0.01
warmup_steps_ratio: 0.1
clip_grad: 1.0
lr_scheduler: cosine
- ``optim.lr``: Learning rate for the optimizer.
- ``optim.weight_decay``: Weight decay for the optimizer.
- ``optim.warmup_steps_ratio``: Ratio of warmup steps to total training steps.
- ``optim.clip_grad``: Gradient clipping value.
- ``optim.lr_scheduler``: Learning rate scheduler type. Options:
- ``cosine``: Cosine learning rate scheduler with warmup (default).
- ``wsd``: Warmup-Stable-Decay scheduler that provides a stable learning rate phase between warmup and decay phases.
Model
~~~~~~~~~~~~
Most parameters for Model are similar to Reward Model.
.. code:: yaml
model:
partial_pretrain: ~/models/gemma-1.1-7b-it
fsdp_config:
model_dtype: fp32
wrap_policy:
min_num_params: 0
cpu_offload: False
offload_params: False
external_lib: null
enable_gradient_checkpointing: False
trust_remote_code: False
lora_rank: 0
lora_alpha: 16
target_modules: all-linear
use_liger: False
- ``partial_pretrain``: HDFS path or local path for the pretrained model.
- ``fsdp_config``
- ``model_dtype``: Model parameters type, default to ``fp32``.
Support: ``bf16``, ``fp16``, ``fp32``.
- ``cpu_offload``: Whether to enable CPU offloading for FSDP. If True,
the offload_params will be used as argument.
- ``offload_params``: Whether to offload parameters to CPU
when not involved in computation. If True, then this offloads gradients
to CPU as well, meaning that the optimizer step runs on CPU.
- ``lora_rank``: The rank of the LoRA model, default to 0. If ``lora_rank``>0,
we will train LoRA modules instead of tuning the full model.
- ``lora_alpha``: The alpha parameter for LoRA scaling, default to 16.
- ``target_modules``: The names of the modules to apply the adapter to,
default to ``all-linear``. See `peft docs <https://huggingface.co/docs/peft/v0.15.0/en/package_reference/lora#peft.LoraConfig.target_modules>`_ for detail.
- ``use_liger``: Whether to enable Liger kernel, default to False. If True,
we apply Liger kernel to the model (depends on `liger-kernel`).
GSM8K Example
=============
Last updated: 03/25/2025.
Introduction
------------
In this example, we train an LLM to tackle the GSM8k task.
Paper: https://arxiv.org/pdf/2110.14168
Dataset: https://huggingface.co/datasets/gsm8k
Note that the original paper mainly focuses on training a verifier (a
reward model) to solve math problems via Best-of-N sampling. In this
example, we train an RLHF agent using a rule-based reward model.
Dataset Introduction
--------------------
GSM8k is a math problem dataset. The prompt is an elementary school
math problem, and the LLM is required to answer it.
The training set contains 7473 samples and the test set contains 1319
samples.
**An example**
Prompt
Katy makes coffee using teaspoons of sugar and cups of water in the
ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups
of water, calculate the number of teaspoonfuls of sugar she used.
Solution
The total ratio representing the ingredients she used to make the
coffee is 7+13 = <<7+13=20>>20. Since the fraction representing the
number of teaspoons she used is 7/20, she used
7/20*120 = <<7/20*120=42>>42 #### 42
Step 1: Prepare dataset
-----------------------
.. code:: bash
cd examples/data_preprocess
python3 gsm8k.py --local_dir ~/data/gsm8k
Step 2: Download Model
----------------------
There are three ways to prepare the model checkpoints for post-training:
- Download the required models from huggingface or modelscope
.. code:: bash
huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct --local-dir-use-symlinks False
# or
modelscope download --model deepseek-ai/deepseek-math-7b-instruct --local_dir ~/models/deepseek-math-7b-instruct
- Use a model already stored in a local directory or an HDFS path.
- Also, you can directly use the model name in huggingface (e.g.,
deepseek-ai/deepseek-math-7b-instruct) in
``actor_rollout_ref.model.path`` and ``critic.model.path`` field in
the run script. You can also download models from modelscope by setting environmental variable ``VERL_USE_MODELSCOPE=True``.
See examples/ppo_trainer/run_deepseek7b_llm_modelscope.sh for example.
Note that users should prepare checkpoints for the actor, critic and reward
model.
[Optional] Step 3: SFT your Model
---------------------------------
We provide an SFT Trainer using PyTorch FSDP in
`fsdp_sft_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/fsdp_sft_trainer.py>`_.
Users can customize their own SFT
script using our FSDP SFT Trainer.
We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft directory <https://github.com/volcengine/verl/blob/main/examples/sft/gsm8k/>`_.
.. code:: shell
set -x
torchrun -m verl.trainer.fsdp_sft_trainer \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.prompt_key=question \
data.response_key=answer \
data.micro_batch_size_per_gpu=8 \
model.partial_pretrain=deepseek-ai/deepseek-coder-6.7b-instruct \
trainer.project_name=gsm8k-sft \
trainer.experiment_name=gsm8k-sft-deepseek-coder-6.7b-instruct \
trainer.total_epochs=4 \
trainer.logger='["console","wandb"]'
If you use AMD GPUs (ROCm kernel), you need to add the following environment variables into the run script:
.. code-block:: bash
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
Step 4: Perform PPO training with your model on GSM8K Dataset
-------------------------------------------------------------
- Prepare your own run.sh script. Here's an example for GSM8k dataset
and deepseek-llm-7b-chat model.
- Users could replace the ``data.train_files`` ,\ ``data.val_files``,
``actor_rollout_ref.model.path`` and ``critic.model.path`` based on
their environment.
- See :doc:`config` for detailed explanation of each config field.
**Reward Model/Function**
We use a rule-based reward model. We force the model to produce a final
answer following four "#" characters, as shown in the solution. We extract the final
answer from both the solution and the model's output using regular
expression matching, compare them, and assign a reward of 1 for a correct
answer, 0.1 for an incorrect answer, and 0 for no answer.
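For intuition, a minimal sketch of such a rule-based scorer is given below; the actual implementation lives in ``verl/utils/reward_score/gsm8k.py`` and differs in details (answer normalization, format checks, etc.).
.. code:: python
import re
def gsm8k_score_sketch(solution_str, ground_truth):
    # extract the number that follows '####' in the model output
    match = re.search(r"####\s*(-?[0-9.,]+)", solution_str)
    if match is None:
        return 0.0                         # no parsable final answer
    answer = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if answer == ground_truth else 0.1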
**Training Script**
Training script examples for the FSDP and Megatron-LM backends are stored in the examples/ppo_trainer directory.
.. code:: bash
cd ../ppo_trainer
bash run_deepseek7b_llm.sh
The content of run_deepseek7b_llm.sh:
.. code:: bash
set -x
python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size_per_gpu=32 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=1 \
trainer.total_epochs=15 $@
If you use AMD GPUs (ROCm kernel), you need to add the following environment variables into the run script:
.. code-block:: bash
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
If you encounter any issues when running verl on AMD GPUs, feel free to contact `Yusheng Su <https://yushengsu-thu.github.io/>`_.
Multi-Modal Example Architecture
=================================
Last updated: 04/28/2025.
Introduction
------------
verl now supports multi-modal training. You can use FSDP and
vLLM/SGLang to start a multi-modal RL task. Megatron support is also
on the way.
Follow the steps below to quickly start a multi-modal RL task.
Step 1: Prepare dataset
-----------------------
.. code:: bash
# it will be saved in the $HOME/data/geo3k folder
python examples/data_preprocess/geo3k.py
Step 2: Download Model
----------------------
.. code:: bash
# download the model from huggingface
python3 -c "import transformers; transformers.pipeline(model='Qwen/Qwen2.5-VL-7B-Instruct')"
Step 3: Perform GRPO training with multi-modal model on Geo3K Dataset
---------------------------------------------------------------------
.. code:: bash
# run the task
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
PPO Example Architecture
========================
Last updated: 02/17/2025.
Let's start with the Proximal Policy Optimization algorithm, which is
the most widely used algorithm in LLM post-training.
The main entry point of the PPO algorithm example is:
`main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py>`_.
In this tutorial, we will go through the code architecture in `main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py>`_.
Define the data
---------------
Users need to preprocess and store the dataset in parquet files,
and we implement ``RLHFDataset`` to load and tokenize the parquet files.
For ``RLHFDataset`` (the default), at least one field is required:
- ``prompt``: Contains the string prompt
We already provide some examples of processing datasets into parquet
files in the `data_preprocess directory <https://github.com/volcengine/verl/blob/main/examples/data_preprocess>`_. Currently, we support
preprocessing of the GSM8K, MATH, Hellaswag and Full_hh_rlhf datasets. See :doc:`../preparation/prepare_data` for
more information.
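
As a rough illustration, a parquet file with the required ``prompt`` field could be produced as follows. This is a minimal sketch using pandas (requires pyarrow or fastparquet); the extra ``data_source`` column is only an assumption that mirrors how the reward manager later selects scoring functions.

.. code:: python

   import pandas as pd

   # Build a tiny toy dataset with the required ``prompt`` field.
   rows = [
       {"prompt": "Natalia sold clips to 48 of her friends. How many clips did she sell?",
        "data_source": "openai/gsm8k"},
       {"prompt": "What is 12 * 7?",
        "data_source": "openai/gsm8k"},
   ]

   # RLHFDataset loads parquet files, so save the dataframe in that format.
   pd.DataFrame(rows).to_parquet("train.parquet", index=False)
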
Define the reward functions for different datasets
--------------------------------------------------
In this main entry point, the users only need to define their own reward
function based on the datasets (or applications) utilized in PPO
training.
For example, we already provide reward functions for `GSM8k <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py>`_
and `MATH <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/math.py>`_
datasets in ``_select_rm_score_fn``. In the ``RewardManager``, we
select the corresponding reward function based on the ``data_source``
field and compute the reward score. For some RLHF datasets (e.g.,
full_hh_rlhf), a reward model is utilized to assess the responses
without any reward function. In this case, the ``RewardManager`` will
return the ``rm_score`` computed by the reward model directly.
See `reward functions <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_ for detailed implementation.
Define worker classes
---------------------
.. code:: python
if config.actor_rollout_ref.actor.strategy in {"fsdp", "fsdp2"}: # for FSDP backend
assert config.critic.strategy in {"fsdp", "fsdp2"}
from verl.workers.fsdp_workers import ActorRolloutRefWorker, CriticWorker
from verl.single_controller.ray import RayWorkerGroup
ray_worker_group_cls = RayWorkerGroup
elif config.actor_rollout_ref.actor.strategy == 'megatron': # for Megatron backend
assert config.actor_rollout_ref.actor.strategy == config.critic.strategy
from verl.workers.megatron_workers import ActorRolloutRefWorker, CriticWorker
from verl.single_controller.ray.megatron import NVMegatronRayWorkerGroup
ray_worker_group_cls = NVMegatronRayWorkerGroup # Ray worker class for Megatron-LM
else:
raise NotImplementedError
from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role
role_worker_mapping = {
Role.ActorRollout: ActorRolloutRefWorker,
Role.Critic: CriticWorker,
Role.RefPolicy: ActorRolloutRefWorker
}
global_pool_id = 'global_pool'
resource_pool_spec = {
global_pool_id: [config.trainer.n_gpus_per_node] * config.trainer.nnodes,
}
mapping = {
Role.ActorRollout: global_pool_id,
Role.Critic: global_pool_id,
Role.RefPolicy: global_pool_id,
}
Step 1: Construct the mapping between roles and workers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A role represents a group of workers in the same process. We have
pre-defined several roles in `ray_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py#L38>`_.
.. code:: python
class Role(Enum):
"""
To create more roles dynamically, you can subclass Role and add new members
"""
Actor = 0 # This worker only has Actor
Rollout = 1 # This worker only has Rollout
ActorRollout = 2 # This worker has both actor and rollout, it's a HybridEngine
Critic = 3 # This worker only has critic
RefPolicy = 4 # This worker only has reference policy
RewardModel = 5 # This worker only has reward model
ActorRolloutRef = 6 # This worker contains actor, rollout and reference policy simultaneously
Step 2: Define the worker class corresponding to this role
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- We have pre-implemented the ``ActorRolloutRefWorker``. Through
  different configs, it can be a standalone actor, a standalone rollout,
  an ActorRollout HybridEngine, or an ActorRolloutRef HybridEngine.
- We also pre-implemented workers for ``Actor``, ``Rollout``,
  ``Critic``, ``Reward Model`` and ``Reference model`` on two different
  backends: PyTorch FSDP
  and Megatron-LM.
See `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_
and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/workers/megatron_workers.py>`_
for more information.
Step 3: Define resource pool id and resource pool spec
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- A resource pool is a division of the global GPU resources;
  ``resource_pool_spec`` is a dict mapping a pool id to the number of GPUs on each node.
- In the above example, we defined a global resource pool,
  global_pool_id, and then put all roles on this one resource pool
  with all the GPUs in this post-training task. This refers to the
  *co-locate* placement, where all the models share the same set of
  GPUs.
- See resource pool and placement for advanced usage.
Defining reward model/function
------------------------------
.. code:: python
# we should adopt a multi-source reward function here
# - for rule-based rm, we directly call a reward score
# - for model-based rm, we call a model
# - for code related prompt, we send to a sandbox if there are test cases
# - finally, we combine all the rewards together
# - The reward type depends on the tag of the data
if config.reward_model.enable:
from verl.workers.fsdp_workers import RewardModelWorker
role_worker_mapping[Role.RewardModel] = RewardModelWorker
mapping[Role.RewardModel] = global_pool_id
reward_fn = RewardManager(tokenizer=tokenizer, num_examine=0)
# Note that we always use function-based RM for validation
val_reward_fn = RewardManager(tokenizer=tokenizer, num_examine=1)
resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)
Since not all tasks use a model-based RM, users need to define here
whether it is a model-based RM or a function-based RM.

- If it's a model-based RM, directly add the ``RewardModel`` role to the
  role-worker mapping and to the resource pool mapping.
- Note that the pre-defined ``RewardModelWorker`` only supports models
  with the structure of huggingface
  ``AutoModelForSequenceClassification``. If your model is not of this type, you
  need to define your own RewardModelWorker in `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_
  and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/workers/megatron_workers.py>`_.
- If it's a function-based RM, users are required to specify the
  reward function for each dataset.
.. code:: python
def _select_rm_score_fn(data_source):
if data_source == 'openai/gsm8k':
return gsm8k.compute_score
elif data_source == 'lighteval/MATH':
return math.compute_score
else:
raise NotImplementedError
See reward functions implemented in `directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/>`_
for more information.
Define, init and run the PPO Trainer
------------------------------------
.. code:: python
trainer = RayPPOTrainer(config=config,
tokenizer=tokenizer,
role_worker_mapping=role_worker_mapping,
resource_pool_manager=resource_pool_manager,
ray_worker_group_cls=ray_worker_group_cls,
reward_fn=reward_fn,
val_reward_fn=val_reward_fn)
trainer.init_workers()
trainer.fit()
- We first initialize the ``RayPPOTrainer`` with the user config, tokenizer
  and all the above worker mappings, resource pools, worker group classes and
  reward functions.
- We then call ``trainer.init_workers()`` to initialize the models
  on the allocated GPUs (in the resource pool).
- The actual PPO training is executed in ``trainer.fit()``.
verl can be easily extended to other RL algorithms by reusing the Ray
model workers, resource pool and reward functions. See :doc:`extension<../advance/dpo_extension>` for
more information.
Details of the ``RayPPOTrainer`` are discussed in :doc:`Ray Trainer<../workers/ray_trainer>`.
Sandbox Fusion Example
============================
Last updated: 06/27/2025.
Introduction
------------
Sandbox Fusion is a remote code sandbox service that provides a secure environment for running and evaluating code generated by Large Language Models (LLMs). This example demonstrates how to train an LLM and use Sandbox Fusion to verify generated code, enhancing both security and performance.
By leveraging a remote code sandbox service with greater CPU resources for concurrent code verification, you can reduce the reward stage time by 10-30%, depending on the quality of the generated code.
Step 1: Prepare the Dataset
---------------------------
We use the Eurus-2-RL-Data dataset for training. This dataset combines math and code questions, making it suitable for LLM training tasks. You can download it from HuggingFace: `Eurus-2-RL-Data Dataset <https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data>`_.
Step 2: Set Up the Sandbox Fusion Service
-----------------------------------------
Sandbox Fusion is a remote code sandbox service designed to securely run and evaluate LLM-generated code. To use it:
1. **Access Full Documentation**: For detailed setup instructions, refer to the `Sandbox Fusion Documentation <https://bytedance.github.io/SandboxFusion/>`_.
2. **Deploy the Service**: Choose one of the following deployment methods:
- **Local Deployment**: Follow the guide `here <https://bytedance.github.io/SandboxFusion/docs/docs/get-started#local-deployment>`_.
- **FaaS Instance (Volcengine)**: Create an instance using the `Volcengine Documentation <https://www.volcengine.com/docs/6662/1539235>`_.
After deployment, you will receive an API endpoint in the format: ``https://<ip-address-or-domain-name>/run_code``.
Step 3: Configure the Training Script
-------------------------------------
To integrate Sandbox Fusion into your training script, configure the following parameters:
**Key Settings for Sandbox Fusion**
- ``reward_model.sandbox_fusion.url='<API-endpoint>'``: Enable Sandbox Fusion by specifying the API endpoint (must end with ``/run_code``).
- ``reward_model.sandbox_fusion.max_concurrent=256``: Set the maximum number of concurrent API requests to the Sandbox Fusion service.
- ``reward_model.sandbox_fusion.memory_limit_mb=1024``: Set the memory limit (in MB) for each sandbox instance. Defaults to 1024MB if not specified.
**Additional Optimization**
To further reduce code verification time, enable parallel processing with:
- ``reward_model.reward_manager=prime``: The Prime reward manager verifies code across multiple subprocesses concurrently.
**Example Script**
For a practical implementation, refer to the example script:
``examples/ppo_trainer/run_deepseek7b_llm_sandbox_fusion.sh``
Once you’ve set your API endpoint in the script, you can start the training job.
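
As an illustration, the Sandbox Fusion settings above are passed as ordinary config overrides in the run script. The endpoint below is a placeholder and the elided arguments (``...``) stand for the rest of your usual PPO command.

.. code:: bash

   python3 -m verl.trainer.main_ppo \
       ... \
       reward_model.reward_manager=prime \
       reward_model.sandbox_fusion.url='https://<your-endpoint>/run_code' \
       reward_model.sandbox_fusion.max_concurrent=256 \
       reward_model.sandbox_fusion.memory_limit_mb=1024
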
Frequently Asked Questions
====================================
Last updated: 06/25/2025.
Ray related
------------
How to add breakpoint for debugging with distributed Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please checkout the official debugging guide from Ray: https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html
"Unable to register worker with raylet"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This issue is caused by system settings, e.g., SLURM adding constraints on how CPUs are shared on a node.
While `ray.init()` tries to launch as many worker processes as there are CPU cores on the machine,
SLURM constraints prevent the `core-workers` from seeing the `raylet` process, leading to this problem.
To fix this issue, set the config term ``ray_init.num_cpus`` to a number allowed by your system.
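
For example, the override can be appended to the training command like any other config field (the value 32 is only illustrative):

.. code:: bash

   python3 -m verl.trainer.main_ppo \
       ... \
       ray_init.num_cpus=32
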
Distributed training
------------------------
How to run multi-node post-training with Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can start a ray cluster and submit a ray job, following the official guide from Ray: https://docs.ray.io/en/latest/ray-core/starting-ray.html
Then, in the configuration, set the ``trainer.nnodes`` config to the number of machines for your job.
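
A minimal sketch of this workflow is shown below. The addresses and override values are placeholders; see the Ray guide above for the authoritative commands.

.. code:: bash

   # on the head node
   ray start --head --dashboard-host=0.0.0.0

   # on each worker node
   ray start --address=<head-node-ip>:6379

   # submit the training job from a machine that can reach the head node
   ray job submit --address=http://<head-node-ip>:8265 \
       -- python3 -m verl.trainer.main_ppo trainer.nnodes=2 trainer.n_gpus_per_node=8 ...
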
How to use verl on a Slurm-managed cluster?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ray provides users with `this <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ official
tutorial to start a Ray cluster on top of Slurm. We have verified the :doc:`GSM8K example<../examples/gsm8k_example>`
on a Slurm cluster under a multi-node setting with the following steps.
1. [Optional] If your cluster supports `Apptainer or Singularity <https://apptainer.org/docs/user/main/>`_ and you wish
to use it, convert verl's Docker image to an Apptainer image. Alternatively, set up the environment with the package
manager available on your cluster or use other container runtimes (e.g. through `Slurm's OCI support <https://slurm.schedmd.com/containers.html>`_) available to you.
.. code:: bash
apptainer pull /your/dest/dir/vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3.sif docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
2. Follow :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints.
3. Modify `examples/slurm/ray_on_slurm.slurm <https://github.com/volcengine/verl/blob/main/examples/slurm/ray_on_slurm.slurm>`_ with your cluster's own information.
4. Submit the job script to the Slurm cluster with `sbatch`.
Please note that Slurm cluster setup may vary. If you encounter any issues, please refer to Ray's
`Slurm user guide <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ for common caveats.
If you changed Slurm resource specifications, please make sure to update the environment variables in the job script if necessary.
Install related
------------------------
NotImplementedError: TensorDict does not support membership checks with the `in` keyword.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Detail error information:
.. code:: bash
NotImplementedError: TensorDict does not support membership checks with the `in` keyword. If you want to check if a particular key is in your TensorDict, please use `key in tensordict.keys()` instead.
Cause of the problem: there is no suitable version of the tensordict package for the linux-arm64 platform. You can confirm this as follows:
.. code:: bash
pip install tensordict==0.6.2
Output example:
.. code:: bash
ERROR: Could not find a version that satisfies the requirement tensordict==0.6.2 (from versions: 0.0.1a0, 0.0.1b0, 0.0.1rc0, 0.0.2a0, 0.0.2b0, 0.0.3, 0.1.0, 0.1.1, 0.1.2, 0.8.0, 0.8.1, 0.8.2, 0.8.3)
ERROR: No matching distribution found for tensordict==0.6.2
Solution 1:

Install tensordict from source:
.. code:: bash
pip uninstall tensordict
git clone https://github.com/pytorch/tensordict.git
cd tensordict/
git checkout v0.6.2
python setup.py develop
pip install -v -e .
Solution 2:

Temporarily modify the code where the error occurs: replace ``tensordict_var`` with ``tensordict_var.keys()``.
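
For example, a membership check that triggers the error can be rewritten as follows (the ``"input_ids"`` key and tensor shapes are just illustrative):

.. code:: python

   import torch
   from tensordict import TensorDict

   td = TensorDict({"input_ids": torch.zeros(2, 8, dtype=torch.long)}, batch_size=[2])

   # Fails on affected tensordict versions:
   # if "input_ids" in td: ...

   # Works on all versions:
   if "input_ids" in td.keys():
       print("found input_ids")
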
Illegal memory access
---------------------------------
If you encounter the error message like ``CUDA error: an illegal memory access was encountered`` during rollout, please check the vLLM documentation for troubleshooting steps specific to your vLLM version.
Checkpoints
------------------------
If you want to convert the model checkpoint into huggingface safetensor format, please refer to ``verl/model_merger``.
Triton ``compile_module_from_src`` error
------------------------------------------------
If you encounter a Triton compilation error similar to the stacktrace below, please set the ``use_torch_compile`` flag as described in
https://verl.readthedocs.io/en/latest/examples/config.html to disable just-in-time compilation of the fused kernels.
.. code:: bash
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 338, in run
return self.fn.run(*args, **kwargs)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
device = driver.active.get_current_device()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
File "/data/lbh/conda_envs/verl/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
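
For example, the flag can be passed as an override on the training command. The exact key path may differ depending on your config; double-check it against the config reference linked above.

.. code:: bash

   python3 -m verl.trainer.main_ppo \
       ... \
       actor_rollout_ref.actor.use_torch_compile=False \
       actor_rollout_ref.ref.use_torch_compile=False
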
What is the meaning of train batch size, mini batch size, and micro batch size?
------------------------------------------------------------------------------------------
This figure illustrates the relationship between different batch size configurations.
https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA
.. image:: https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d
How to generate ray timeline to analyse performance of a training job?
------------------------------------------------------------------------------------------
To generate the ray timeline file, you can set the config term ``ray_init.timeline_file`` to a json file path.
For example:
.. code:: bash
ray_init.timeline_file=/tmp/ray_timeline.json
The file will be generated in the specified path at the end of a training job.
You can use tools like chrome://tracing or the Perfetto UI to view the ray timeline file.
This figure shows the ray timeline file generated from a training job on 1 node with 4 GPUs.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray_timeline.png?raw=true
How to set proxy only for wandb?
------------------------------------------------------------------------------------------
If you need a proxy to access wandb, you can add the config below to your training job script.
Compared with setting a global ``https_proxy`` environment variable, this approach won't interfere with other HTTP requests, such as those made by the ChatCompletionScheduler.
.. code:: bash
+trainer.wandb_proxy=http://<your proxy and port>
=========================================================
HybridFlow Programming Guide
=========================================================
Last updated: 06/02/2025.
.. _vermouth: https://github.com/vermouth1992
Author: `Chi Zhang <https://github.com/vermouth1992>`_
verl is an open source implementation of the paper `HybridFlow <https://arxiv.org/abs/2409.19256v2>`_ [1]_. In this section, we will introduce the basic concepts of HybridFlow, the motivation and how to program with verl APIs.
Motivation and Design
------------------------
We use dataflow to represent RL systems [4]_.
DataFlow
~~~~~~~~~~~~~~~~~~~~
Dataflow is an abstraction of computation. Neural network training is a typical dataflow, which can be represented by a computational graph.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/dataflow.jpeg?raw=true
:alt: The dataflow graph from CS231n 2024 lecture 4
This figure [2]_ represents the computation graph of a polynomial function followed by a sigmoid function. In the data flow of neural network computation, each node represents an operator, and each edge represents the direction of forward/backward propagation. The computation graph determines the architecture of the neural network.
RL as a dataflow problem
++++++++++++++++++++++++++++++++++++++++++++++
Reinforcement learning (RL) training can also be represented as a dataflow. Below is the dataflow graph that represents the PPO algorithm used in RLHF [3]_:
.. image:: https://picx.zhimg.com/70/v2-cb8ab5ee946a105aab6a563e92682ffa_1440w.avis?source=172ae18b&biz_tag=Post
:alt: PPO dataflow graph, credit to Zhihu 低级炼丹师
However, the dataflow of RL has fundamental differences compared with the dataflow of neural network training, as summarized below:
+--------------------------+--------------------------------------------------+---------------------+
| Workload | Node | Edge |
+--------------------------+--------------------------------------------------+---------------------+
| Neural Network Training | Operator (+/-/matmul/softmax) | Tensor movement |
+--------------------------+--------------------------------------------------+---------------------+
| Reinforcement Learning | High-level operators (rollout/model forward) | Data Movement |
+--------------------------+--------------------------------------------------+---------------------+
In the case of tabular reinforcement learning, each operator is a simple scalar math operation (e.g., a Bellman update). In deep reinforcement learning (DRL), each operator is a high-level neural network computation such as model inference/update. This makes RL a two-level dataflow problem:
- Control flow: defines how the high-level operators are executed (e.g., in PPO, we first perform rollout, then advantage computation, and finally training). It expresses the **core logic of RL algorithms**.
- Computation flow: defines the dataflow of **neural network computation** (e.g., model forward/backward/optimizer).
Design Choices
~~~~~~~~~~~~~~~~~~~~
The model size used in DRL before the LLM era is typically small. Thus, the high-level neural network computation can be done in a single process. This enables embedding the computation flow inside the control flow as a single process.
However, in the LLM era, the computation flow (e.g., training neural network) becomes a multi-process program. This naturally leads to two design choices:
1. Convert the control flow into a multi-process program as well. Then colocate with computation flow (unified multi-controller)
- Advantages:
- Achieves the **optimal performance** under fixed computation flow and control flow as the communication overhead in both training and data transfer is minimized.
- Disadvantages:
- The computation and/or control flow is **hard to reuse** from a software perspective, as the computation code is coupled with specific controller code. For example, the training loop of PPO is generic. Say we have a PPO training flow implemented with a specific computation flow such as FSDP. Neither the control flow nor the computation flow can be reused if we want to switch the computation flow from FSDP to Megatron, due to the coupling of control and computation flows.
- Requires more effort from the user under flexible and dynamic control flows, due to the multi-process nature of the program.
2. Separate the flows: single process for the control flow and multi-process for computation flow
- Advantages:
- The computation flow defined elsewhere can be **easily reused** after the decoupling.
- The controller runs on a single process. Implementing a new RL algorithm with a **different control flow is simple and easy**.
- Disadvantages:
- Additional **data communication overhead** each time the controller process and computation processes interact. The data has to be sent back and forth.
In verl, the latter strategy with separate control flow and computation flow is adopted. verl is designed to decouple the control flow of RL algorithms, and the implementation of computation engines.
Overall Execution Diagram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Below is a simplified diagram denoting the execution of a reinforcement learning job. In the diagram, the controller runs on a single process, while the generator/actor workers, critic workers run on multiple processes, placed with specific resource groups. For rollout, the controller passes the data to the generator to perform sample generation. When the rollout is done, the data is passed back to controller for the next step of the algorithm. Similar execution is done for other workers. With the hybrid controller design, the data flow and computation is decoupled to provide both efficiency in computation and flexibility in defining algorithm training loops.
.. figure:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/driver_worker.png?raw=true
:alt: The execution diagram
Codebase walkthrough (PPO)
------------------------------------------------
Entry function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Code: https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py
In this file, we define a remote function `main_task` that serves as the controller (driver) process as shown in the above figure. We also define a ``RewardManager``, where users can customize their reward function based on the data source in the dataset. Note that `RewardManager` should return the final token-level reward that is optimized by RL algorithms. Note that users can combine model-based rewards and rule-based rewards.
The ``main_task`` constructs a ``RayPPOTrainer`` instance and launches the fit. Note that ``main_task`` **runs as a single process**.

We highly recommend that ``main_task`` is NOT scheduled on the head of the ray cluster, because ``main_task`` will consume a lot of memory while the head usually has very few resources.
Ray trainer
~~~~~~~~~~~~~~~~~~~~
Code: https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py
The ``RayPPOTrainer`` manages:

- Worker and WorkerGroup construction
- The main loop of the PPO algorithm

Note that the ``fit`` function of ``RayPPOTrainer`` **runs as a single process**.
Worker and WorkerGroup construction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each WorkerGroup manages a list of workers that run remotely. Note that the worker group runs in the process of its constructor.
Each worker inside the WorkerGroup runs on a GPU. The worker group serves as a proxy for the controller process to interact with a list of workers, in order to perform certain computations. **In order to do so, we have to bind the methods of the worker into the method of the WorkerGroup and define the data dispatch and data collection**. This is done via simple decoration that will be introduced in the Worker definition section.
For example, in PPO, we define 3 worker groups:
- ActorRolloutRef: manages the actor, rollout and reference policy. ActorRolloutRefWorker can be instantiated as a single actor, a single rollout, a single reference policy, a combined actor/rollout or a combined actor/rollout/ref. This design aims at maximum code reuse in various scenarios. The reason for colocating actor and rollout is fast weight transfer using NCCL. The reason for colocating actor and reference is to implement an efficient LoRA PPO, since the reference policy is simply the base model of PPO in LoRA. The colocation is done via ``verl.single_controller.ray.base.create_colocated_worker_cls``, which creates a single ray remote class exposing all class methods of these roles.
- Critic: manages the critic model
- Reward: manages the reward model
The worker group will be constructed on the resource pool it designates. The resource pool is a set of GPUs in the ray cluster.
Worker definition
~~~~~~~~~~~~~~~~~~~~
.. _ActorRolloutRefWorker: https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py
We take `ActorRolloutRefWorker <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_ as an example.
The APIs it should expose to the controller process are:
- init_model: build the underlying model
- generate_sequences: given prompts, generate responses
- compute_log_prob: compute the log-probability of a generated sequence using actor
- compute_ref_log_prob: compute the log-probability of a generated sequence using reference policy
- save_checkpoint: save the checkpoint
Note that these methods are defined in the worker and can only be invoked via remote calls. For example, if the controller process wants to initialize the model, it has to call
.. code-block:: python
for worker in actor_rollout_ref_wg:
worker.init_model.remote()
If the controller process wants to generate sequences, it has to call
.. code-block:: python
data = xxx
# split the data into dp chunks
data_dp_lst = data.split(dp_size)
output_dp_lst = []
for i, worker in enumerate(actor_rollout_ref_wg):
output_future = worker.generate_sequences.remote(data_dp_lst[i])
output_dp_lst.append(output_future)
output = torch.cat(ray.get(output_dp_lst), dim=0)
We observe that a call from the controller process to a worker group method can generally be divided into 3 parts:
- Split the data into data parallel sizes
- Dispatch the corresponding data into each worker
- Collect and concatenate the data when the computation finishes
In verl, we design syntactic sugar to encapsulate these 3 steps into a single call from the controller process.
.. code-block:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def generate_sequences(data):
...
# on the driver
output = actor_rollout_ref_wg.generate_sequences(data)
We decorate the method of the worker with a ``register`` that explicitly defines how the input data should be split and dispatched to each worker, and how the output data should be collected and concatenated by the controller. For example, ``Dispatch.DP_COMPUTE_PROTO`` splits the input data into dp chunks, dispatches one chunk to each worker, then collects and concatenates the outputs. Note that this function requires the input and output to be a DataProto defined here (https://github.com/volcengine/verl/blob/main/verl/protocol.py).
PPO main loop
~~~~~~~~~~~~~~~~~~~~
With the aforementioned APIs, we can implement the main loop of PPO as if it were a single-process program:
.. code-block:: python
for prompt in dataloader:
output = actor_rollout_ref_wg.generate_sequences(prompt)
old_log_prob = actor_rollout_ref_wg.compute_log_prob(output)
ref_log_prob = actor_rollout_ref_wg.compute_ref_log_prob(output)
values = critic_wg.compute_values(output)
rewards = reward_wg.compute_scores(output)
# compute_advantages is running directly on the control process
advantages = compute_advantages(values, rewards)
output = output.union(old_log_prob)
output = output.union(ref_log_prob)
output = output.union(values)
output = output.union(rewards)
output = output.union(advantages)
# update actor
actor_rollout_ref_wg.update_actor(output)
critic_wg.update_critic(output)
Takeaways
~~~~~~~~~~~~~~~~~~~~
- This programming paradigm enables users to use different computation backends without modification of the control process.
- This programming paradigm enables flexible placement (by changing the mapping of WorkerGroup and ResourcePool) without modification of the control process.
Repository organization
------------------------------------------------
Important code files in the repository are organized as below:
.. code-block:: bash
verl # the verl package
trainer
main_ppo.py # the entrypoint for RL training
ppo
ray_trainer.py # the training loop for RL algorithms such as PPO
fsdp_sft_trainer.py # the SFT trainer with FSDP backend
config
generation.yaml # configuration template for rollout
ppo_trainer.yaml # configuration template for the RL trainer
workers
protocol.py # the interface of DataProto
fsdp_workers.py # the FSDP worker interfaces: ActorRolloutRefWorker, CriticWorker, RewardModelWorker
megatron_workers.py # the Megatron worker interfaces: ActorRolloutRefWorker, CriticWorker, RewardModelWorker
actor
dp_actor.py # data parallel actor with FSDP backend
megatron_actor.py # nD parallel actor with Megatron backend
critic
dp_critic.py # data parallel critic with FSDP backend
megatron_critic.py # nD parallel critic with Megatron backend
reward_model
megatron
reward_model.py # reward model with Megatron backend
rollout
vllm
vllm_rollout.py # rollout with vllm backend
hf_rollout.py # rollout with huggingface TGI backend
sharding_manager
fsdp_ulysses.py # data and model resharding when using FSDP + ulysses
fsdp_vllm.py # data and model resharding when using FSDP + ulysses + vllm
megatron_vllm.py # data and model resharding when using Megatron + vllm
utils
dataset # datasets for SFT/RM/RL
reward_score # function based reward
gsm8k.py # reward function for gsm8k dataset
math.py # reward function for math dataset
seqlen_balancing.py # the sequence balance optimization
models
llama # Megatron implementation for llama, deepseek, mistral, etc
transformers # ulysses integration with transformer models such as llama, qwen, etc
weight_loader_registry.py # registry of weight loaders for loading hf ckpt into Megatron
third_party
vllm # adaptor for vllm's usage in RL
vllm_spmd # vllm >= v0.7 adaptor
examples # example scripts
tests # integration and unit tests
.github # the configuration of continuous integration tests
.. [1] HybridFlow: A Flexible and Efficient RLHF Framework: https://arxiv.org/abs/2409.19256v2
.. [2] Data flow graph credit to CS231n 2024 lecture 4: https://cs231n.stanford.edu/slides/2024/lecture_4.pdf
.. [3] PPO dataflow graph credit to 低级炼丹师 from Zhihu​: https://zhuanlan.zhihu.com/p/635757674
.. [4] RLFlow
Welcome to verl's documentation!
================================================
verl is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <https://arxiv.org/pdf/2409.19256>`_ paper.
verl is flexible and easy to use with:
- **Easy extension of diverse RL algorithms**: The hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in a few lines of code.
- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM, vLLM and SGLang. Moreover, users can easily extend to other LLM training and inference frameworks.
- **Flexible device mapping and parallelism**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.
- Ready integration with popular HuggingFace models
verl is fast with:
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, verl achieves high generation and training throughput.
- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
--------------------------------------------
.. _Contents:
.. toctree::
:maxdepth: 2
:caption: Quickstart
start/install
start/quickstart
start/multinode
start/ray_debug_tutorial
start/more_resources
start/agentic_rl
.. toctree::
:maxdepth: 2
:caption: Programming guide
hybrid_flow
single_controller
.. toctree::
:maxdepth: 1
:caption: Data Preparation
preparation/prepare_data
preparation/reward_function
.. toctree::
:maxdepth: 2
:caption: Configurations
examples/config
.. toctree::
:maxdepth: 1
:caption: PPO Example
examples/ppo_code_architecture
examples/gsm8k_example
examples/multi_modal_example
.. toctree::
:maxdepth: 1
:caption: Algorithms
algo/ppo.md
algo/grpo.md
algo/dapo.md
algo/spin.md
algo/sppo.md
algo/entropy.md
algo/opo.md
algo/baseline.md
algo/gpg.md
.. toctree::
:maxdepth: 1
:caption: PPO Trainer and Workers
workers/ray_trainer
workers/fsdp_workers
workers/megatron_workers
workers/sglang_worker
.. toctree::
:maxdepth: 1
:caption: Performance Tuning Guide
perf/dpsk.md
perf/perf_tuning
README_vllm0.8.md
perf/device_tuning
perf/nsight_profiling.md
.. toctree::
:maxdepth: 1
:caption: Adding new models
advance/fsdp_extension
advance/megatron_extension
.. toctree::
:maxdepth: 1
:caption: Advanced Features
advance/checkpoint
advance/rope
advance/ppo_lora.rst
sglang_multiturn/multiturn.rst
sglang_multiturn/interaction_system.rst
advance/placement
advance/dpo_extension
examples/sandbox_fusion_example
advance/rollout_trace.rst
advance/one_step_off
advance/agent_loop
.. toctree::
:maxdepth: 1
:caption: Hardware Support
amd_tutorial/amd_build_dockerfile_page.rst
amd_tutorial/amd_vllm_page.rst
ascend_tutorial/ascend_quick_start.rst
ascend_tutorial/ascend_profiling.rst
ascend_tutorial/ascend_profiling_en.rst
.. toctree::
:maxdepth: 1
:caption: API References
api/data
api/single_controller.rst
api/trainer.rst
api/utils.rst
.. toctree::
:maxdepth: 2
:caption: FAQ
faq/faq
.. toctree::
:maxdepth: 1
:caption: Development Notes
sglang_multiturn/sandbox_fusion.rst
Contribution
-------------
verl is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub <https://github.com/volcengine/verl>`_, `Slack <https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA>`_ and `Wechat <https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG>`_ for discussions.
Contributions from the community are welcome! Please check out our `project roadmap <https://github.com/volcengine/verl/issues/710>`_ and `good first issues <https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22>`_ to see where you can contribute.
Code Linting and Formatting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We use pre-commit to help improve code quality. To initialize pre-commit, run:
.. code-block:: bash
pip install pre-commit
pre-commit install
To resolve CI errors locally, you can also manually run pre-commit by:
.. code-block:: bash
pre-commit run
Adding CI tests
^^^^^^^^^^^^^^^^^^^^^^^^
If possible, please add CI test(s) for your new feature:
1. Find the most relevant workflow yml file, which usually corresponds to a ``hydra`` default config (e.g. ``ppo_trainer``, ``ppo_megatron_trainer``, ``sft_trainer``, etc).
2. Add related path patterns to the ``paths`` section if not already included.
3. Minimize the workload of the test script(s) (see existing scripts for examples).
We are HIRING! Send us an `email <mailto:haibin.lin@bytedance.com>`_ if you are interested in internship/FTE opportunities in MLSys/LLM reasoning/multimodal alignment.
Hardware Resource Needed for RL
===============================
Last updated: 06/25/2025.
Since RL requires more resources than regular training,
determining how many resources are needed to successfully run a job before training
is a relatively difficult task. To provide more people with reference points for
resource selection when dealing with different models and tasks, this section
introduces the hardware requirements based on experiments
we have conducted.
However, due to limited staff and equipment resources, we also hope for more
contributions from the open-source community. When submitting a PR, it is necessary
to provide a script to be added to the example/tuning scripts.
We need two types of scripts: one is a configuration that can run with the **minimum
resources (min)**, and the other is a configuration that runs with **recommended resources (recommended)**. The former
can be understood as a script that can run after applying all memory optimization techniques
(e.g., offload, gradient checkpointing). The latter can be understood as a script that
runs while avoiding operations that incur additional time overhead as much as possible (targeting best throughput).
When defining script names, please follow this format:
``[model]_[task]_[gpunums]_[device]_[train]_[infer].sh``. This will effectively improve
the script's recognizability. You can place the script under the ``examples/tuning/`` directory.
If you happen to have a configuration that has already been tested, we welcome you to submit
a PR and include a screenshot from Wandb or other verifiable evidence.
----------------------------------------
0.5B
~~~~
.. list-table::
:widths: auto
:header-rows: 1
* - Tag
- Model
- Task
- Resource
- MaxBatch
- Train
- Infer
- Link
- Contributor
* - MIN
- Qwen2.5-0.5B
- GRPO-LoRA
- 1*H100
- 116
- fsdp
- vllm0.8.3
- `qwen2-0.5b_grpo-lora_1_h100_fsdp_vllm.sh <https://github.com/volcengine/verl/blob/main/examples/tuning/0.5b/qwen2-0.5b_grpo-lora_1_h100_fsdp_vllm.sh>`_
- `SimonHuang <thelongestusernameofall@gmail.com>`_
1.5B
~~~~
.. list-table::
:widths: auto
:header-rows: 1
* - Tag
- Model
- Task
- Resource
- MaxBatch
- Train
- Infer
- Link
- Contributor
* - MIN
- Qwen2.5-1.5B
- GRPO-LoRA
- 1*H100
- 128
- fsdp
- vllm0.8.3
- `qwen2-1.5b_grpo-lora_1_h100_fsdp_vllm.sh <https://github.com/volcengine/verl/blob/main/examples/tuning/1.5b/qwen2-1.5b_grpo-lora_1_h100_fsdp_vllm.sh>`_
- `SimonHuang <thelongestusernameofall@gmail.com>`_
3B
~~~
.. list-table::
:widths: auto
:header-rows: 1
* - Tag
- Model
- Task
- Resource
- MaxBatch
- Train
- Infer
- Link
- Contributor
* - MIN
- Qwen2.5-3B
- GRPO-LoRA
- 1*H100
- 62
- fsdp
- vllm0.8.3
- `qwen2-3b_grpo-lora_1_h100_fsdp_vllm.sh <https://github.com/volcengine/verl/blob/main/examples/tuning/3b/qwen2-3b_grpo-lora_1_h100_fsdp_vllm.sh>`_
- `SimonHuang <thelongestusernameofall@gmail.com>`_
7B
~~~
.. list-table::
:widths: auto
:header-rows: 1
* - Tag
- Model
- Task
- Resource
- MaxBatch
- Train
- Infer
- Link
- Contributor
* - MIN
- Qwen2-7B
- GRPO
- 2*H800
- \
- fsdp
- vllm0.8.2
- `qwen2-7b_grpo_2_h800_fsdp_vllm <https://github.com/volcengine/verl/blob/main/examples/tuning/7b/qwen2-7b_grpo_2_h800_fsdp_vllm.sh>`_
- `Xiangyongan <xiangyongan@bytedance.com>`_
* - MIN
- Qwen2.5-7B
- GRPO-LoRA
- 1*H100
- 16
- fsdp
- vllm0.8.3
- `qwen2-7b_grpo-lora_1_h100_fsdp_vllm.sh <https://github.com/volcengine/verl/blob/main/examples/tuning/7b/qwen2-7b_grpo-lora_1_h100_fsdp_vllm.sh>`_
- `SimonHuang <thelongestusernameofall@gmail.com>`_
14B
~~~
.. list-table::
:widths: auto
:header-rows: 1
* - Tag
- Model
- Task
- Resource
- MaxBatch
- Train
- Infer
- Link
- Contributor
* - MIN
- Qwen2-14B
- GRPO
- 4*H800
- \
- fsdp
- vllm0.8.2
- `qwen2-14b_grpo_4_h800_fsdp_vllm <https://github.com/volcengine/verl/blob/main/examples/tuning/14b/qwen2-14b_grpo_4_h800_fsdp_vllm.sh>`_
- `Xiangyongan <xiangyongan@bytedance.com>`_
* - MIN
- Qwen2.5-14B
- GRPO-LoRA
- 2*H100
- 116
- fsdp
- vllm0.8.3
- `qwen2-14b_grpo-lora_2_h100_fsdp_vllm.sh <https://github.com/volcengine/verl/blob/main/examples/tuning/14b/qwen2-14b_grpo-lora_2_h100_fsdp_vllm.sh>`_
- `SimonHuang <thelongestusernameofall@gmail.com>`_
32B
~~~
.. list-table::
:widths: auto
:header-rows: 1
* - Tag
- Model
- Task
- Resource
- MaxBatch
- Train
- Infer
- Link
- Contributor
* - MIN
- Qwen2-32B
- GRPO
- 8*H20
- \
- megatron
- vllm0.8.2
- `qwen2-32b_grpo_8_h20_megatron_vllm <https://github.com/volcengine/verl/tree/main/examples/tuning/32b/qwen2_32B_grpo_8_h20_megatron_vllm.sh>`_
- `Xiangyongan <xiangyongan@bytedance.com>`_
* - MIN
- Qwen2.5-32B
- GRPO-LoRA
- 4*H100
- 180
- fsdp
- vllm0.8.3
- `qwen2-32b_grpo-lora_4_h100_fsdp_vllm.sh <https://github.com/volcengine/verl/blob/main/examples/tuning/32b/qwen2-32b_grpo-lora_4_h100_fsdp_vllm.sh>`_
- `SimonHuang <thelongestusernameofall@gmail.com>`_
70B
~~~
.. list-table::
:widths: auto
:header-rows: 1
* - Tag
- Model
- Task
- Resource
- MaxBatch
- Train
- Infer
- Link
- Contributor
* - MIN
- Qwen2-70B
- GRPO
- 32*H20
- \
- fsdp
- vllm0.8.2
- `qwen2-70b_grpo_32_h20_fsdp_vllm <https://github.com/volcengine/verl/blob/main/examples/tuning/70b/qwen2-70b_grpo_32_h20_fsdp_vllm.sh>`_
- `Xiangyongan <xiangyongan@bytedance.com>`_
* - MIN
- Qwen2-70B
- GRPO
- 32*H800
- \
- fsdp
- vllm0.8.3
- `qwen2-70b_grpo_32_h800_fsdp_vllm <https://github.com/volcengine/verl/blob/main/examples/tuning/70b/qwen2-70b_grpo_32_h800_fsdp_vllm.sh>`_
- `Xiangyongan <xiangyongan@bytedance.com>`_
* - MIN
- Qwen2.5-72B
- GRPO-LoRA
- 8*H100
- 176
- fsdp
- vllm0.8.3
- `qwen2-72b_grpo-lora_8_h100_fsdp_vllm.sh <https://github.com/volcengine/verl/blob/main/examples/tuning/70b/qwen2-72b_grpo-lora_8_h100_fsdp_vllm.sh>`_
- `SimonHuang <thelongestusernameofall@gmail.com>`_
405B
~~~~
.. table::
:widths: auto
====== ====== ====== ======== ======== ====== ====== ======
tag model task resource MaxBatch train infer link
====== ====== ====== ======== ======== ====== ====== ======
\ \ \ \ \ \ \
====== ====== ====== ======== ======== ====== ====== ======
671B
~~~~
.. table::
:widths: auto
====== ====== ====== ======== ======== ====== ====== ======
tag model task resource MaxBatch train infer link
====== ====== ====== ======== ======== ====== ====== ======
\ \ \ \ \ \ \
====== ====== ====== ======== ======== ====== ====== ======
# Training DeepSeek 671b
Last updated: 06/13/2025.
verl integrates Megatron to support large MoE models such as `Qwen3-235B-A22B` and `deepseek-ai/DeepSeek-V3`. This is an ongoing community effort.
Along the way, the community added the following features and optimizations that enable verl to work with larger models:
- per tensor weight resharding between rollout and training
- context parallelism and expert parallelism enabled via megatron
- dynamic batch size (sequence balance) for megatron
- reduced ray-related serialization overhead
- optimizer offloading, recomputation, and efficient kernels
- various debugging metrics and utils
The Megatron backend now supports a wider list of models:
- DeepSeek-V3
- Moonlight
- Qwen3
- Qwen2.5-VL (to be merged soon)
- Qwen2
- Mixtral
## Getting Started
### DeepSeek 671b
The recommended image with pre-built megatron dependency is `whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6.post5-mcore0.12.2-te2.3-deepseekv3`, built with the Dockerfile in [docker/Dockerfile.vllm.sglang.megatron.deepseek](https://github.com/volcengine/verl/blob/main/docker/Dockerfile.vllm.sglang.megatron.deepseek).
For checkpoint loading, we rely on megatron dist-ckpt for resharding. A converted dist-ckpt for DeepSeek-V3 is available from [huggingface BearBiscuit05/dpsk-v3-671B-BF16-dist_ckpt](https://huggingface.co/BearBiscuit05/dpsk-v3-671B-BF16-dist_ckpt/tree/main).
To run end-to-end training on the DAPO dataset, run [recipe/dapo/test_dapo_dspk_671b_megatron.sh](https://github.com/volcengine/verl/blob/main/recipe/dapo/test_dapo_dspk_671b_megatron.sh). It runs on 512 H20(96GB) GPUs with the following setup:
- vllm rollout with TP=32, bfloat16
- megatron training with attention DP, MoE EP=32, PP=16, bfloat16
MTP is disabled during RL training.
### Qwen3 236b
For Qwen3-236b, please refer to [examples/grpo_trainer/run_qwen3-236b_megatron.sh](https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-236b_megatron.sh), which runs on 128 H20(96GB) GPUs.
## Upcoming Optimizations
The community continues to optimize large MoE models further; ongoing efforts include:
- further optimizing memory consumption and providing recommended/tuned configurations for various machine types
- optimizing long context RL training performance
- performance improvement with SGLang x Megatron
We invite the community to try and improve verl together. Get connected with us on [slack](https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA)/[wechat](https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG)/[Github issues](https://github.com/volcengine/verl/issues/708)!
## Acknowledgement
@vermouth1992 @ISEEKYAN @ETOgaosion @yzlnew @ShareLer @BearBiscuit05 @ccclyu @ann-qin-lu @SwordFaith @zzong2006 @zhaochenyang20 @ocss884 @eric-haibin-lin
# NVIDIA Nsight Systems profiling in verl
Last updated: 06/20/2025.
This guide explains how to use NVIDIA Nsight Systems for profiling verl training runs.
## Configuration
Profiling in verl can be configured through several parameters in the trainer configuration file (ppo_trainer.yaml or other files like dapo_trainer.yaml):
### Prerequisites
The Nsight Systems version is important; please refer to `docker/Dockerfile.vllm.sglang.megatron` for the version we used.
### Global profiling control
verl has one single controller process and multiple worker processes. Both controller and worker processes can be profiled. Since the controller process can be executed on any node in the cluster, a message is printed in the log to indicate the controller process's node hostname and process id.
In `trainer`, three new config entries control the profiler behaviors:
* **`trainer.profile_steps`**. List of step numbers at which profiling should be performed. For example, `[1, 2, 5]` profiles steps 1, 2, and 5, while `null` means no profiling.
* **`controller_nsight_options`**. This config group is for the single controller. All fields in this config group are passed directly to Nsight Systems when Ray starts the controller process. `ppo_trainer.yaml` provides a workable example. Users can refer to the [Nsight Systems manual](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and the [Ray user guide](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html) for more details.
* **`worker_nsight_options`**. This config group is for the worker processes. Similarly, all fields in this config group are passed directly to Nsight Systems when Ray starts the worker processes. The capture range controls when the profiler starts and stops, so `capture-range: "cudaProfilerApi"` is fixed; do not change it. Users can change `capture-range-end` with an accurate calculation or just leave it `null`.
### Worker process profiling
verl manages multiple RL roles, _Actor_, _Ref_, _Rollout_, _Critic_ and _Reward_, which are implemented in different Worker classes. These workers can be combined into one Ray Actor, running in a process group. Each RL role has its own profiling config group, `profiler`, which consists of three fields:
* **`all_ranks` and `ranks`**. When `all_ranks` is set to `True`, all ranks are profiled; when set to `False`, only the ranks listed in `ranks` are profiled. By default, verl profiles the whole training process in a series of `worker_process_<PID>.<RID>.nsys-rep` files, one per process rank. PID is the process ID; RID is the capture range ID.
* **`discrete`**. When set to `False`, all the roles' actions in one training step are dumped into one database. When set to `True`, the actions annotated by `DistProfiler.annotate` are dumped into discrete databases. In this case, each role's action occupies one `<RID>`.
* **`actor_rollout_ref`**. This Worker can be configured to contain at most 3 roles, which execute together. So `actor_rollout_ref` has one `profiler` config, and all the roles inside it inherit that config.
* **verl collocate mode**. verl can combine two Worker subclasses into one Worker Actor. In this case, the user should make sure the combined Workers have a consistent `discrete` setting. The Nsight Systems profiler uses a `torch.cuda.profiler.start()` and `stop()` pair to dump a `<step>` database either way.
### Where to find the profiling data
By default the `*.nsys-rep` files are saved in the directory `/tmp/ray/session_latest/logs/nsight/` at each node. According to the Ray manual, this default directory is not changeable. ["however, Ray preserves the `--output` option of the default config"](https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html).
Some users may find this inconvenient, but it is understandable: Ray may start hundreds of processes, and saving the files in one central place would put significant pressure on the network file system.
## Usage Example
To enable profiling for specific components and steps, modify your ppo_trainer.yaml like this:
### Disable profiler
```yaml
trainer:
profile_steps: null # disable profile
```
### Enable profiler and one database for one training step
```yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: False
all_ranks: False
ranks: [0, 1]
critic:
profiler:
discrete: False
all_ranks: False
ranks: [0, 1]
reward_model:
profiler:
discrete: False
all_ranks: False
ranks: [0, 1]
```
### Enable profiler and multiple databases for one training step
```yaml
trainer:
profile_steps: [1, 2, 5]
actor_rollout_ref:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
critic:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
reward_model:
profiler:
discrete: True
all_ranks: False
ranks: [0, 1]
```
## Profiling Output
When profiling is enabled, verl will generate Nsight Systems profiles for the specified components and steps. The profiles will include:
- CUDA kernel execution
- Memory operations
- CPU-GPU synchronization
- NVTX markers for key operations
Nsight Systems supports a multi-report view that opens multiple databases together. In this mode, different processes and steps can be aligned on one timeline for better analysis.
Performance Tuning Guide
==============================
Last updated: 07/17/2025.
Author: `Guangming Sheng <https://github.com/PeterSH6>`_, `Jiali Zheng <https://github.com/CurryRice233>`_
In this section, we will discuss how to tune the performance of all the stages in verl, including:
1. Rollout generation throughput.
2. Enable ``use_remove_padding=True`` for sequence packing (i.e., data packing and remove padding).
3. Batch size tuning for forward and backward computation.
4. Enable ``use_dynamic_bsz=True`` for higher throughput.
5. Utilize Ulysses Sequence Parallel for long context training.
6. LigerKernel for SFT performance optimization.
7. Forward prefetch in the FSDP training backend.
8. Memory optimization for entropy calculation from logits.
Rollout Generation Tuning
--------------------------
verl currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend setting ``actor_rollout_ref.rollout.disable_log_stats=False`` so that rollout statistics are logged.
- Increase ``gpu_memory_utilization``.

  - For vLLM v0.7.0 and later, the vLLM instance only uses ``gpu_memory_utilization`` of the **total** GPU memory.
  - For SGLang, it is the fraction of the free GPU memory used for **static** memory such as model weights and the KV cache. However, the remaining ``(1 - gpu_memory_utilization)`` is also used during inference.

  In either case, if model parameters and optimizer states are not offloaded, using too high a fraction can lead to OOM.
  A value between 0.5 and 0.7 often strikes a good balance between high throughput and avoiding OOM.

  Note: since the definition of ``gpu_memory_utilization`` varies across inference engines, a value that works well for one engine may cause OOM for another.
- Adjust ``max_num_seqs`` or ``max_num_batched_tokens``.
  If the GPU cache utilization is relatively low in the log, increasing ``max_num_seqs`` or ``max_num_batched_tokens``
  can enlarge the effective batch size in the decoding stage, allowing more concurrent requests per batch.
  We recommend setting ``max_num_batched_tokens > 2048`` for higher throughput.
- Use a smaller ``tensor_parallel_size``.
  When GPU resources allow, a smaller tensor parallel size spawns more vLLM replicas.
  Data parallelism (DP) can yield higher throughput than tensor parallelism (TP), but it also increases KV cache consumption.
  Carefully balance the trade-off between more replicas and higher memory usage.
  Our experiment in Sec. 8.4 of the `HybridFlow paper <https://arxiv.org/pdf/2409.19256v2>`_ evaluates this trade-off.
More tuning details, such as dealing with preemption and chunked prefill,
can be found in the `vLLM official tuning guide <https://docs.vllm.ai/en/latest/performance/optimization.html>`_.
For optimal performance, we recommend using vLLM v0.8.3 or later. See https://github.com/volcengine/verl/blob/main/docs/README_vllm0.8.md for details.
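The sketch below combines the knobs above into one set of command-line overrides. It is illustrative only: it assumes the usual ``verl.trainer.main_ppo`` entry point, omits the required data/model arguments, and the values are starting points rather than recommendations.

.. code-block:: bash

    # Illustrative rollout-tuning overrides (data/model arguments omitted).
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.rollout.disable_log_stats=False \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
        actor_rollout_ref.rollout.max_num_batched_tokens=8192 \
        actor_rollout_ref.rollout.max_num_seqs=1024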
Enable remove padding (sequence packing)
-----------------------------------------
Currently, for llama, mistral, gemma1 and qwen based models, users can enable ``use_remove_padding=True`` to utilize the
sequence packing implementation provided by the transformers library.
For other models, the transformers library may also support it, but we haven't tested it yet.
Users can add the desired model config to the `test_transformer.py <https://github.com/volcengine/verl/blob/main/tests/models/test_transformer.py#L24>`_ file
and test its functionality by running the following command:
.. code-block:: bash

    pytest -s tests/models/test_transformer.py
If the test passes, you can add your desired model into the model `registry.py <https://github.com/volcengine/verl/blob/main/verl/models/registry.py#L24>`_ file.
Then, you can enjoy the performance boost of sequence packing.
You are welcome to open a PR with your tested model to verl!
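For reference, a hedged sketch of enabling it from the command line; the ``actor_rollout_ref.model.use_remove_padding`` and ``critic.model.use_remove_padding`` paths follow the naming convention used elsewhere in this guide and should be checked against your config:

.. code-block:: bash

    # Enable sequence packing (remove padding) for the actor/ref and critic models
    # (other required data/model arguments omitted).
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.model.use_remove_padding=True \
        critic.model.use_remove_padding=True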
Batch Size Tuning
-----------------
To achieve higher throughput in experience preparation (i.e., model fwd) and model update (i.e., actor/critic fwd/bwd),
users may need to tune the ``*micro_batch_size_per_gpu`` for different computations.
In verl, the core principle for setting batch sizes is:
- **Algorithmic metrics** (train batch size, PPO mini-batch size) are *global* (from a single-controller perspective)
  and are normalized in each worker. See the `normalization code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py#L120-L122>`_.
- **Performance-related parameters** (micro batch size, max token length for dynamic batch size) are *local* parameters that define the per-GPU data allocation.
  See the `normalization code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py#L127>`_.
.. note:: In your training script, please use ``*micro_batch_size_per_gpu`` instead of ``*micro_batch_size``,
   so that you don't need to handle the normalization yourself; ``*micro_batch_size`` will be deprecated.
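As a small worked example of this normalization (illustrative numbers; assuming a single data-parallel group of 8 GPUs and the usual ``verl.trainer.main_ppo`` entry point):

.. code-block:: bash

    # Global sizes are divided by the number of GPUs; the per-GPU micro batch size
    # then determines the number of gradient-accumulation steps:
    #   train batch size   1024 -> 1024 / 8 = 128 sequences per GPU per step
    #   PPO mini-batch      256 ->  256 / 8 =  32 sequences per GPU per mini-batch
    #   micro batch per GPU   4 ->   32 / 4 =   8 gradient-accumulation steps
    python3 -m verl.trainer.main_ppo \
        data.train_batch_size=1024 \
        actor_rollout_ref.actor.ppo_mini_batch_size=256 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4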
Batch Size Tuning tips
""""""""""""""""""""""
Users may need to tune ``*micro_batch_size_per_gpu`` to accelerate training. Here are some tips (a combined sketch follows the list):
1. **Enable gradient checkpointing**:
   Set ``actor_rollout_ref.model.enable_gradient_checkpointing=True`` and ``critic.model.enable_gradient_checkpointing=True``.
   This often allows for larger micro-batch sizes and will be beneficial for large mini-batch training.
2. **Increase the micro-batch size**: raise ``*micro_batch_size_per_gpu`` as much as possible, until it equals the normalized ``mini_batch_size``.
3. **Use larger forward-only parameters**:
   Forward-only parameters, such as ``actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu``,
   ``actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu``, and ``critic.forward_micro_batch_size_per_gpu``, can be larger (e.g., 2x) than training-related micro batch sizes,
   such as ``actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`` and ``critic.ppo_micro_batch_size_per_gpu``.
4. **Allow larger micro-batch sizes for Critic and Reward models**:
   The micro batch sizes of the Critic and Reward models can be larger than that of the Actor model, because the Actor model has a much larger vocab size in its final layer.
5. **Enable activation offloading**:
   Set ``actor_rollout_ref.model.enable_activation_offload=True`` and ``critic.model.enable_activation_offload=True``.
   This often works together with gradient checkpointing to allow larger micro-batch sizes, and it is currently only available in the FSDP backend.
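The combined sketch below applies these tips as command-line overrides (illustrative values; the forward-only sizes are simply set to 2x the training micro-batch sizes, as suggested in tip 3):

.. code-block:: bash

    # Illustrative batch-size tuning overrides; adjust the values to your hardware.
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.model.enable_activation_offload=True \
        critic.model.enable_gradient_checkpointing=True \
        critic.model.enable_activation_offload=True \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
        critic.ppo_micro_batch_size_per_gpu=8 \
        critic.forward_micro_batch_size_per_gpu=16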
Tuning for Dynamic Batch Size
-----------------------------
Dynamic batch size is a technique that allows the model to process a similar number of tokens in each forward pass (with varying actual batch sizes).
This can significantly improve training efficiency and reduce memory usage.
To utilize this technique, users can set ``use_dynamic_bsz=True`` in actor, ref, critic and reward models.
With ``use_dynamic_bsz=True``, users don't need to tune ``*micro_batch_size_per_gpu``.
Instead, users should tune the following parameters:
- ``actor_rollout_ref.actor.ppo_max_token_len_per_gpu``, ``critic.ppo_max_token_len_per_gpu``:
  The maximum number of tokens to be processed in fwd and bwd of ``update_policy`` and ``update_critic``.
- ``actor_rollout_ref.ref.log_prob_max_token_len_per_gpu`` and ``actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu``:
  The maximum number of tokens to be processed in the fwd computation of ``compute_log_prob`` and ``compute_ref_log_prob``.
- ``critic.forward_micro_batch_size_per_gpu``, ``reward_model.forward_micro_batch_size_per_gpu``:
  The maximum number of tokens to be processed in the fwd computation of ``compute_values`` and ``compute_rm_score``.
Dynamic Batch Size Tuning tips
""""""""""""""""""""""""""""""
Here are some tips for tuning the above parameters (a combined sketch follows these tips):
1. **Increase** ``actor_rollout_ref.actor.ppo_max_token_len_per_gpu``:
   Make it at least 2 x (max_prompt_length + max_response_length). We set it to 3x in `run_qwen2-7b_rm_seq_balance.sh <https://github.com/volcengine/verl/blob/main/examples/ppo_trainer/run_qwen2-7b_rm_seq_balance.sh#L25>`_.
   Try increasing it further to get higher throughput.
2. **Forward-only parameters can be larger**:
   Similar to the non-dynamic-batch scenario, forward-only token limits can exceed those used in forward/backward operations.
3. **Use larger limits for Critic and Reward models**:
   Critic and Reward parameters can be set to at least 2× the Actor’s limits. For instance, we set them to 4× here:
   `run_qwen2-7b_rm_seq_balance.sh <https://github.com/volcengine/verl/blob/main/examples/ppo_trainer/run_qwen2-7b_rm_seq_balance.sh#L40>`_.
   For example, :math:`\text{critic.ppo\_max\_token\_len\_per\_gpu} = 2 \times \text{actor.ppo\_max\_token\_len\_per\_gpu}`.
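A combined sketch of these settings (illustrative values only; they assume ``max_prompt_length + max_response_length`` of roughly 8k tokens, 3x of that for the actor limit, 2x of the actor limit for forward-only limits, and 4x of the actor limit for the critic; the ``use_dynamic_bsz`` flag paths for the ref/rollout/reward models may differ in your config):

.. code-block:: bash

    # Illustrative dynamic-batch-size overrides (values assume ~8k total sequence length).
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.use_dynamic_bsz=True \
        critic.use_dynamic_bsz=True \
        actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24576 \
        actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=49152 \
        actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=49152 \
        critic.ppo_max_token_len_per_gpu=98304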
Ulysses Sequence Parallel for Long Context Training
----------------------------------------------------
To utilize this technique, users can set ``ulysses_sequence_parallel_size>1`` in the actor, ref, critic and reward models.
Different models can use different ``ulysses_sequence_parallel_size`` values.
To train on long sequences (>32k tokens), users may need to decrease ``*micro_batch_size_per_gpu`` and ``*max_token_len_per_gpu`` to avoid OOM.
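For instance, a hedged sketch (the dotted paths follow the per-model naming used throughout this guide and should be verified against your config):

.. code-block:: bash

    # Illustrative Ulysses sequence-parallel sizes; different models may use different values.
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.ulysses_sequence_parallel_size=4 \
        actor_rollout_ref.ref.ulysses_sequence_parallel_size=4 \
        critic.ulysses_sequence_parallel_size=2 \
        reward_model.ulysses_sequence_parallel_size=2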
LigerKernel for SFT
----------------------
LigerKernel is a high-performance kernel for Supervised Fine-Tuning (SFT) that can improve training efficiency. To enable LigerKernel in your SFT training:
1. Install liger-kernel via ``pip3 install liger-kernel``. In your SFT configuration file (e.g., ``verl/trainer/config/sft_trainer.yaml``), set the ``use_liger`` parameter:
.. code-block:: yaml

    model:
      use_liger: True  # Enable LigerKernel for SFT
2. The default value is ``False``. Enable it only when you want to use LigerKernel's optimizations.
3. LigerKernel is particularly useful for improving training performance in SFT scenarios.
Forward prefetch in FSDP training backend
------------------------------------------
During the training phase, users can enable forward prefetching in FSDP by setting ``fsdp_config.forward_prefetch=True``. For example, ``actor_rollout_ref.actor.fsdp_config.forward_prefetch=True``. This configuration prefetches the next forward-pass all-gather operation before completing the current forward computation, overlapping communication with computation and improving efficiency. For further details, refer to the `FSDP forward_prefetch <https://docs.pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp>`_ documentation.
.. note::

   Backward prefetch is unsupported because the ``BACKWARD_POST`` policy may prefetch incorrectly in nested-module cases. For details, see the `FSDP documentation <https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md?plain=1#L70>`_.
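As a short illustration (same hedged launch sketch as above; the critic has an analogous ``fsdp_config`` block that can be set the same way):

.. code-block:: bash

    # Overlap the next forward all-gather with the current forward computation.
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.actor.fsdp_config.forward_prefetch=True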
Migrating to FSDP2
----------------------
FSDP2 offers notable improvements over FSDP1. According to `PyTorch TorchTitan benchmarks <https://arxiv.org/abs/2410.06511v1>`_:

- 7% lower GPU memory usage on average
- 1.5% throughput improvement with BF16 training
- Better composability with DTensor and per-parameter sharding
**Enabling FSDP2 in VERL:**
.. code-block:: python

    # Enable FSDP2 in actor configuration
    actor_rollout_ref.actor.strategy="fsdp2"
.. note::

   FSDP2 requires PyTorch 2.1+ and is recommended for models with transformer architecture.
Memory optimization for entropy calculation from logits
--------------------------------------------------------
The ``logits`` tensor (typically of shape ``[bsz*seq_len, voc]``) can consume significant memory. When using ``compute_entropy_from_logits``, memory usage reaches approximately ``[bsz*seq_len, voc] × (4 bytes (float32) + 2 bytes (autocast for softmax+logsumexp) + 1 byte (softmax output))``.
To reduce this memory peak, enable chunked computation by setting:
``actor_rollout_ref.ref.entropy_from_logits_with_chunking = True``
This processes the tensor in chunks of shape ``[chunk_size, voc]`` (e.g., 2048) rather than the full sequence length, exclusively during the model's forward pass.
Additionally, during training, standard gradient checkpointing (``enable_gradient_checkpointing=True``) does not apply to entropy calculations. To reduce memory peaks in this context, set:
``actor_rollout_ref.actor.entropy_checkpointing = True``
This enables entropy recomputation specifically for the entropy calculation, lowering memory usage during training.
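A combined sketch of both switches, using the override paths named above (same hedged launch sketch as earlier in this guide):

.. code-block:: bash

    # Chunked entropy computation in forward passes, plus entropy recomputation
    # during training, to lower the memory peak of the logits tensor.
    python3 -m verl.trainer.main_ppo \
        actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \
        actor_rollout_ref.actor.entropy_checkpointing=True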