Prepare Data for Post-Training
========================================
Last updated: 02/09/2025.
Before starting the post-training job, we need to prepare the data for
the policy training. The data should be stored in the parquet format.
We provide several data preprocessing scripts for different datasets,
including GSM8K, MATH, HellaSwag, and Full_hh_rlhf. To prepare other datasets,
follow the steps below. A data preprocessing script can be divided
into two parts:
1. The first part is common across datasets: it loads the dataset with
huggingface's ``datasets`` package, preprocesses it with ``make_map_fn``,
and stores the result in the parquet format.
.. code:: python
import re
import os
import datasets
from verl.utils.hdfs_io import copy, makedirs
import argparse
# To extract the solution for each prompt in the dataset
# def extract_solution(solution_str):
# ...
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--local_dir', default='/opt/tiger/gsm8k')
parser.add_argument('--hdfs_dir', default=None)
args = parser.parse_args()
num_few_shot = 5
data_source = 'openai/gsm8k'
dataset = datasets.load_dataset(data_source, 'main')
train_dataset = dataset['train']
test_dataset = dataset['test']
# Construct a `def make_map_fn(split)` for the corresponding datasets.
# ...
train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)
local_dir = args.local_dir
hdfs_dir = args.hdfs_dir
train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))
if hdfs_dir is not None:
    makedirs(hdfs_dir)
    copy(src=local_dir, dst=hdfs_dir)
2. The second part is dataset-specific: users implement the ``make_map_fn()``
function (as well as ``extract_solution``) on their own to support
different datasets or tasks.
We have already implemented the data preprocessing for the GSM8K, MATH, HellaSwag, and Full_hh_rlhf
datasets, and we take GSM8K as an example below:
**GSM8K**
In ``make_map_fn``, each data item should consist of the following
5 fields:
1. ``data_source``: The name of the dataset, used to index the corresponding
reward function in the ``RewardManager``.
2. ``prompt``: This field should be constructed in the format of
huggingface chat_template. The tokenizer in ``RLHFDataset`` will
apply chat template and tokenize the prompt.
3. ``ability``: Define the task category.
4. ``reward_model``: Currently, we only utilize the ``ground_truth``
field during evaluation. The ``ground_truth`` is computed by the
``extract_solution`` function. **Note** that the implementation of
the corresponding reward function must align with this extracted
``ground_truth``.
5. ``extra_info``: Records additional information about the current prompt. Not
used for now.
.. code:: python
def extract_solution(solution_str):
solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str) # extract the solution after ####
assert solution is not None
final_solution = solution.group(0)
final_solution = final_solution.split('#### ')[1].replace(',', '')
return final_solution
instruction_following = "Let's think step by step and output the final answer after \"####\"."
# add a row to each data item that represents a unique id
def make_map_fn(split):
def process_fn(example, idx):
question = example.pop('question')
question = question + ' ' + instruction_following
answer = example.pop('answer')
solution = extract_solution(answer)
data = {
"data_source": data_source,
"prompt": [{
"role": "user",
"content": question
}],
"ability": "math",
"reward_model": {
"style": "rule",
"ground_truth": solution
},
"extra_info": {
'split': split,
'index': idx
}
}
return data
return process_fn
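For illustration, here is a rough sketch of how ``make_map_fn`` is applied to a single record. The ``question`` and ``answer`` strings below are made-up placeholders rather than real GSM8K rows:

.. code:: python

    # Hypothetical record in the GSM8K schema (question/answer columns).
    example = {
        "question": "Natalia has 5 apples and buys 3 more. How many apples does she have?",
        "answer": "She now has 5 + 3 = 8 apples.\n#### 8",
    }

    row = make_map_fn("train")(example, 0)
    # row["prompt"] is a chat-template-style message list and
    # row["reward_model"]["ground_truth"] == "8"
    print(row["data_source"], row["reward_model"]["ground_truth"])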
Implement Reward Function for Dataset
======================================
Last updated: 06/02/2025.
For each dataset, we need to implement a reward function or utilize a reward model to compute the rewards for the generated responses.
We already pre-implemented some reward functions in `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
You can also use customized reward functions.
Currently, we support reward functions for GSM8k and MATH datasets. For RLHF datasets (e.g.,
full_hh_rlhf) and Code Generation (e.g., APPS), we utilize reward model
and SandBox (will opensource soon) for evaluation respectively.
RewardManager
-------------
In the entrypoint of the PPO Post-Training script `main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py#L33>`_,
we implement a ``RewardManager`` that utilizes the pre-implemented reward functions to compute the scores for each response.
In the ``RewardManager``, we implemented a ``__call__`` function to
compute the score for each response.
All the reward functions are executed by ``compute_score_fn``.
The input is a ``DataProto``, which includes:
- ``input_ids``, ``attention_mask``: ``input_ids`` and ``attention_mask`` after applying
chat_template, including prompt and response
- ``responses``: response tokens
- ``ground_truth``: The ground truth string of the current prompt.
Stored in ``non_tensor_batch`` in the ``DataProto``, which should be
preprocessed in the parquet files.
- ``data_source``: The dataset name of the current prompt. Stored in
``non_tensor_batch`` in the ``DataProto``, which should be
preprocessed in the parquet files.
After the responses are detokenized, the response string and the ground
truth string are passed to ``compute_score_fn`` to compute the
score for each response.
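To make this flow concrete, here is a minimal, illustrative sketch of such a ``__call__`` loop. It is simplified from the actual ``RewardManager`` (batching, tensor shapes, and score placement are omitted); the field names follow the description above:

.. code:: python

    from verl.utils.reward_score import gsm8k

    # Example mapping from data_source to a rule-based scoring function.
    # Extend it with one entry per dataset you preprocess.
    COMPUTE_SCORE_FNS = {
        "openai/gsm8k": gsm8k.compute_score,
    }

    def compute_scores(samples, tokenizer):
        """samples: a list of dicts standing in for one DataProto batch."""
        scores = []
        for sample in samples:
            response_str = tokenizer.decode(sample["responses"], skip_special_tokens=True)
            compute_score_fn = COMPUTE_SCORE_FNS[sample["data_source"]]
            scores.append(compute_score_fn(response_str, sample["ground_truth"]))
        return scores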
Reward Functions
----------------
Pre-implemented
~~~~~~~~~~~~~~~
We already pre-implemented some reward functions in `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
- In the `GSM8k example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py>`_, we
  force the response to output the final answer after four hash signs (``####``), then
  use string matching to compare it with the ground truth. If completely
  correct, the response scores 1 point; if only the format is correct, it scores 0.1 points; if
  the format is incorrect, it scores 0 points. A simplified sketch of this rule follows the list.
- In the `MATH example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/math.py>`_, we follow
the implementation in `lm-evaluation-harness repository <https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hendrycks_math/utils.py>`_.
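The GSM8K rule described above can be approximated by the following simplified sketch (the actual implementation in ``gsm8k.py`` handles answer extraction more carefully):

.. code:: python

    import re

    def gsm8k_rule_score(solution_str, ground_truth, format_score=0.1, score=1.0):
        """Rule-based GSM8K reward: 1.0 if correct, 0.1 if only the format is right, else 0.0."""
        match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
        if match is None:
            return 0.0  # format is incorrect
        answer = match.group(1).replace(",", "")
        return score if answer == ground_truth else format_score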
Customized
~~~~~~~~~~
You can implement customized reward functions in a separate file and specify them using ``custom_reward_function.path`` and ``custom_reward_function.name``. For details on setting them, please refer to :ref:`config-explain-page`.
The parameters of your reward function should be ``data_source``, ``solution_str``, ``ground_truth``, and ``extra_info``.
For example:
.. code:: python
def my_reward_fn(data_source, solution_str, ground_truth, extra_info=None):
return len(solution_str)/100
If you are testing only a single customized reward function, you can simply name it ``compute_score`` and leave ``custom_reward_function.name`` unset.
To run multiple tests with different customized reward functions, you can modify both ``custom_reward_function.path`` and ``custom_reward_function.name`` for each trial.
For instance, you might create a single `my_reward.py` file and implement multiple reward functions within it. This way, for different trials, you only need to adjust ``custom_reward_function.name``, making it more convenient to conduct multiple tests within scripts.
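For example, a hypothetical ``my_reward.py`` could bundle several candidate reward functions (the bodies are placeholders; only the signature convention matters):

.. code:: python

    # my_reward.py -- hypothetical file holding several customized reward functions.
    def length_reward(data_source, solution_str, ground_truth, extra_info=None):
        # Toy reward: longer responses score higher.
        return len(solution_str) / 100

    def exact_match_reward(data_source, solution_str, ground_truth, extra_info=None):
        # Toy reward: 1.0 only when the response matches the ground truth exactly.
        return 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0

You can then keep ``custom_reward_function.path`` pointing at this file and switch only ``custom_reward_function.name`` (e.g., between ``length_reward`` and ``exact_match_reward``) across trials.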
# markdown support
recommonmark
myst_parser
# markdown table support
sphinx-markdown-tables
# theme default rtd
# crate-docs-theme
sphinx-rtd-theme
# pin tokenizers version to avoid env_logger version req
tokenizers==0.21
Interaction System for Multi-turn RL Training
=============================================
Last updated: 06/25/2025.
Overview
--------
The verl interaction system enables dynamic, multi-turn conversational feedback during reinforcement learning training. This system allows models to engage in iterative problem-solving scenarios where interaction agents can provide corrective feedback, guidance, or evaluation based on the model's responses.
**New in Multi-Interaction Support**: The system now supports multiple named interactions within a single training session, enabling sophisticated training scenarios where different samples can use different interaction strategies. This allows for curriculum learning, domain-specific feedback, and flexible agent switching at the sample level.
Key features:
- **Async-based Architecture**: Non-blocking interaction processing for distributed training
- **Instance Management**: Stateful session handling with unique instance IDs for concurrent interactions
- **SGLang Integration**: Seamless integration with SGLang rollout system for multi-turn conversations
- **Configuration-driven**: Dynamic agent loading via YAML configuration files
- **Multi-Interaction Support**: Registry system enabling multiple named interactions per rollout
- **Sample-Level Selection**: Each sample can specify which interaction to use via configuration
- **Reward Integration**: Turn-level scoring mechanism integrated with verl's reward system
Architecture
------------
The interaction system follows a plugin-based architecture with clear separation of concerns:
.. code-block::
Interaction Registry System
BaseInteraction (Abstract Interface)
Multiple Named Interactions (e.g., Gsm8kInteraction, CustomInteraction)
SGLang Rollout Integration (interaction_map)
Sample-Level Interaction Selection
Async Request Lifecycle Management
Core Components
~~~~~~~~~~~~~~~
**Interaction Registry System**
The interaction registry system allows loading and managing multiple named interactions:
.. code-block:: python
from verl.interactions.utils.interaction_registry import initialize_interactions_from_config
# Load multiple interactions from config
interaction_map = initialize_interactions_from_config("config.yaml")
# Access specific interaction by name
gsm8k_interaction = interaction_map["gsm8k"]
custom_interaction = interaction_map["custom_solver"]
**BaseInteraction Interface**
All interaction agents must implement the ``BaseInteraction`` abstract class:
.. code-block:: python
from verl.interactions.base import BaseInteraction
from typing import Dict, Any, List, Tuple, Optional
class BaseInteraction:
def __init__(self, config: Dict[str, Any]):
self.config = config
self.name: str = config.get("name", "interaction_agent")
async def start_interaction(self, instance_id: Optional[str] = None, **kwargs) -> str:
"""Initialize interaction session, return instance_id"""
async def generate_response(self, instance_id: str, messages: List[Dict[str, Any]], **kwargs) -> Tuple[bool, str, float, Dict[str, Any]]:
"""Generate response, return (should_terminate, response, score, metadata)"""
async def calculate_score(self, instance_id: str, **kwargs) -> float:
"""Calculate turn-level score for RL training"""
async def finalize_interaction(self, instance_id: str, **kwargs) -> None:
"""Clean up resources"""
**Request Lifecycle**
The interaction system integrates with SGLang's async rollout via state management:
1. ``PENDING`` → Initialize interaction via ``start_interaction()``
2. ``GENERATING`` → Model generates response
3. ``INTERACTING`` → Process response via ``generate_response()``
4. ``GENERATING`` → Continue if not terminated, otherwise ``COMPLETED``
Configuration
-------------
**Basic Setup**
Enable interaction in your rollout configuration:
.. code-block:: yaml
actor_rollout_ref:
rollout:
multi_turn:
enable: true
interaction_config_path: "path/to/interaction_config.yaml"
max_user_turns: 10
max_assistant_turns: 10
**Interaction Configuration File**
Create an interaction configuration file (e.g., ``interaction_config.yaml``):
**Single Interaction (Legacy Format)**
.. code-block:: yaml
interaction:
- name: "gsm8k"
class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
config: {}
**Multiple Interactions (New Format)**
.. code-block:: yaml
interaction:
- name: "gsm8k"
class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
config: {}
- name: "custom_solver"
class_name: "custom.interactions.CustomInteraction"
config:
solver_type: "advanced"
timeout: 30
- name: "code_verifier"
class_name: "verl.interactions.base.BaseInteraction"
config:
verification_mode: "strict"
**Automatic Name Generation**
If no ``name`` field is provided, the system will automatically generate one from the class name:
.. code-block:: yaml
interaction:
- class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
config: {}
# Automatically generates name: "gsm8k"
The system will dynamically load all specified interaction classes and make them available by name.
Implementation Example: GSM8K
-----------------------------
The GSM8K interaction demonstrates a complete implementation for math problem-solving scenarios:
.. code-block:: python
from verl.interactions.base import BaseInteraction
from verl.utils.reward_score import gsm8k
from uuid import uuid4
class Gsm8kInteraction(BaseInteraction):
def __init__(self, config: dict):
super().__init__(config)
self._instance_dict = {}
async def start_interaction(self, instance_id=None, ground_truth=None, **kwargs):
if instance_id is None:
instance_id = str(uuid4())
self._instance_dict[instance_id] = {
"response": "",
"ground_truth": ground_truth,
"reward": 0.0,
}
return instance_id
async def generate_response(self, instance_id, messages, **kwargs):
# Extract last user message content
content = ""
for item in reversed(messages):
if item.get("role") == "assistant":
content = item.get("content", "")
break
# Ensure GSM8K format (#### prefix)
self._instance_dict[instance_id]["response"] = content
reward = await self.calculate_score(instance_id)
if reward == 1.0:
return True, "Your response is correct!", 1.0, {}
else:
return False, "Your response is incorrect! You need to reflect on your answer and try again.", 0.0, {}
async def calculate_score(self, instance_id, **kwargs):
return gsm8k.compute_score(
self._instance_dict[instance_id]["response"],
self._instance_dict[instance_id]["ground_truth"],
method="strict", format_score=0.0, score=1.0,
)
async def finalize_interaction(self, instance_id, **kwargs):
del self._instance_dict[instance_id]
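As a quick sanity check outside of training, the interaction can be driven manually with a short async script (the ground truth and message contents below are made up):

.. code-block:: python

    import asyncio
    from verl.interactions.gsm8k_interaction import Gsm8kInteraction

    async def main():
        interaction = Gsm8kInteraction({"name": "gsm8k"})
        instance_id = await interaction.start_interaction(ground_truth="4")
        messages = [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "#### 4"},
        ]
        should_terminate, response, reward, _ = await interaction.generate_response(instance_id, messages)
        print(should_terminate, response, reward)
        await interaction.finalize_interaction(instance_id)

    asyncio.run(main())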
Training Integration
--------------------
**Training Script Configuration**
Include interaction configuration in your training command:
.. code-block:: bash
python3 -m verl.trainer.main_ppo \\
--config-path="$CONFIG_PATH" \\
--config-name='gsm8k_multiturn_grpo_w_interaction' \\
algorithm.adv_estimator=grpo \\
data.train_batch_size=512 \\
data.return_raw_chat=True \\
actor_rollout_ref.rollout.name=sglang \\
actor_rollout_ref.rollout.multi_turn.interaction_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/interaction_config/gsm8k_interaction_config.yaml" \\
trainer.total_epochs=15
**Data Requirements**
Ensure your dataset includes interaction parameters with the ``name`` field for interaction selection:
.. code-block:: python
# Dataset should include interaction_kwargs in non_tensor_batch
interaction_kwargs = [
{"name": "gsm8k", "query": "What is 2+2?", "ground_truth": "4"},
{"name": "custom_solver", "query": "Solve: x^2 + 5x + 6 = 0", "ground_truth": "x = -2, -3"},
{"name": "gsm8k", "query": "What is 3+3?", "ground_truth": "6"},
]
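These fields can be produced during data preprocessing, in the same ``make_map_fn`` style used for parquet preparation. The sketch below places ``interaction_kwargs`` under ``extra_info``; treat the exact placement and any fields other than ``name`` as assumptions to be checked against your dataset class:

.. code-block:: python

    # Sketch: emit interaction_kwargs while building each parquet row.
    def make_map_fn(split):
        def process_fn(example, idx):
            question = example["question"]
            ground_truth = example["ground_truth"]
            return {
                "data_source": "openai/gsm8k",
                "prompt": [{"role": "user", "content": question}],
                "ability": "math",
                "reward_model": {"style": "rule", "ground_truth": ground_truth},
                "extra_info": {
                    "split": split,
                    "index": idx,
                    "interaction_kwargs": {
                        "name": "gsm8k",
                        "query": question,
                        "ground_truth": ground_truth,
                    },
                },
            }
        return process_fn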
**Sample-Level Interaction Selection**
Each sample can specify which interaction to use via the ``name`` field. This enables flexible training scenarios where different samples use different interaction strategies:
.. code-block:: python
# Example: Math problems use GSM8K interaction, code problems use code verifier
data_samples = [
{
"prompt": "What is 15% of 200?",
"interaction_kwargs": {
"name": "gsm8k",
"query": "What is 15% of 200?",
"ground_truth": "30"
}
},
{
"prompt": "Write a function to check if a number is prime",
"interaction_kwargs": {
"name": "code_verifier",
"code_type": "python",
"expected_behavior": "return True for prime numbers"
}
}
]
**Backward Compatibility**
If no ``name`` field is provided in ``interaction_kwargs``, the system defaults to ``"gsm8k"`` for backward compatibility.
Best Practices
--------------
**Resource Management**
- Always implement proper cleanup in ``finalize_interaction()``
- Use unique instance IDs to avoid conflicts in concurrent training
- Handle edge cases like empty messages or malformed content
**Performance Optimization**
- Keep interaction logic lightweight to avoid blocking training
- Use async/await properly to maintain non-blocking behavior
- Consider caching expensive computations within interaction instances
**Testing**
Comprehensive testing is essential for interaction systems:
.. code-block:: python
import pytest
from unittest.mock import patch
@pytest.mark.asyncio
async def test_interaction_workflow():
interaction = YourInteraction({})
# Test complete workflow
instance_id = await interaction.start_interaction(ground_truth="expected_answer")
messages = [{"role": "user", "content": "user_content"}, {"role": "assistant", "content": "assistant_response"}]
should_terminate, response, reward, metadata = await interaction.generate_response(instance_id, messages)
assert should_terminate in [True, False]
assert isinstance(reward, float)
await interaction.finalize_interaction(instance_id)
Advanced Usage
--------------
**Multi-Interaction Training Strategies**
You can design sophisticated training scenarios using multiple interactions:
.. code-block:: python
# Example: Progressive difficulty with different interaction agents
class MathTrainingPipeline:
def create_interaction_config(self):
return {
"interaction": [
{
"name": "basic_math",
"class_name": "verl.interactions.gsm8k_interaction.Gsm8kInteraction",
"config": {"difficulty": "easy"}
},
{
"name": "advanced_math",
"class_name": "custom.interactions.AdvancedMathInteraction",
"config": {"difficulty": "hard", "allow_hints": True}
},
{
"name": "competition_math",
"class_name": "custom.interactions.CompetitionMathInteraction",
"config": {"time_limit": 300, "show_steps": False}
}
]
}
def create_curriculum_data(self, epoch):
if epoch < 5:
return [{"name": "basic_math", ...} for _ in samples]
elif epoch < 10:
return [{"name": "advanced_math", ...} for _ in samples]
else:
return [{"name": "competition_math", ...} for _ in samples]
**Custom Scoring Functions**
You can integrate custom reward functions:
.. code-block:: python
async def calculate_score(self, instance_id, **kwargs):
response = self._instance_dict[instance_id]["response"]
ground_truth = self._instance_dict[instance_id]["ground_truth"]
# Custom evaluation logic
if custom_evaluation_function(response, ground_truth):
return 1.0
else:
return 0.0
**Multi-step Interactions**
For complex scenarios requiring multiple feedback rounds:
.. code-block:: python
async def generate_response(self, instance_id, messages, **kwargs):
instance = self._instance_dict[instance_id]
instance["attempts"] += 1
# Evaluate current response
reward = await self.calculate_score(instance_id)
if reward > 0.8:
return True, "Excellent work!", reward, {}
elif instance["attempts"] < 3:
return False, "Good attempt, but try to improve...", reward, {}
else:
return True, "Maximum attempts reached.", reward, {}
Troubleshooting
---------------
**Common Issues**
1. **Instance ID Conflicts**: Ensure unique instance IDs across concurrent sessions
2. **Memory Leaks**: Always call ``finalize_interaction()`` to clean up resources
3. **Blocking Operations**: Keep interaction logic async and non-blocking
4. **Configuration Errors**: Verify interaction config path and class name are correct
5. **Interaction Name Conflicts**: Ensure all interactions have unique names in the configuration
6. **Missing Interaction**: Verify the ``name`` field in ``interaction_kwargs`` matches available interactions
7. **Backward Compatibility**: When migrating from single to multi-interaction, add ``name`` fields to existing data
**Debugging**
Enable debug logging to trace interaction flow:
.. code-block:: bash
export VERL_LOGGING_LEVEL=DEBUG
**Performance Monitoring**
Monitor interaction performance impact on training throughput and adjust accordingly.
Related Documentation
---------------------
- :doc:`multiturn`: Basic multi-turn rollout configuration
- :doc:`sandbox_fusion`: Tool integration with SGLang
- :doc:`search_tool_example`: Search tool implementation example
Multi-turn Rollout Support
==========================
Last updated: 06/27/2025.
Basic Configuration
~~~~~~~~~~~~~~~~~~~
To enable multi-turn rollout, make sure to configure the following fields in your rollout configuration:
.. code-block:: yaml
actor_rollout_ref:
rollout:
multi_turn: True
name: "sglang"
This configuration activates the SGLang engine for multi-turn interaction during rollout.
Custom Tool Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~
For custom environment interaction tools, you can implement your own tools based on ``verl.tools.base_tool.BaseTool``. Then, specify your tool configurations in a YAML file:
.. code-block:: yaml
tools:
- class_name: ""
config:
type: native
tool_schema:
You may refer to GSM8KTool_example_configuration_, which is one example of the tool configurations. Its implementation can be found in gsm8k_tool.py_.
Finally, set the ``tools_config_file`` in your rollout config:
.. code-block:: yaml
actor_rollout_ref:
rollout:
tool_kwargs:
tools_config_file: <path_to_tool_yaml_file>
This allows integration of customized tool behaviors during actor rollout steps.
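As a starting point, a minimal custom tool could look like the sketch below. The method names mirror the ``SandboxFusionTool`` example in the Sandbox Fusion documentation; verify the exact signatures against ``verl.tools.base_tool.BaseTool`` before relying on them:

.. code-block:: python

    from typing import Any, Optional, Tuple
    from uuid import uuid4

    from verl.tools.base_tool import BaseTool

    class EchoTool(BaseTool):
        """Toy tool that simply echoes back its input; for illustration only."""

        def __init__(self, config: dict, tool_schema):
            super().__init__(config, tool_schema)
            self._instance_dict = {}

        async def create(self, instance_id: Optional[str] = None, **kwargs) -> str:
            instance_id = instance_id or str(uuid4())
            self._instance_dict[instance_id] = {}
            return instance_id

        async def execute(self, instance_id: str, parameters: dict[str, Any], **kwargs) -> Tuple[str, float, dict]:
            # Returns (tool_response, step_reward, metrics).
            return f"echo: {parameters.get('text', '')}", 0.0, {}

        async def calc_reward(self, instance_id: str, **kwargs) -> float:
            return 0.0

        async def release(self, instance_id: str, **kwargs) -> None:
            self._instance_dict.pop(instance_id, None)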
If you want rollout with simulated interaction, you can set the ``interaction_config_file`` in your rollout config:
.. code-block:: yaml
interaction:
- class_name: ""
config: {}
.. code-block:: yaml
actor_rollout_ref:
rollout:
interaction_config_file: <path_to_interaction_yaml_file>
If your tool creates multi-modal inputs, you should return a list of multi-modal inputs in your tool.execute() implementation.
Image and video should be processed before returning. For example, if you are using Qwen2.5-VL, you can use the following code to get the representations:
.. code-block:: python
async def execute(self, ...) -> Tuple[str | Dict[str, Any], float, dict]:
...
from verl.utils.dataset.vision_utils import process_image, process_video
img1 = process_image(img1)
video1 = process_video(video1)
# vLLM expects the keys "image" and "video" (not "images"/"videos"), so use "image"/"video" to specify the lists of images/videos
# link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
return {"image": [img1, ...], "video": [video1, ...], "text": "..."}, 0, {}
Remember to set ``return_multi_modal_inputs: False`` in your dataset config so that the multi-modal inputs are processed correctly during rollout.
Refer to the `Handling Multi-Modal Inputs in Datasets`_ section for more details.
MCP Tool Configuration
~~~~~~~~~~~~~~~~~~~~~~
For MCP interaction tools, you can flexibly configure them using a YAML file. The typical setup is as follows:
.. code-block:: yaml
tools:
- class_name: ""
config:
type: mcp
mcp:
mcp_servers_config_path: ./mcp_server.json
tool_selected_list: {}
The ``tool_selected_list`` field is optional and specifies which tools to use from the servers. If you want to enable all available tools, simply omit this attribute. Besides, ``mcp_servers_config_path`` points to a JSON file containing the MCP server configurations. For example:
.. code-block:: json
{
"mcpServers": {
"SSE Server": {
"url": "your_server_url",
"auth_token": "your_server_api_token"
},
"STDIO Server": {
"command": "npx",
"args": ["-y", "server-mcp@0.2.1"],
"env": {
"SERVER_API_KEY": "your_server_api_token"
}
}
}
}
Since the content formats returned by the MCP server may vary, users can inherit from ``MCPBaseTool`` and override the ``_parse_tool_result`` method to implement custom parsing logic.
.. code-block:: python
class MCPYourTool(MCPBaseTool):
def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
super().__init__(config, tool_schema)
def _parse_tool_result(self, content: list) -> Tuple[str, dict]:
...
Overall, you may refer to mcp_search_tool.py_ and mcp_tool_config.yaml_ for custom implementation and configuration.
Multi-turn Tokenization
~~~~~~~~~~~~~~~~~~~~~~~
Tokenizing multi-turn rollouts poses a challenge: after applying the chat template and tokenizing the full message list, it's hard to identify which tokens belong to assistant messages. Since the token list is flat, it lacks direct alignment with the message roles.
To address this, we adopt a **delta-based tokenization** strategy. Each time the LLM generates a new message, we:
1. Apply the chat template to all prior messages (`messages[:i]`).
2. Apply the chat template again including the latest message (`messages[:i+1]`).
3. Tokenize only the *delta* between these two serialized message strings.
This ensures that only tokens generated by the assistant are included in the loss mask.
.. code-block:: python
# When using tokenizer
# Exclude the assistant prompt (e.g., "<|im_start|>assistant") from the loss by setting add_generation_prompt=True
prev = tokenizer.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)  # Mask only the new assistant tokens
.. code-block:: python
# When using processor
# Exclude the assistant prompt (e.g., "<|im_start|>assistant") from the loss by setting add_generation_prompt=True
prev = processor.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
prev_input_ids = processor(text=prev, images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
curr = processor.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
curr_input_ids = processor(text=curr, images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
delta_ids = curr_input_ids[len(prev_input_ids):]
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)  # Mask only the new assistant tokens
While we've validated this produces consistent results with full message tokenization, future models' chat template could break compatibility. To guard against silent inconsistencies, we compare the delta-based tokenization with full-tokenization results by default at the end of each rollout.
If you see the following warning, you can check the mismatched substring in the log:
.. code-block::
Inconsistent training and inference tokenization detected. This may lead to unexpected behavior during training. Please review your chat template to determine if this is intentional. For more information, refer to the multiturn README.md.
The tokenization sanity check mode can be configured using the ``actor_rollout_ref.rollout.multi_turn.tokenization_sanity_check_mode`` parameter, which accepts the following values:
- ``strict`` (default): Performs strict comparison between delta-based and full tokenization results, raising warnings for any differences.
- ``ignore_strippable``: Ignores differences in whitespace characters (``\n``, ``\t``, ``\r``, spaces) while still checking for meaningful text mismatches. This is useful when debugging chat template issues where whitespace variations are expected and acceptable.
- ``disable``: Completely disables the tokenization sanity check. Only use this if you have thoroughly validated that tokenization discrepancies are expected and won't impact training.
Example configuration:
.. code-block:: yaml
actor_rollout_ref:
rollout:
multi_turn:
tokenization_sanity_check_mode: "ignore_strippable" # Choose from: "disable", "ignore_strippable", "strict"
Handling Multi-Modal Inputs in Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If your dataset includes multi-modal inputs (such as images or videos), you can control whether these are pre-processed and included in each sample by setting the ``return_multi_modal_inputs`` flag in your dataset config (used by ``RLHFDataset``).
- ``return_multi_modal_inputs: True`` (default): The dataset will pre-process and include a ``multi_modal_inputs`` dictionary for each sample. This dict contains the model-ready representations (e.g., image tensors, video tensors, etc.) as produced by your processor. This is useful for single-turn or SFT-style training, where the model expects all modalities to be present in the batch.
- ``return_multi_modal_inputs: False``: The dataset will not include the ``multi_modal_inputs`` field. This is recommended for multi-turn RL or tool-augmented rollouts, where the model may generate new multi-modal inputs dynamically during rollout, and you want to avoid conflicts or redundant data in the batch.
Special Cases
^^^^^^^^^^^^^
Some models (e.g., Qwen/QwQ-32B and Qwen3 series) remove internal reasoning content during chat template rendering. As a result, the message content can vary across turns, making the delta-based tokenization inaccurate.
For example, for the following conversation:
.. code-block:: python
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2 + 2?"},
{"role": "assistant", "content": "<think>user asked about a simple math question.</think> 2 + 2 = 4."},
{"role": "user", "content": "Explain why."},
{"role": "assistant", "content": "<think>user wants to know the reasoning behind the answer. Search for a good explanation</think>",
"tool_calls": [{"id": "tool1", "type": "search", "arguments": {"query": "Why is 2 + 2 = 4?"}}]},
{"role": "tool", "content": "The sum of two and two is four because it is a basic arithmetic operation."},
{"role": "assistant", "content": "<think>The tool provided a good explanation.</think>The sum of two and two is four because it is a basic arithmetic operation."}
]
1. Qwen/QwQ-32B will remove all reasoning content except the last assistant message after applying the chat template.
.. code-block:: text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>The tool provided a good explanation.</think> The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>
2. Qwen3 series will remove all reasoning content before the last user message.
.. code-block:: text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<think>
user wants to know the reasoning behind the answer. Search for a good explanation
</think>
<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>
The tool provided a good explanation.
</think>
The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>
To handle this, we fall back to a **fixed base conversation** containing only a single system and user message. Since this base doesn't include assistant messages or reasoning content, it remains consistent across turns.
.. code-block:: python
BASE_CHAT_HISTORY = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "I am a user."}
]
prev = tokenizer.apply_chat_template(BASE_CHAT_HISTORY, add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template([*BASE_CHAT_HISTORY, messages[i]], add_generation_prompt=False, tokenize=False)
delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)
This method works well for Qwen3 series. However, Qwen/QwQ-32B currently has a bug in its chat template. A fix_ has been proposed but not yet adopted. Until then, use the following command to download the fixed model revision:
.. code-block:: bash
pip install huggingface_hub
huggingface-cli download Qwen/QwQ-32B --revision refs/pr/81
.. _fix: https://huggingface.co/Qwen/QwQ-32B/discussions/81
Discrepancy Between Training and Inference Templates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Although the above approach fixes the delta mismatch issue, the removal of reasoning content in the inference-time chat template introduces a new discrepancy: training uses the full reasoning content, while inference does not.
This mismatch can affect model performance in unpredictable ways. To avoid it, we default to using the full response (including reasoning) for both training and rollout.
However, this approach comes with trade-offs:
1. Long reasoning contents can easily exceed the model's context window, especially in multi-turn rollout.
2. There is now a mismatch between rollout and the production environment: models will not have reasoning content from past turns if you use the default chat template in production.
We are still evaluating the impact of these issues. If you experience context length problems or prefer rollouts that match production (i.e., exclude reasoning), you can enable:
``actor_rollout_ref.rollout.multi_turn.use_inference_chat_template = True``
GSM8K Multi-turn Training Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See the training performance of multi-turn rollout on the GSM8K task HERE_.
.. _HERE: https://wandb.ai/zhaochenyang20/gsm8k_async_rl/runs/1ro1r7om?nw=nwuserzhaochenyang20
.. _GSM8KTool_example_configuration: https://github.com/volcengine/verl/blob/main/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml
.. _gsm8k_tool.py: https://github.com/volcengine/verl/blob/main/verl/tools/gsm8k_tool.py
.. _mcp_search_tool.py: https://github.com/volcengine/verl/blob/main/verl/tools/mcp_search_tool.py
.. _mcp_tool_config.yaml: https://github.com/volcengine/verl/blob/main/examples/sglang_multiturn/config/tool_config/mcp_tool_config.yaml
Interaction System
~~~~~~~~~~~~~~~~~~
For dynamic conversational feedback during RL training, see:
.. toctree::
:maxdepth: 1
interaction_system
Search Tool Integration
~~~~~~~~~~~~~~~~~~~~~~~
.. toctree::
:maxdepth: 1
search_tool_example
Code Walkthrough
~~~~~~~~~~~~~~~~~~~~~~~
If you want to learn more in depth about the code execution flow, please read https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/rlhf/verl/multi-turn/code-walk-through
===============================
Sandbox Fusion Tool Integration
===============================
Last updated: 06/10/2025.
Motivations
===========
- As users of verl, we want to allow the model to call certain tools during Actor rollout, incorporating the results into the training process.
- A colleague from ByteDance proposed a paper aimed at enhancing model capability through code execution tools.
- We aim to support tool-calling capabilities of inference engines using `sandbox-fusion` as the code execution system, providing the community with a reimplementation of `retools`.
Reward Compute with Sandbox Fusion + FaaS Integration
=====================================================
- In current datasets and tasks, similar work already exists (e.g., Prime), which uses local processes as runners to execute model-generated code for reward computation.
- On this basis, #1429 has advanced the design by integrating FaaS as the runner for reward computation.
Goals
=====
- Adapt to the `sglang` tool-calling protocol and define tools for sandbox fusion.
- Integrate with the `async-rollout` process, ensuring sandbox fusion tools follow asyncIO conventions.
- Design and implement a basic rate limiter to prevent issues such as 429 errors.
Non-Goals
=========
- Training effectiveness is out of scope.
- Observability metrics are not considered.
- Distributed failover and component fault tolerance are not addressed.
Design Details
==============
Tool Schema Definition
----------------------
- Currently, only code execution is considered, requiring a `code` field in the JSON from the model.
- Only Python code is supported for now, so no `language` parameter is defined.
.. code-block:: python
OpenAIFunctionToolSchema(
type="function",
function=OpenAIFunctionSchema(
name="code_interpreter",
description="A tool for executing code.",
parameters=OpenAIFunctionParametersSchema(
type="object",
properties={
"code": OpenAIFunctionPropertySchema(
type="string",
description="The code to execute.",
enum=None,
)
},
required=["code"],
),
strict=False,
)
)
Configuration Parameters
--------------------------
+----------------------------+--------------------------------------------------------------+
| Parameter Name | Description |
+============================+==============================================================+
| `num_workers` | Number of worker threads/processes per DP to request runner. |
+----------------------------+--------------------------------------------------------------+
| `rate_limit` | Global limit of concurrent code executions. Default: 10 |
+----------------------------+--------------------------------------------------------------+
| `default_timeout` | Timeout (in seconds) for each code execution. Default: 30 |
+----------------------------+--------------------------------------------------------------+
| `default_language` | Default programming language. Default: "python" |
+----------------------------+--------------------------------------------------------------+
| `enable_global_rate_limit` | Whether to enable global rate limiting. Default: True |
+----------------------------+--------------------------------------------------------------+
| `sandbox_fusion_url` | URL for the veFaas sandbox execution service |
+----------------------------+--------------------------------------------------------------+
Rate Limiting Design
-----------------------
Objective:
- Limit the number of inflight requests using a token bucket model.
- Ensure ordered submission to code runners to avoid starvation due to backoff.
Design Highlights:
- Use Ray Global Actor as a singleton distributed counter at cluster level.
- Semaphore used for counting, with `acquire` and `release` in separate thread pools to preserve order.
- Use Ray’s cloud-pickle to serialize functions for decoupled `ExecutionWorker`.
.. code-block:: python
@ray.remote(concurrency_groups={"acquire": 1,"release": 10})
class TokenBucketWorker:
def __init__(self, rate_limit: int):
self.rate_limit = rate_limit
self.current_count = 0
self._semaphore = threading.Semaphore(rate_limit)
@ray.method(concurrency_group="acquire")
def acquire(self):
self._semaphore.acquire()
self.current_count += 1
@ray.method(concurrency_group="release")
def release(self):
self._semaphore.release()
self.current_count -= 1
def get_current_count(self):
return self.current_count
class ExecutionWorker:
def __init__(self, enable_global_rate_limit=True, rate_limit=10):
self.rate_limit_worker = self._init_rate_limit(rate_limit) if enable_global_rate_limit else None
def _init_rate_limit(self, rate_limit):
return TokenBucketWorker.options(name="rate-limiter", get_if_exists=True).remote(rate_limit)
def execute(self, fn: Callable[..., T], *fn_args, **fn_kwargs) -> T:
with ExitStack() as stack:
stack.callback(self.rate_limit_worker.release.remote)
ray.get(self.rate_limit_worker.acquire.remote())
try:
return fn(*fn_args, **fn_kwargs)
except Exception as e:
logger.warning(f"Error when executing code: {e}")
def init_execution_pool(num_workers: int, enable_global_rate_limit=True, rate_limit=10, mode: PoolMode=PoolMode.ThreadMode):
if mode == PoolMode.ThreadMode:
return ray.remote(ExecutionWorker).options(max_concurrency=num_workers).remote(
enable_global_rate_limit=enable_global_rate_limit,
rate_limit=rate_limit
)
else:
raise NotImplementedError("Process mode is not implemented yet")
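A rough usage sketch of the pieces above, assuming Ray is already initialized; ``run_user_code`` is a stand-in for the actual sandbox request:

.. code-block:: python

    import ray

    ray.init(ignore_reinit_error=True)

    def run_user_code(snippet: str) -> str:
        # Stand-in for the real sandbox-fusion call.
        return f"executed: {snippet}"

    pool = init_execution_pool(num_workers=4, enable_global_rate_limit=True,
                               rate_limit=10, mode=PoolMode.ThreadMode)
    futures = [pool.execute.remote(run_user_code, f"print({i})") for i in range(8)]
    print(ray.get(futures))  # at most `rate_limit` snippets run concurrently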
Tool Implementation
-------------------
- Use `instance_id` to identify requests across multiple dialogue rounds.
- Use `execution_pool` to implement async invocation.
- Cleanup state after rollout completion.
.. code-block:: python
class SandboxFusionTool(BaseTool):
def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
...
self.execution_pool = init_execution_pool(...)
...
async def create(self, instance_id: Optional[str] = None, ...):
...
async def execute(self, instance_id: str, parameters: dict[str, Any], **kwargs) -> Tuple[str, float, dict]:
code = parameters.get("code", "")
timeout = parameters.get("timeout", self.default_timeout)
language = parameters.get("language", self.default_language)
if not isinstance(code, str):
code = str(code)
result = await self.execution_pool.execute.remote(self.execute_code,instance_id,code,timeout,language)
self._instance_dict[instance_id]["reward"].append(result.strip())
return result, result, {}
def execute_code(self,instance_id,code,timeout=30,language="python"):
result_status, metadata = _process_single_case(0, None, None,self.sandbox_fusion_url, code, timeout, language)
# we should always expect this since we don't have correct answer
if metadata["run_status"] == "Finished":
actual_output = metadata["stdout"] if metadata["stdout"] is not None else ""
return actual_output
else:
return "no stdout here"
async def calc_reward(self, instance_id: str, ...):
...
async def release(self, instance_id: str, ...):
...
Test Plan
=========
Unit Tests
----------
- **test_tools_registration**: Test tool registration and initialization.
- **test_rollout_req_creation**: Validate that `AsyncRolloutReq` is built correctly.
- **test_over_size_case**: Ensure rollout terminates early when exceeding `max_seq_len`.
- **test_tool_call_basic_case**: Mock `sglang` output, validate tool call and result.
- **test_tool_call_batch_case**: Test batch processing of tool calls.
- **test_basic_multi_process_init**: Validate Ray global actor behaves as singleton.
- **TestSingleNodeRateLimiterCase**: Verify rate limiter works in single-node mode.
- **test_rotten_execution**: Ensure rate limiter recovers from function errors.
- **TestMultiNodeRateLimiterCase**: Verify behavior in multi-node environments.
e2e Tests
----------
We provide an e2e test script in the ``tests/special_e2e`` folder: ``tests/special_e2e/run_gsm8k_fsdp_sgl_multiturn_sf_tool.sh``.
By setting ``trainer.rollout_data_dir`` you can dump the rollout data to local disk. Here is a sample taken from the rollout data:
.. code-block:: python
{
"input": "
system\nYou are a math expert. You are given a question and you need to solve it step by step. Reasoning step by step before any tool call. You should use the `calc_gsm8k_reward` tool after step by step solving the question, before generate final answer at least once and refine your answer if necessary. Put your final answer in the format of `#### <answer>`.\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"code_interpreter\", \"description\": \"A tool for executing code.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"code\": {\"type\": \"string\", \"description\": \"The code to execute.\", \"enum\": null}}, \"required\": [\"code\"]}, \"strict\": false}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n
user\nJohn gets a bonus that's the same percentage every year. Last year he made $100,000 and got a $10,000 bonus. This year he makes $200,000. How much will John make this year when adding both his total pay and bonus together? Let's think step by step and output the final answer after `####`.\n
assistant\n",
"output": "assistant\n<think>\nOkay, let's see. John gets a bonus that's the same percentage every year. Last year, he made $100,000 and got a $10,000 bonus. This year, he makes $200,000. The question is asking how much he'll make this year when adding both his total pay and bonus together.\n\nFirst, I need to figure out the percentage bonus he gets. Since the bonus is the same percentage each year, if I can find that percentage based on last year's details, I can apply it to this year's earnings to find this year's bonus. Then, I'll add that bonus to his total pay this year.\n\nLast year, he earned $100,000 and got a $10,000 bonus. To find the percentage, I can divide the bonus by his total earnings. So, 10,000 divided by 100,000. Let me calculate that.\n\n10,000 / 100,000 equals 0.1, which is 10% as a percentage. So, his bonus is 10% of his pay each year.\n\nThis year, he makes $200,000. So, the bonus this year would be 10% of 200,000. Let me compute that using the code interpreter to ensure accuracy.\n
<tool_call>\n{\"name\": \"code_interpreter\", \"arguments\": {\"code\": \"total_pay_this_year = 200000\\nbonus_percentage = 10 / 100\\nbonus_this_year = total_pay_this_year * bonus_percentage\\ntotal_income_this_year = total_pay_this_year + bonus_this_year\\nprint(total_income_this_year)\", \"executes\": \"True\"}}\n</tool_call>\n
tool\n220000.0\n\n
assistant\n<think>\nSo the code shows that John's total income this year would be $220,000. This is consistent with the 10% bonus on last year's pay of $100,000, which was $10,000. Therefore, this year's bonus of $20,000 added to his total pay of $200,000 results in a total income of $220,000.\n\nI need to verify that there are no other factors at play. The problem states it's a constant percentage bonus each year, so there's no mention of changing percentages or other variables. Therefore, applying a straightforward percentage of 10% on this year's partial salary amount would result in $20,000. Adding that to his total pay gives $220,000. The calculation is correct and aligns with the question's parameters.\n</think>\n\n#### 220000.0",
"score": 0,
"step": 1
}
Here is the same sample in a readable format:
.. code-block:: python
[system]
You are a math expert. You are given a question and you need to solve it step by step. Reasoning step by step before any tool call. You should use the `calc_gsm8k_reward` tool after step by step solving the question, before generate final answer at least once and refine your answer if necessary. Put your final answer in the format of `#### <answer>`.
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "code_interpreter", "description": "A tool for executing code.", "parameters": {"type": "object", "properties": {"code": {"type": "string", "description": "The code to execute.", "enum": null}}, "required": ["code"]}, "strict": false}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
[user]
John gets a bonus that's the same percentage every year. Last year he made $100,000 and got a $10,000 bonus. This year he makes $200,000. How much will John make this year when adding both his total pay and bonus together? Let's think step by step and output the final answer after `####`.
[assistant]
<think>
Okay, let's see. John gets a bonus that's the same percentage every year. Last year, he made $100,000 and got a $10,000 bonus. This year, he makes $200,000. The question is asking how much he'll make this year when adding both his total pay and bonus together.
First, I need to figure out the percentage bonus he gets. Since the bonus is the same percentage each year, if I can find that percentage based on last year's details, I can apply it to this year's earnings to find this year's bonus. Then, I'll add that bonus to his total pay this year.
Last year, he earned $100,000 and got a $10,000 bonus. To find the percentage, I can divide the bonus by his total earnings. So, 10,000 divided by 100,000. Let me calculate that.
10,000 / 100,000 equals 0.1, which is 10% as a percentage. So, his bonus is 10% of his pay each year.
This year, he makes $200,000. So, the bonus this year would be 10% of 200,000. Let me compute that using the code interpreter to ensure accuracy.
<tool_call>
{"name": "code_interpreter", "arguments": {"code": "total_pay_this_year = 200000\nbonus_percentage = 10 / 100\nbonus_this_year = total_pay_this_year * bonus_percentage\ntotal_income_this_year = total_pay_this_year + bonus_this_year\nprint(total_income_this_year)", "executes": "True"}}
</tool_call>
[tool]
220000.0
[assistant]
<think>
So the code shows that John's total income this year would be $220,000. This is consistent with the 10% bonus on last year's pay of $100,000, which was $10,000. Therefore, this year's bonus of $20,000 added to his total pay of $200,000 results in a total income of $220,000.
I need to verify that there are no other factors at play. The problem states it's a constant percentage bonus each year, so there's no mention of changing percentages or other variables. Therefore, applying a straightforward percentage of 10% on this year's partial salary amount would result in $20,000. Adding that to his total pay gives $220,000. The calculation is correct and aligns with the question's parameters.
</think>
#### 220000.0
You can also use the `RolloutViewer` TUI tool to view the dumped rollout data:
.. code-block:: bash
python scripts/rollout_viewer.py ${trainer.rollout_data_dir}
.. image:: https://github.com/user-attachments/assets/e34e5157-2880-4a21-afb2-73885d0dfb11
:alt: RolloutViewer screenshot
=======================
Search Tool Integration
=======================
Last updated: 05/30/2025.
Introduction
------------
- We have added a search tool calling function to Multi-Turn RL, enabling the model to initiate retrieval requests during Actor rollout and directly use retrieval results for training. **We support using a local dense retriever as the retrieval tool, as well as integrating with your own local retrieval engine.**
Quick Reproduction
------------------
Create a New Docker Container
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
docker run \
-it \
--shm-size 32g \
--gpus all \
-v {Huggingface-Cache-Path}:/root/.cache \
--ipc=host \
--network=host \
--privileged \
--name sglang_{your-name} \
lmsysorg/sglang:dev \
/bin/zsh
If you need to restart after exiting the container:
.. code:: bash
docker start -i sglang_{your-name}
Update Python and Configure the Virtual Environment using uv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
apt update
apt install -y python3.10 python3.10-venv
# Create a virtual environment
python3 -m venv ~/.python/verl-multiturn-rollout
# Activate the virtual environment
source ~/.python/verl-multiturn-rollout/bin/activate
# Install uv
python3 -m pip install uv
Install verl Upstream
~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
cd ~
git clone https://github.com/volcengine/verl.git
cd verl
# Install verl
python3 -m uv pip install .
python3 -m uv pip install -r ./requirements_sglang.txt
# Manually install flash-attn
python3 -m uv pip install wheel
python3 -m uv pip install packaging
python3 -m uv pip install flash-attn --no-build-isolation --no-deps
Set Up a Local Retrieval Engine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are using your own local retrieval service, you can skip this
step. We chose the local dense retriever provided in the search-R1
example; detailed instructions are in the `searchR1
docs <https://raw.githubusercontent.com/PeterGriffinJin/Search-R1/refs/heads/main/docs/retriever.md>`__.
In brief:
- The GPU version offers higher accuracy and speed; each GPU uses about
5–7 GB of memory.
- The CPU version can be used for simple testing but has lower
retrieval precision, which will degrade training performance. See the
`retriever
documentation <https://github.com/PeterGriffinJin/Search-R1/blob/main/docs/retriever.md>`__
in search-R1 for details.
- We recommend using Conda to install ``faiss-gpu=1.8.0``; venv may cause errors.
**Note**: To start both the training process and the local retrieval
service, we launch two separate Python environments. The training uses
uv in the verl-multiturn-rollout environment, while the retriever uses
conda to install ``faiss-gpu``.
.. code:: bash
# Download the Miniconda installer script
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
# Install to $HOME/miniconda3 in batch mode
bash ~/miniconda.sh -b -p $HOME/miniconda3
# Activate conda (only in the current shell)
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
# (Optional) Add conda to your default shell startup
conda init
# Reload shell config
source ~/.bashrc
# Create and activate the retriever environment with Python 3.10
conda create -n retriever python=3.10 -y
conda activate retriever
# Install PyTorch (with GPU support) and related libraries
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Install other Python packages
pip install transformers datasets pyserini huggingface_hub
# Install the GPU version of faiss
conda install faiss-gpu=1.8.0 -c pytorch -c nvidia -y
# Install the API service framework
pip install uvicorn fastapi
Download the Indexing and Corpus
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The local retrieval files are large, so prepare sufficient disk space.
The download is about 60–70 GB, and the data takes about 132 GB once uncompressed:
.. code:: bash
conda activate retriever
save_path=/the/path/to/save
python examples/sglang_multiturn/search_r1_like/local_dense_retriever/download.py --save_path $save_path
cat $save_path/part_* > $save_path/e5_Flat.index
gzip -d $save_path/wiki-18.jsonl.gz
Start the Local flat e5 Retrieval Server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. The first startup will download models and load the index.
2. Apart from the download, startup takes about 1–2 minutes.
3. After startup, each GPU uses about 5–7 GB of memory, leaving the rest
for multi-turn RL training.
.. code:: bash
conda activate retriever
index_file=$save_path/e5_Flat.index
corpus_file=$save_path/wiki-18.jsonl
retriever_name=e5
retriever_path=intfloat/e5-base-v2
python examples/sglang_multiturn/search_r1_like/local_dense_retriever/retrieval_server.py \
--index_path $index_file \
--corpus_path $corpus_file \
--topk 3 \
--retriever_name $retriever_name \
--retriever_model $retriever_path \
--faiss_gpu
Set Up WANDB_API_KEY
~~~~~~~~~~~~~~~~~~~~
.. code:: bash
export WANDB_API_KEY={YOUR_WANDB_API_KEY}
# Define a timestamp function
function now() {
date '+%Y-%m-%d-%H-%M'
}
**Preprocess the Dataset**
~~~~~~~~~~~~~~~~~~~~~~~~~~
**Note:** The following data processing and training commands must be
run in the verl-multiturn-rollout environment.
.. code:: bash
python3 examples/data_preprocess/preprocess_search_r1_dataset.py
Testing on 8 x H20
~~~~~~~~~~~~~~~~~~
.. code:: bash
# Ensure the now() function is defined
# Create a logs directory
mkdir -p logs
# Set GPUs and run with a suitable log path
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
nohup bash examples/sglang_multiturn/search_r1_like/run_qwen2.5-3b_instruct_search_multiturn.sh \
trainer.experiment_name=qwen2.5-3b-it_rm-searchR1-like-sgl-multiturn-$(now) \
> logs/searchR1-like$(now).log 2>&1 &
Custom Search Configuration
---------------------------
To enable multi-turn reasoning, set the following fields in your config:
.. code:: yaml
actor_rollout_ref:
rollout:
name: "sglang"
multi_turn:
enable: True
You must specify ``retrieval_service_url`` in ``examples/sglang_multiturn/config/tool_config/search_tool_config.yaml``, and properly configure concurrency. For more details on concurrency, refer to the Sandbox Fusion example:
.. code:: yaml
tools:
- class_name: verl.tools.search_tool.SearchTool
config:
retrieval_service_url: http://127.0.0.1:8000/retrieve
num_workers: 120
rate_limit: 120
timeout: 30
The retriever input/output formats are as follows. If your service
parameters match, only modify ``retrieval_service_url``. You can also
customize in ``search_r1_like_utils.py``.
.. code:: python
Input format:
{
"queries": ["What is Python?", "Tell me about neural networks."],
"topk": 3,
"return_scores": true
}
Output format (when return_scores=True, similarity scores are returned):
{
"result": [
[ # Results for each query
{
"document": doc, "score": score
},
# ... more documents
],
# ... results for other queries
]
}
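
For a quick sanity check of a running retrieval server, a minimal client sketch
like the following can be used. It assumes the server is reachable at
``http://127.0.0.1:8000/retrieve`` (the URL taken from the sample tool config above)
and that the ``requests`` package is installed:

.. code:: python

    import requests

    # Payload mirrors the documented input format above.
    payload = {
        "queries": ["What is Python?", "Tell me about neural networks."],
        "topk": 3,
        "return_scores": True,
    }

    # The endpoint URL is an assumption taken from search_tool_config.yaml.
    resp = requests.post("http://127.0.0.1:8000/retrieve", json=payload, timeout=30)
    resp.raise_for_status()

    # Each query gets back a list of {"document": ..., "score": ...} entries.
    for query, docs in zip(payload["queries"], resp.json()["result"]):
        print(query)
        for item in docs:
            print("  score:", item["score"], "document:", item["document"])
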
Notes
-----
1. The total training time is about 27 hours. The validation
   dataset is very large (51k samples), and each validation run takes about 6000 s,
   so ``val_before_train=False`` is set by default.
The Design of ``verl.single_controller``
==============================================
Last updated: 05/21/2025.
**Author:**\ `Wang Zhang <https://github.com/zw0610>`__
Preface
-------
We prepared this document for developers of ``verl``, particularly those
interested in understanding or contributing to the
``verl.single_controller`` module. It is not intended for end users, but
for contributors seeking to understand the architectural rationale and
internal mechanics.
--------------
Origin
------
The ``single_controller`` module originated from a request I received —
to adapt a toy single-process RLHF script into a distributed system with
minimal changes, while maintaining ease of debugging.
Common practice — such as using PyTorch’s Distributed Data Parallel
(DDP) — typically involves wrapping ``nn.Module`` and launching multiple
processes that execute the same function under different ranks. However,
this approach presents two main limitations in the context of
distributed RLHF:

- difficulty representing the multiple DAGs required by PPO;
- difficulty inspecting intermediate tensors during training.
To maintain debuggability, we opted for a different approach — breaking
the training loop into well-defined stages like ``generate_sequences``,
``compute_advantages``, and so on.
We selected `Ray <https://www.ray.io/>`__ as the initial backend for
``verl`` due to its ability to expose Python class methods as RPC
endpoints. However, Ray’s default model only supports **one method call,
one RPC**, while training LLMs typically requires coordination across
multiple processes.
To hide this fan-out of a single method call into multiple Ray actor
invocations from users, we introduced the following components:
- ``WorkerGroup`` – manages a group of remote workers and provides
a unified interface for multi-process distributed computation;
- ``ResourcePool`` – binds computational resources to worker
processes;
- ``ClassWithArgs`` – enables delayed remote instantiation with
specified initialization arguments.
--------------
A Running Example: ``generate_sequences``
-----------------------------------------
To illustrate the design, we walk through how the ``generate_sequences``
method in the ``ActorRolloutRefWorker`` class is registered and invoked
across distributed workers.
--------------
Step 1: Register with a Decorator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The first step is to define the ``generate_sequences`` method and decorate it
with ``@register``, since it will be called from the driver script.
**Source:**
`fsdp_workers.py <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/workers/fsdp_workers.py#L528>`__
.. code:: python
class ActorRolloutRefWorker(Worker):
...
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def generate_sequences(self, prompts: DataProto):
prompts = prompts.to(torch.cuda.current_device())
...
The ``@register`` decorator adds metadata to the ``generate_sequences``
method. Currently, it doesn’t alter functionality, but attaches
attributes via a magic key (``MAGIC_ATTR``):
**Source:**
`decorator.py <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L411>`__
.. code:: python
def register(dispatch_mode=Dispatch.ALL_TO_ALL, execute_mode=Execute.ALL, blocking=True, materialize_futures=True):
...
def decorator(func):
@wraps(func)
def inner(*args, **kwargs):
if materialize_futures:
args, kwargs = _materialize_futures(*args, **kwargs)
return func(*args, **kwargs)
attrs = {"dispatch_mode": dispatch_mode, "execute_mode": execute_mode, "blocking": blocking}
setattr(inner, MAGIC_ATTR, attrs)
return inner
return decorator
As the code shows, the values of ``dispatch_mode``, ``execute_mode`` and
``blocking`` are attached to the ``generate_sequences`` method.
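
As a self-contained illustration of this pattern (a simplified sketch, not verl's
actual code; the ``MAGIC_ATTR`` value below is hypothetical), the snippet shows how
metadata attached via ``setattr`` can later be read back with ``getattr`` without
changing the method's behavior:

.. code:: python

    from functools import wraps

    MAGIC_ATTR = "magic_attrs"  # hypothetical key; verl defines its own constant

    def register(dispatch_mode="ALL_TO_ALL", execute_mode="ALL", blocking=True):
        def decorator(func):
            @wraps(func)
            def inner(*args, **kwargs):
                return func(*args, **kwargs)
            # Attach metadata; the wrapped method's behavior is unchanged.
            attrs = {"dispatch_mode": dispatch_mode, "execute_mode": execute_mode, "blocking": blocking}
            setattr(inner, MAGIC_ATTR, attrs)
            return inner
        return decorator

    class ToyWorker:
        @register(dispatch_mode="DP_COMPUTE_PROTO")
        def generate_sequences(self, prompts):
            return [p.upper() for p in prompts]

    # Later, a WorkerGroup-like binder can inspect the attached metadata.
    method = getattr(ToyWorker, "generate_sequences")
    print(getattr(method, MAGIC_ATTR))                 # {'dispatch_mode': 'DP_COMPUTE_PROTO', ...}
    print(ToyWorker().generate_sequences(["hello"]))   # behavior unchanged: ['HELLO']
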
--------------
Step 2: Binding During Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These attached attributes are extracted and utilized when
``ActorRolloutRefWorker``, wrapped in a ``RayClassWithArgs``, is passed
into a ``RayWorkerGroup``.
**Source:**
`main_generation.py <https://github.com/volcengine/verl/blob/4ae9a0fdab229f75f080e9478807783ed4c97154/verl/trainer/main_generation.py#L82>`__
.. code:: python
ray_cls_with_init = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker), config=config, role="rollout")
resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes)
wg = RayWorkerGroup(resource_pool=resource_pool, ray_cls_with_init=ray_cls_with_init)
During the
`initialization <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/ray/base.py#L184>`__
of ``RayWorkerGroup``, two key steps occur:
1. Worker instances (Ray actors) are created:
`RayWorkerGroup._init_with_resource_pool <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/ray/base.py#L211>`__
2. Methods decorated with ``@register`` are bound to ``RayWorkerGroup``:
`RayWorkerGroup._bind_worker_method <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/ray/base.py#L214>`__
.. figure:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/worker_group_init.png?raw=true
:alt: initialization_and_binding_of_worker_group
initialization_and_binding_of_worker_group
The binding procedure is the heart of ``verl.single_controller``.
**Key function:**
`WorkerGroup._bind_worker_method <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/worker_group.py#L143>`__
.. code:: python
def _bind_worker_method(self, user_defined_cls, func_generator):
...
for method_name in dir(user_defined_cls):
try:
method = getattr(user_defined_cls, method_name)
assert callable(method)
except Exception:
continue # Skip properties
<<<to be continue 1>>>
When a method has the ``MAGIC_ATTR``, the attributes set by
``@register`` are extracted:
.. code:: python
<<<continue 1>>>
if hasattr(method, MAGIC_ATTR):
attribute = getattr(method, MAGIC_ATTR)
dispatch_mode = attribute["dispatch_mode"]
execute_mode = attribute["execute_mode"]
blocking = attribute["blocking"]
<<<to be continue 2>>>
As shown in the flow chart above, these attributes are fed into
``func_generator``. However, ``func_generator`` takes ``method_name``,
``dispatch_fn``, ``collect_fn``, ``execute_fn``, and ``blocking``. We need
to find the ``dispatch_fn`` and ``collect_fn`` associated
with the ``dispatch_mode`` (``DP_COMPUTE_PROTO``) in
`DISPATCH_MODE_FN_REGISTRY <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L387>`__:
.. code:: python3
DISPATCH_MODE_FN_REGISTRY = {
Dispatch.ONE_TO_ALL: {
"dispatch_fn": dispatch_one_to_all,
"collect_fn": collect_all_to_all,
},
...
Dispatch.DP_COMPUTE_PROTO: {
"dispatch_fn": dispatch_dp_compute_data_proto,
"collect_fn": collect_dp_compute_data_proto,
},
...
}
Similarly, the ``execute_fn`` is selected by ``execute_mode`` and
extracted by:
.. code:: python
<<<continue 2>>>
# get execute_fn_name
execute_mode = get_predefined_execute_fn(execute_mode=execute_mode)
wg_execute_fn_name = execute_mode["execute_fn_name"]
# get execute_fn from string
try:
execute_fn = getattr(self, wg_execute_fn_name)
assert callable(execute_fn), "execute_fn must be callable"
except Exception:
print(f"execute_fn {wg_execute_fn_name} is invalid")
raise
<<<to be continue 3>>>
In the ``generate_sequences`` case:

- ``dispatch_mode = Dispatch.DP_COMPUTE_PROTO``
- ``dispatch_fn = dispatch_dp_compute_data_proto``
- ``collect_fn = collect_dp_compute_data_proto``
- ``execute_fn = RayWorkerGroup.execute_all``

ONE_TO_ALL vs. DP_COMPUTE_PROTO
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``dispatch_mode`` is associated with a ``dispatch_fn`` and a
``collect_fn``. As the name implies, ``dispatch_fn`` processes the input
arguments in the ``WorkerGroup`` and generates a batch (list) of input
arguments, each of which is fed into one of the workers attached to the
``WorkerGroup``.
``dispatch_fn`` of ``ONE_TO_ALL`` is
`dispatch_one_to_all <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L119>`__,
which just duplicates all the input arguments into N replicas, where N
equals the number of Workers attached to the ``worker_group``:
.. code:: python
def dispatch_one_to_all(worker_group, *args, **kwargs):
args = tuple([arg] * worker_group.world_size for arg in args)
kwargs = {k: [v] * worker_group.world_size for k, v in kwargs.items()}
return args, kwargs
``dispatch_fn`` of ``DP_COMPUTE_PROTO`` is
`dispatch_dp_compute_data_proto <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L350>`__,
which uses ``DataProto.chunk`` to split a large ``DataProto`` into N
smaller ``DataProto`` objects, where N equals the world_size (number of
workers) of the ``worker_group``:
.. code:: python
def dispatch_dp_compute_data_proto(worker_group, *args, **kwargs):
from verl.single_controller.base.worker_group import WorkerGroup
assert isinstance(worker_group, WorkerGroup)
# Note: enable auto padding for dp compute DatapProto
splitted_args, splitted_kwargs = _split_args_kwargs_data_proto_with_auto_padding(
worker_group.world_size,
*args,
**kwargs,
)
return splitted_args, splitted_kwargs
The ``collect_fn`` follows the same pattern: it processes a batch (list)
of values returned from all workers of a ``WorkerGroup`` and merges them
into a list, as ``collect_all_to_all`` does, or into a large ``DataProto``,
as ``collect_dp_compute_data_proto`` does.
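
As a toy illustration of the dispatch/collect pairing (a simplified sketch, not
verl's actual implementation), the snippet below duplicates inputs ``ONE_TO_ALL``-style
and collects per-worker outputs either as a raw list or by concatenation, which is
roughly what the ``DataProto``-based collector does with real batches:

.. code:: python

    class ToyWorkerGroup:
        """Stands in for a WorkerGroup; only world_size is needed here."""
        def __init__(self, world_size):
            self.world_size = world_size

    def dispatch_one_to_all(worker_group, *args, **kwargs):
        # Duplicate every argument once per worker.
        args = tuple([arg] * worker_group.world_size for arg in args)
        kwargs = {k: [v] * worker_group.world_size for k, v in kwargs.items()}
        return args, kwargs

    def collect_all_to_all(worker_group, outputs):
        # Keep the raw list of per-worker outputs.
        return outputs

    def collect_concat(worker_group, outputs):
        # Roughly what collect_dp_compute_data_proto does, but for plain
        # Python lists instead of DataProto objects.
        merged = []
        for out in outputs:
            merged.extend(out)
        return merged

    wg = ToyWorkerGroup(world_size=2)
    args, kwargs = dispatch_one_to_all(wg, ["prompt-a", "prompt-b"])
    print(args)                                          # the same batch, duplicated per worker
    print(collect_concat(wg, [["resp-0"], ["resp-1"]]))  # ['resp-0', 'resp-1']
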
Finally, a new method is dynamically generated using ``func_generator``
and added to the ``WorkerGroup`` instance:
.. code:: python
<<<continue 3>>>
# bind a new method to the RayWorkerGroup
func = func_generator(
self,
method_name,
dispatch_fn=dispatch_fn,
collect_fn=collect_fn,
execute_fn=execute_fn,
blocking=blocking,
)
try:
setattr(self, method_name, func)
method_names.append(method_name)
except Exception as e:
raise ValueError(f"Fail to set method_name {method_name}") from e
This makes the method invocable via the ``WorkerGroup`` interface.
--------------
Step 3: Call Chain
~~~~~~~~~~~~~~~~~~
All the machinery above ensures that distributed calls feel identical to
single-process ones. In the original single-process script, the code
looks like:
.. code:: python
rollout = Rollout()
rollout.generate_sequences(batch)
With ``verl``, the multiprocess program becomes:
.. code:: python
    rollout = RayWorkerGroup(RayResourcePool([4]), RayClassWithInitArgs(Rollout))
    rollout.generate_sequences(batch)
.. figure:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/call_generate_sequences.png?raw=true
:alt: call_chain_of_generate_sequences
call_chain_of_generate_sequences
Behind this simple call:

- ``dispatch_fn`` splits the input across workers
- ``execute_fn`` performs the actual remote invocation
- ``collect_fn`` gathers the results
All of this is abstracted away, enabling developers to write distributed
code with minimal changes to their existing logic.
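
To make the whole pipeline concrete, here is a minimal, self-contained sketch
(plain Python, no Ray; names are illustrative, not verl's actual code) of how a
``func_generator``-style binder can turn per-worker methods into a single
``WorkerGroup`` method by composing dispatch, execute and collect:

.. code:: python

    class ToyWorkerGroup:
        def __init__(self, workers):
            self.workers = workers
            self.world_size = len(workers)

        def execute_all(self, method_name, batched_args):
            # One "RPC" per worker; here it is just a local call.
            return [getattr(w, method_name)(arg) for w, arg in zip(self.workers, batched_args)]

    def func_generator(wg, method_name, dispatch_fn, collect_fn, execute_fn):
        def bound(*args):
            batched = dispatch_fn(wg, *args)          # split the input across workers
            outputs = execute_fn(method_name, batched)  # invoke every worker
            return collect_fn(wg, outputs)            # merge the per-worker results
        return bound

    class ToyRollout:
        def generate_sequences(self, prompts):
            return [p + " -> generated" for p in prompts]

    wg = ToyWorkerGroup([ToyRollout(), ToyRollout()])

    # Chunk the batch across the two workers, then concatenate the results.
    dispatch_fn = lambda wg, batch: [batch[: len(batch) // 2], batch[len(batch) // 2:]]
    collect_fn = lambda wg, outs: [item for out in outs for item in out]

    setattr(wg, "generate_sequences",
            func_generator(wg, "generate_sequences", dispatch_fn, collect_fn, wg.execute_all))

    print(wg.generate_sequences(["q1", "q2", "q3", "q4"]))
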
--------------
Beyond RL Post-Training: Generalizing ``verl.single_controller``
----------------------------------------------------------------
The ``verl.single_controller`` module generalizes well beyond
reinforcement learning. It provides a clean abstraction to batch-process
remote method calls, with automatic input/output handling.
By minimizing the gap between single-process and multi-process scripts,
``verl.single_controller`` opens the door to distributed computing in
broader domains — not limited to RL post-training.
We hope this design inspires more examples and extensions from the
community.
Agentic RL Training
===================
Last updated: 07/15/2025.
Overview
----------
The goal of Agentic RL is to improve the performance of the backend model that powers an Agent through reinforcement learning. To support this training process, a series of features have been developed:
1. Server-based asynchronous rollout
2. Multi-turn conversations and tool calls
3. LangGraph-based Agent
This document explains the underlying system design and its usage to help users implement Agentic RL.
Server-based Asynchronous Rollout
---------------------------------
Since Agents need to interact with the environment through various tool calls, an asyncio-based coroutine mechanism is used to execute each rollout request asynchronously, which avoids GPU idling while waiting for tool-call results and thereby improves training performance. To support asynchronous rollout, the inference engine (server) and the agent (client) are architecturally separated, implementing a server-based system with the following objectives:

1. Enabling load-balancing mechanisms across multiple GPUs and reducing the impact of long-tail requests on performance. For this purpose, scheduling capabilities in stream mode (recipe/stream_mode) are implemented as a recipe.
2. Preventing agent-specific features, such as tracing, from affecting the inference engine.
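
The following minimal sketch (plain asyncio, independent of verl's actual classes)
illustrates the idea described above: while one rollout request awaits a slow tool
call, other requests keep the inference server busy instead of leaving the GPU idle.

.. code-block:: python

    import asyncio

    async def call_tool(query: str) -> str:
        # Stand-in for a slow external tool (e.g. a retrieval or code-execution service).
        await asyncio.sleep(1.0)
        return f"tool result for {query!r}"

    async def generate(prompt: str) -> str:
        # Stand-in for an asynchronous inference-server call.
        await asyncio.sleep(0.1)
        return f"response to {prompt!r}"

    async def rollout(prompt: str) -> str:
        first_turn = await generate(prompt)
        tool_result = await call_tool(first_turn)   # the engine is free to serve others here
        return await generate(prompt + tool_result)

    async def main():
        # All rollout requests run concurrently as coroutines.
        results = await asyncio.gather(*(rollout(f"question-{i}") for i in range(8)))
        print(len(results), "rollouts finished")

    asyncio.run(main())
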
System Architecture
~~~~~~~~~~~~~~~~~~~
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop.png?raw=true
For more detail on internal design, please refer to :doc:`Agent Loop<../advance/agent_loop>`.
System Components
~~~~~~~~~~~~~~~~~
+--------------------------+----------------------------------------------------------------------------+
| Component | Role |
+==========================+============================================================================+
| AgentLoop | Client, implements Agent functions |
+--------------------------+----------------------------------------------------------------------------+
| AsyncLLMServerManager | Inference gateway, provides generate interface for AgentLoop |
+--------------------------+----------------------------------------------------------------------------+
| AsyncServer | Server, each instance is connected to one DP group of the inference engine |
+--------------------------+----------------------------------------------------------------------------+
**"generate" Interface**
The "generate" function based on ray actor is used between the Client and Server instead of the standard chat completion API. This is because the conversion between tokens and text can be irreversible. For example, the token converted from "<think>" will be different from that generated by the LLM. During the training phase, it is necessary to strictly use the tokens generated by LLM inference to avoid inaccurate in computing advantage, which may affect model performance. Having the Server provide a token-based API helps the Client maintain the relationship between the text generated by tool calls and the tokens returned by the LLM, so as to output correct tokens for training.
**Inference Engine Adaptation**
AsyncServer uniformly provides a generate function to the upper layer, with separate implementations for SGLang and vLLM to hide underlying differences:
1. The SGLang AsyncServer uses the async_generate interface of the SGLang engine, which is located on the first GPU of each TP group. Therefore, AsyncServer needs to remotely call async_generate through ray actor.
2. The vLLM AsyncServer uses the generate interface of the vLLM engine, which can communicate with the GPUs in the TP group through ZMQ and can be directly called in AsyncServer.
Usage Example
~~~~~~~~~~~~~
Follow :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints.
There are two options required to use agent loop:
- `data.return_raw_chat=True`
- `actor_rollout_ref.rollout.mode=async`
This example uses the SGLang inference engine by default; you can also change the rollout name to use vLLM.
.. code-block:: bash
bash examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
Multi-turn Conversations and Tool Calls
---------------------------------------
Follow :doc:`Multi-turn Rollout Support<../sglang_multiturn/multiturn>` to prepare tool and configuration files.
The Tool Agent Loop has an additional requirement: an "agent_name" field must be added to the dataset. During rollout, it chooses between tool_agent_loop and single_turn_agent (the default) based on this field.
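
A minimal preprocessing sketch for adding such a field is shown below. The value
``"tool_agent"`` and the parquet paths are assumptions for illustration only;
check the preprocessing script in the usage example below for the exact value your
configuration expects.

.. code-block:: python

    import os
    import datasets

    # Load an existing preprocessed split (path is an assumption for illustration).
    path = os.path.expanduser("~/data/gsm8k/train.parquet")
    train_dataset = datasets.Dataset.from_parquet(path)

    def add_agent_name(example):
        # The rollout picks tool_agent_loop vs. single_turn_agent based on this field.
        example["agent_name"] = "tool_agent"   # assumed registered agent name
        return example

    train_dataset = train_dataset.map(add_agent_name)
    train_dataset.to_parquet(os.path.expanduser("~/data/gsm8k/train_with_agent_name.parquet"))
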
Usage Example
~~~~~~~~~~~~~
.. code-block:: bash
# install mlflow to view toolcall and llm trace
pip install mlflow
# This will download and preprocess the GSM8K dataset into ~/data/gsm8k/ and add the "agent_name" field.
python examples/data_preprocess/gsm8k_tool_agent_loop.py
    # Start training with tool calls and mlflow-based tracing enabled to help debug the rollout details
bash examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_tool_agent_mlflow.sh
# When training is done, start a mlflow server to view trace
mlflow ui -h 0.0.0.0 -p 5000 --backend-store-uri sqlite:////tmp/mlruns.db
# then you can open http://<your ip address>:5000 from browser to view trace
Note: During training, the model may occasionally fail to generate correct toolcall tags; in that case an error message "Failed to decode tool call" is printed to the console, which does not indicate an abnormality in training.
Follow :doc:`Rollout trace<../advance/rollout_trace>` to learn more about the trace feature.
Agent Framework
---------------
System Architecture
~~~~~~~~~~~~~~~~~~~
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/langgraph_agent.png?raw=true
System Components
~~~~~~~~~~~~~~~~~
+--------------------------+-----------------------------------------------------------------------------------------------+
| Component | Role |
+==========================+===============================================================================================+
| ChatModel | LLM object of LangChain, used to adapt to the “generate” api provided by AsyncLLMServerManager|
+--------------------------+-----------------------------------------------------------------------------------------------+
| RectAgentLoop | Agent adaptation layer, which by default supports a naive LangGraph Agentic. |
| | New classes can be derived to support user-defined Agents, and the run function needs to be |
| | implemented to complete Agent calls. |
+--------------------------+-----------------------------------------------------------------------------------------------+
| AsyncServer | Server, each instance is connected to one DP group of the inference engine. |
+--------------------------+-----------------------------------------------------------------------------------------------+
Follow doc "recipe/langgraph_agent/example/README.md" for more details.
Installation
============
Requirements
------------
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1
verl supports various backends. Currently, the following configurations are available:
- **FSDP** and **Megatron-LM** (optional) for training.
- **SGLang**, **vLLM** and **TGI** for rollout generation.
Choices of Backend Engines
----------------------------
1. Training:
We recommend using **FSDP** backend to investigate, research and prototype different models, datasets and RL algorithms. The guide for using FSDP backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.
For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.12.2 <https://github.com/NVIDIA/Megatron-LM/tree/core_v0.12.2>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
2. Inference:
For inference, vllm 0.8.3 and later versions have been tested for stability. We recommend turning on env var `VLLM_USE_V1=1` for optimal performance.
For SGLang, refer to the :doc:`SGLang Backend<../workers/sglang_worker>` for detailed installation and usage instructions. SGLang rollout is under extensive development and offers many advanced features and optimizations. We encourage users to report any issues or provide feedback via the `SGLang Issue Tracker <https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/106>`_.
The huggingface TGI integration is usually used for debugging and single-GPU exploration.
Install from docker image
-------------------------
We provide pre-built Docker images for quick setup. Starting from this version,
we use a new image release hierarchy for productivity and stability.
The images are divided into three categories:

- **Base Image**: Only basic dependencies are installed, without inference or training frameworks.
  You can install vLLM or SGLang directly on top of it, without reinstalling torch or CUDA.
- **Application Image**: Stable version with inference and training frameworks installed.
- **Community Image**: Unstable version with the latest frameworks and features.
The first two types of images are hosted in the `verlai/verl <https://hub.docker.com/r/verlai/verl>`_ repository on Docker Hub, while the community (preview) images are hosted in community repositories.
.. note::
The image versions are mapped with verl releases, for example, image with tag ``verl0.4`` is built for verl release ``v0.4.x``.
Base Image
::::::::::
The stable base images are ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4`` and ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4``, with different PyTorch versions for vLLM and SGLang. The installed package versions can be found in the image tags, and the Dockerfile can be found in ``docker/verl[version]-[packages]/Dockerfile.base``.
The base image is updated infrequently, and application images can be built on top of it without reinstalling the base packages.
Application Image
:::::::::::::::::
From this version, we provide separate images for vLLM and SGLang because of the divergence in dependent packages such as PyTorch and FlashInfer.
The following application images are available:
- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.5-vllm0.9.1-mcore0.12.2-te2.2``
- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.2-te2.2``
Docker images with the Megatron backend can run post-training for large language models such as ``Qwen/Qwen3-235B-A22B`` and ``deepseek-ai/DeepSeek-V3-0324``. Refer to the :doc:`Large Language Model Post-Training documentation<../perf/dpsk>` for more details.
Application images can be updated frequently, and the Dockerfile can be found in ``docker/verl[version]-[packages]/Dockerfile.app.[frameworks]``. Based on the base image, it is easy to build your own application image with the desired inference and training frameworks.
Community Image
:::::::::::::::
Community images are provided by the community and include the latest versions of vLLM and SGLang; they may contain experimental features or configurations. They also cover other hardware and platforms, such as AMD GPUs with ROCm, or AWS EFA and SageMaker.
For latest vLLM with FSDP, please refer to `hiyouga/verl <https://hub.docker.com/r/hiyouga/verl>`_ repository and the latest version is ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
For latest SGLang with FSDP, please refer to `ocss884/verl-sglang <https://hub.docker.com/r/ocss884/verl-sglang>`_ repository and the latest version is ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5`` which is provided by SGLang RL Group.
See files under ``docker/`` for NGC-based image or if you want to build your own.
Note that for AWS instances with an EFA network interface (SageMaker AI Pod),
you need to install the EFA driver as shown in ``docker/Dockerfile.extenstion.awsefa``.
Installation from Docker
::::::::::::::::::::::::
After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:
1. Launch the Docker image and attach to it:
.. code:: bash
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
2. If you use the images provided, you only need to install verl itself without dependencies:
.. code:: bash
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
[Optional] If you want to switch between different frameworks, you can install verl with the following command:
.. code:: bash
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
Install from custom environment
---------------------------------------------
We recommend using docker images for convenience. However, if your environment is not compatible with the docker images, you can also install verl in a Python environment.
Pre-requisites
::::::::::::::
For training and inference engines to make use of better and faster hardware support, CUDA/cuDNN and other dependencies are required.
Some of these dependencies are easily overridden when installing other packages,
so we put them in the :ref:`Post-installation` step.
.. note::
    The installation steps below are the recommended configuration for the latest version of verl.
    If you are customizing your own environment, you may relax these strict version constraints.
We need to install the following pre-requisites:
- **CUDA**: Version >= 12.4
- **cuDNN**: Version >= 9.8.0
- **Apex**
CUDA 12.4 or above is recommended, matching the docker image;
please refer to `NVIDIA's official website <https://developer.nvidia.com/cuda-toolkit-archive>`_ for other versions of CUDA.
.. code:: bash
    # change directory to anywhere you like; installing inside the verl source code directory is not recommended
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cuda-toolkit-12-4
update-alternatives --set cuda /usr/local/cuda-12.4
cuDNN can be installed via the following command;
please refer to `NVIDIA's official website <https://developer.nvidia.com/rdp/cudnn-archive>`_ for other versions of cuDNN.
.. code:: bash
    # change directory to anywhere you like; installing inside the verl source code directory is not recommended
wget https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cudnn-cuda-12
NVIDIA Apex is required for Megatron-LM and FSDP training.
You can install it via the following command, but note that this step can take a very long time.
It is recommended to set the ``MAX_JOBS`` environment variable to accelerate the installation,
but do not set it too large, otherwise the memory will be overloaded and your machine may hang.
.. code:: bash
    # change directory to anywhere you like; installing inside the verl source code directory is not recommended
git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
    MAX_JOBS=32 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
Install dependencies
::::::::::::::::::::
.. note::
    We recommend using a fresh conda environment to install verl and its dependencies.
**Notice that the inference frameworks often strictly limit your pytorch version and will directly override your installed pytorch if not paying enough attention.**
As a countermeasure, it is recommended to install inference frameworks first with the pytorch they needed. For vLLM, if you hope to use your existing pytorch,
please follow their official instructions
`Use an existing PyTorch installation <https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#build-wheel-from-source>`_ .
1. First of all, to manage environment, we recommend using conda:
.. code:: bash
conda create -n verl python==3.10
conda activate verl
2. Then, execute the installation script that we provide in verl:
.. code:: bash
# Make sure you have activated verl conda env
# If you need to run with megatron
bash scripts/install_vllm_sglang_mcore.sh
# Or if you simply need to run with FSDP
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
If you encounter errors in this step, please check the script and manually follow the steps in the script.
Install verl
::::::::::::
For installing the latest version of verl, the best way is to clone and
install it from source. Then you can modify our code to customize your
own post-training jobs.
.. code:: bash
git clone https://github.com/volcengine/verl.git
cd verl
pip install --no-deps -e .
Post-installation
:::::::::::::::::
Please make sure that the installed packages are not overridden during the installation of other packages.
The packages worth checking are:
- **torch** and torch series
- **vLLM**
- **SGLang**
- **pyarrow**
- **tensordict**
- **nvidia-cudnn-cu12**: for the Megatron backend
If you encounter package version issues while running verl, please update the outdated ones.
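
A small sketch like the following can be used to print the installed versions of
these packages after setup (the package names below are the usual distribution
names; adjust them for your environment):

.. code:: python

    from importlib.metadata import PackageNotFoundError, version

    # Distribution names as they typically appear on PyPI; adjust as needed.
    packages = ["torch", "vllm", "sglang", "pyarrow", "tensordict", "nvidia-cudnn-cu12"]

    for name in packages:
        try:
            print(f"{name}: {version(name)}")
        except PackageNotFoundError:
            print(f"{name}: not installed")
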
Install with AMD GPUs - ROCM kernel support
------------------------------------------------------------------
When you run on AMD GPUs (MI300) with the ROCm platform, you cannot use the previous quickstart to run verl. You need to follow the steps below to build a docker image and run it.
If you encounter any issues using AMD GPUs to run verl, feel free to contact `Yusheng Su <https://yushengsu-thu.github.io/>`_.
Find the docker for AMD ROCm: `docker/Dockerfile.rocm <https://github.com/volcengine/verl/blob/main/docker/Dockerfile.rocm>`_
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
.. code-block:: bash
# Build the docker in the repo dir:
# docker build -f docker/Dockerfile.rocm -t verl-rocm:03.04.2015 .
# docker images # you can find your built docker
FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
# Set working directory
# WORKDIR $PWD/app
# Set environment variables
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
# Install vllm
RUN pip uninstall -y vllm && \
rm -rf vllm && \
git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
cd vllm && \
MAX_JOBS=$(nproc) python3 setup.py install && \
cd .. && \
rm -rf vllm
# Copy the entire project directory
COPY . .
# Install dependencies
RUN pip install "tensordict<0.6" --no-deps && \
pip install accelerate \
codetiming \
datasets \
dill \
hydra-core \
liger-kernel \
numpy \
pandas \
datasets \
peft \
"pyarrow>=15.0.0" \
pylatexenc \
"ray[data,train,tune,serve]" \
torchdata \
transformers \
wandb \
orjson \
pybind11 && \
pip install -e . --no-deps
Build the image
::::::::::::::::::::::::
.. code-block:: bash
docker build -t verl-rocm .
Launch the container
::::::::::::::::::::::::::::
.. code-block:: bash
docker run --rm -it \
--device /dev/dri \
--device /dev/kfd \
-p 8265:8265 \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME/.ssh:/root/.ssh \
-v $HOME:$HOME \
--shm-size 128G \
-w $PWD \
verl-rocm \
/bin/bash
If you do not want to run in root mode and want to run as your own user,
please add ``-e HOST_UID=$(id -u)`` and ``-e HOST_GID=$(id -g)`` to the docker launch command above.
verl on AMD GPUs currently supports FSDP as the training engine, and vLLM and SGLang as the inference engines. We will support Megatron in the future.
More Resources
==============
Last updated: 06/30/2025.
- Introduction to verl (`Slides <https://tongyx361.github.io/blogs/posts/verl-intro>`_)
- verl Code Walkthrough (`Slides <https://tongyx361.github.io/blogs/posts/verl-tutorial>`_, `Talk in Chinese <https://hcqnc.xetlk.com/sl/3vACOK>`_)
Multinode Training
==================
Last updated: 06/10/2025.
.. _wuxibin89: https://github.com/wuxibin89
Author: `Xibin Wu <https://github.com/wuxibin89>`_, `Yusheng Su <https://yushengsu-thu.github.io/>`_.
Manual
------
Set up multinode ray cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Start the head node with ``ray start --head --dashboard-host=0.0.0.0``; there are two addresses you should care about:
- GCS address: ``ray start --address=<address>``, where worker node should connect to.
- Dashboard address: ``<address>:8265``, where you should submit job to the cluster.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/head.png?raw=true
2. Start the worker nodes with the ``ray start --address=<address>`` command you got above.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/worker.png?raw=true
3. Now you should see that the cluster has 2 nodes when you run ``ray status``.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/status.png?raw=true
4. Additionally, you can access the dashboard in a browser at the address you got above.
   *Firewall rules may need to be configured to access the dashboard; if you run into any trouble, please contact your network administrator.*
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/overview.png?raw=true
Submit job to ray cluster
~~~~~~~~~~~~~~~~~~~~~~~~~
1. Submit a ray job to the cluster with the dashboard address you got above.
.. code-block:: bash
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env=verl/trainer/runtime_env.yaml \
--no-wait \
-- \
python3 -m verl.trainer.main_ppo \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
...
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/submit.png?raw=true
2. Then you can check the job status with the following commands:
   - ``ray job list``: list all jobs submitted to the cluster.
   - ``ray job logs <Submission ID>``: query the logs of the job.
   - ``ray job status <Submission ID>``: query the status of the job.
   - ``ray job stop <Submission ID>``: request the job to be stopped.
3. You can also access driver/task/actor logs in ``/tmp/ray/session_latest/logs/``; the driver log is ``job-driver-raysubmit_<Submission ID>.log``.
4. We strongly recommend viewing job details from the dashboard during multinode training, because it provides a more structured way to inspect job information.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job.png?raw=true
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job_detail.png?raw=true
Slurm
-----
TBD
dstack
------
`dstackai/dstack <https://github.com/dstackai/dstack>`_ is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments
without the need to use K8S or Slurm.
Prerequisite
~~~~~~~~~~~~
Once dstack is `installed <https://dstack.ai/docs/installation>`_, initialize the directory as a repo with ``dstack init``.
.. code-block:: bash
mkdir myproject && cd myproject
dstack init
**Create a fleet**
Before submitting distributed training jobs, create a `dstack` `fleet <https://dstack.ai/docs/concepts/fleets>`_.
Run a Ray cluster task
~~~~~~~~~~~~~~~~~~~~~~
Once the fleet is created, define a Ray cluster task, e.g. in ``ray-cluster.dstack.yml``:
.. code-block:: yaml
type: task
name: ray-verl-cluster
nodes: 2
env:
- WANDB_API_KEY
- PYTHONUNBUFFERED=1
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
commands:
- git clone https://github.com/volcengine/verl
- cd verl
- pip install --no-deps -e .
- pip install hf_transfer hf_xet
- |
if [ $DSTACK_NODE_RANK = 0 ]; then
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
ray start --head --port=6379;
else
ray start --address=$DSTACK_MASTER_NODE_IP:6379
fi
# Expose Ray dashboard port
ports:
- 8265
resources:
gpu: 80GB:8
shm_size: 128GB
# Save checkpoints on the instance
volumes:
- /checkpoints:/checkpoints
Now, if you run this task via `dstack apply`, it will automatically forward Ray's dashboard port to `localhost:8265`.
.. code-block:: bash
dstack apply -f ray-cluster.dstack.yml
As long as `dstack apply` is attached, you can use `localhost:8265` to submit Ray jobs for execution.
Submit Ray jobs
~~~~~~~~~~~~~~~
Before you can submit Ray jobs, make sure `ray` is installed locally:
.. code-block:: shell
pip install ray
Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
.. code-block:: shell
    $ export RAY_ADDRESS=http://localhost:8265
$ ray job submit \
-- python3 -m verl.trainer.main_ppo \
data.train_files=/root/data/gsm8k/train.parquet \
data.val_files=/root/data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.project_name=ppo_training \
trainer.experiment_name=qwen-2.5-7B \
trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=10 \
trainer.test_freq=10 \
    trainer.total_epochs=15 \
    trainer.resume_mode=disable 2>&1 | tee verl_demo.log
For more details on how `dstack` works, check out its `documentation <https://dstack.ai/docs>`_.
How to debug?
---------------------
Ray Distributed Debugger VSCode Extension (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Starting with Ray 2.39, Anyscale has introduced the `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_ VSCode extension. Follow the extension’s installation instructions, then add your cluster using the dashboard URL you obtained earlier.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true
:alt: Ray Distributed Debugger VSCode extension screenshot
2. Prerequisites.
Ensure the following are installed (see the extension README for more detail):
- Visual Studio Code
- `ray[default]` >= 2.9.1
- `debugpy` >= 1.8.0
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/c7098b755ff689859837773a916c857.png?raw=true
:alt: VSCode with Ray prerequisites
3. Environment Variables.
To enable post‑mortem debugging, set:
.. code-block:: bash
export RAY_DEBUG_POST_MORTEM=1
.. admonition:: Note
:class: important
Be sure to remove any legacy flags before starting Ray:
- `RAY_DEBUG=legacy`
- `--ray-debugger-external`
4. Configuring Breakpoints. Set up ``breakpoint()`` in your code and submit the job to the cluster; the extension will then show the breakpoint information.
1. Insert `breakpoint()` calls into your remote functions.
2. Submit your job to the cluster.
The extension will detect active breakpoints and display them in VSCode.
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true
:alt: Detected breakpoint in VSCode
**Note:** Breakpoints are only supported inside functions decorated with `@ray.remote`.
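
A minimal ``job.py`` sketch that satisfies this requirement might look as follows
(the task and its contents are illustrative only):

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def my_task(x):
        y = x * 2
        breakpoint()   # the debugger can attach here once the task hits this line
        return y

    print(ray.get(my_task.remote(21)))
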
5. Launching the Debugger.
Run your job directly from the command line (do not use a `launch.json`):
.. code-block:: bash
python job.py
6. Attaching to a Breakpoint.
Once the process hits the first `breakpoint()`, click the Ray Distributed Debugger icon in the VSCode sidebar to attach the debugger.
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true
:alt: Attaching VSCode debugger to Ray process
7. Debugging With Multiple breakpoint().
For each subsequent task, first disconnect the current debugger session, then click the extension icon again to attach to the next breakpoint.
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/6e83c910a62c82fecb89c6619e001cd.png?raw=true
:alt: Disconnecting and reconnecting the debugger
Legacy Ray Debugger
~~~~~~~~~~~~~~~~~~~
1. Ray has a builtin legacy `debugger <https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html>`_ that allows you to debug your distributed applications. To enable debugger, start ray cluster with ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``.
.. code-block:: bash
# start head node
RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external
# start worker node
RAY_DEBUG=legacy ray start --address='10.124.46.192:6379' --ray-debugger-external
2. Set a breakpoint in your code, and submit the job to the cluster. Then run ``ray debug`` to wait at the breakpoint:
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/legacy.png?raw=true
Multi-node training on AMD clusters
---------------------------------------------------------------------------------------
If you want to run multi-node training with Slurm using a Docker/Podman container on an AMD cluster, you can use the following script.
If you encounter any issues using AMD GPUs to run verl, please contact `Yusheng Su <https://yushengsu-thu.github.io/>`_.
.. note::
1. You need to use ``podman`` or ``docker`` in the following script. We will release the apptainer script later.
2. If you want to use ``podman``, you just replace ``docker`` with ``podman`` in the following script.
The script includes the following steps:
1. SLURM Configuration
2. Environment Setup
3. Docker/Podman Container Setup
4. Ray Cluster Initialization
5. Data Preprocessing
6. Model Setup
7. Training Launch
slurm_script.sh
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=verl-ray-on-slurm
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=200G
#SBATCH --time=30-00:00:00
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=28
#SBATCH --output=../verl_log/slurm-%j.out
#SBATCH --error=../verl_log/slurm-%j.err
#SBATCH --nodelist=gpu-[0,1]
# load necessary modules
### Run this setup
# [Cluster]: Use docker
# docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
##########################################################################
###The following setting should be set in different project and cluster###
##########################################################################
### Project
CONTAINER_NAME="multinode_verl_training"
IMG="verl.rocm"
DOCKERFILE="docker/Dockerfile.rocm"
# echo $PWD
verl_workdir="${HOME}/projects/verl_upstream"
export TRANSFORMERS_CACHE="${HOME}/.cache/huggingface"
export HF_HOME=$TRANSFORMERS_CACHE
### Cluster Network Setting
export NCCL_DEBUG=TRACE
export GPU_MAX_HW_QUEUES=2
export TORCH_NCCL_HIGH_PRIORITY=1
export NCCL_CHECKS_DISABLE=1
# export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9
export NCCL_IB_GID_INDEX=3
export NCCL_CROSS_NIC=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_PROTO=Simple
export RCCL_MSCCL_ENABLE=0
export TOKENIZERS_PARALLELISM=false
export HSA_NO_SCRATCH_RECLAIM=1
##########################################################################
### For rocm and training script
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
# Build and launch the Docker container
srun bash -c "
# Exit on any error
set -e
# Clean up dangling images (images with <none> tag)
docker image prune -f
# Need to pull the docker first
docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
if ! docker images --format "{{.Repository}}:{{.Tag}}" | grep -q "${IMG}"; then
echo \"Building ${IMG} image...\"
docker build -f \"${DOCKERFILE}\" -t \"${IMG}\" .
else
echo \"${IMG} image already exists, skipping build\"
fi
# Removing old container if exists
docker rm \"${CONTAINER_NAME}\" 2>/dev/null || true
# Checking network devices
ibdev2netdev
# Launch the docker
docker run --rm -d \
-e HYDRA_FULL_ERROR=1 \
-e HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES} \
-e ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES} \
-e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
-e NCCL_DEBUG=${NCCL_DEBUG} \
-e GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES} \
-e TORCH_NCCL_HIGH_PRIORITY=${TORCH_NCCL_HIGH_PRIORITY} \
-e NCCL_CHECKS_DISABLE=${NCCL_CHECKS_DISABLE} \
-e NCCL_IB_HCA=${NCCL_IB_HCA} \
-e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX} \
-e NCCL_CROSS_NIC=${NCCL_CROSS_NIC} \
-e CUDA_DEVICE_MAX_CONNECTIONS=${CUDA_DEVICE_MAX_CONNECTIONS} \
-e NCCL_PROTO=${NCCL_PROTO} \
-e RCCL_MSCCL_ENABLE=${RCCL_MSCCL_ENABLE} \
-e TOKENIZERS_PARALLELISM=${TOKENIZERS_PARALLELISM} \
-e HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM} \
-e TRANSFORMERS_CACHE=${TRANSFORMERS_CACHE} \
-e HF_HOME=${HF_HOME} \
--network host \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v \${HOME}:\${HOME} \
-v \${HOME}/.ssh:/root/.ssh \
-w "${verl_workdir}" \
--shm-size 128G \
--name \"${CONTAINER_NAME}\" \
\"${IMG}\" \
tail -f /dev/null
echo \"Container setup completed\"
"
    # (Optional): If you do not want root mode and want to run as your own user,
    # please add `-e HOST_UID=$(id -u)` and `-e HOST_GID=$(id -g)` into the above docker launch script.
### Ray launch the nodes before training
# Getting the node names
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST" | tr '\n' ' '))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
# make sure we set environment variables before Ray initialization
# Print out all env variables
printenv
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--dashboard-port=8266 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10
# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Debug: Starting worker on node_i = ${node_i}"
if [ -z "$node_i" ]; then
echo "Error: Empty node name for worker $i"
continue
fi
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
docker exec "${CONTAINER_NAME}" \
ray start --address "$ip_head" --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
sleep 5
done
    # Ray initialization test (see whether there were any errors in the above execution)
echo "Testing Ray initialization in the slurm nodes..."
docker exec "${CONTAINER_NAME}" python3 -c '
import ray
try:
ray.init(address="auto")
print("\n=== Ray Cluster Status ===")
print(f"Number of nodes: {len(ray.nodes())}")
for node in ray.nodes():
print("Node: {}, Status: {}".format(node["NodeManagerHostname"], node["Alive"]))
# print(f"Node: {node}")
ray.shutdown()
print("Ray initialization successful!")
except Exception as e:
print(f"Ray initialization failed: {str(e)}")
'
echo "=== Ray test completed ==="
######
# Run data preprocessing
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/gsm8k.py" "--local_dir" "../data/gsm8k"
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/math_dataset.py" "--local_dir" "../data/math"
train_files="../data/gsm8k/train.parquet"
val_files="../data/gsm8k/test.parquet"
# Download and test model
echo "Loading model..."
docker exec "${CONTAINER_NAME}" \
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
MODEL_PATH="Qwen/Qwen2-7B-Instruct"
# Set model path after pipeline test
MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct"
echo "== Data and model loading Done =="
echo "Start to train..."
docker exec "${CONTAINER_NAME}" \
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
MODEL_PATH="Qwen/Qwen2-7B-Instruct"
PYTHONUNBUFFERED=1 srun --overlap --nodes=${SLURM_NNODES} --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
python3 -m verl.trainer.main_ppo \
data.train_files=$train_files \
data.val_files=$val_files \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=$MODEL_PATH \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.kl_ctrl.kl_coef=0.0001 \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=${SLURM_GPUS_PER_NODE} \
trainer.val_before_train=False \
trainer.nnodes=${SLURM_NNODES} \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15
Run multi-node training with the above slurm_script.sh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Simply submit your slurm_script.sh with ``sbatch``:
.. code-block:: bash
sbatch slurm_script.sh
.. _quickstart:
=========================================================
Quickstart: PPO training on GSM8K dataset
=========================================================
Post-train an LLM using the GSM8K dataset.
Introduction
------------
.. _hf_dataset_gsm8k: https://huggingface.co/datasets/gsm8k
In this example, we train an LLM to tackle the `GSM8k <hf_dataset_gsm8k_>`_ task with function-based rewards. [1]_
Prerequisite:
- the latest version of ``verl`` and its dependencies installed following the installation guide. Using the docker image is recommended.
- a GPU with at least 24 GB HBM
Dataset Introduction
--------------------
GSM8k is a math problem dataset. The prompt is an elementary school
problem. The LLM model is asked to solve the math problem. Below is an example:
Prompt
Katy makes coffee using teaspoons of sugar and cups of water in the
ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups
of water, calculate the number of teaspoonfuls of sugar she used.
Solution
The total ratio representing the ingredients she used to make the
coffee is 7+13 = <<7+13=20>>20 Since the fraction representing the
number of teaspoons she used is 7/20, she used 7/20\ *120 =
<<7/20*\ 120=42>>42 #### 42
Step 1: Prepare the dataset
----------------------------
We preprocess the dataset in parquet format so that (1) it contains necessary fields for computing RL rewards and (2) is faster to read.
.. code-block:: bash
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
Step 2: Download a model for post-training
-------------------------------------------
In this example, we start with the ``Qwen2.5-0.5B-Instruct`` model.
If you want to perform SFT before RL, refer to the :doc:`Complete GSM8K Example<../examples/gsm8k_example>`, the `sft directory <https://github.com/volcengine/verl/blob/main/examples/sft/gsm8k>`_ and `SFT Trainer <https://github.com/volcengine/verl/blob/main/verl/trainer/fsdp_sft_trainer.py>`_ for further details.
.. code-block:: bash
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')"
Step 3: Perform PPO training with the instruct model
----------------------------------------------------------------------
**Reward Model/Function**
We use a pre-defined rule-based reward model. We force the model to produce its final
answer after four ``#`` characters (``####``), as shown in the solution. We extract the final
answer from both the solution and the model's output using regular
expression matching. We assign a reward of 1 to a correct
answer and 0 to an incorrect or missing answer.
For more details, please refer to `verl/utils/reward_score/gsm8k.py <https://github.com/volcengine/verl/blob/v0.4.1/verl/utils/reward_score/gsm8k.py>`_.
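
A simplified sketch of such a rule-based reward (not the exact implementation in
``gsm8k.py``) could look like this:

.. code-block:: python

    import re

    def compute_score(solution_str: str, ground_truth: str) -> float:
        """Extract the answer after '####' from the model output and compare it to the ground truth."""
        match = re.search(r"####\s*(-?[\d\.,]+)", solution_str)
        if match is None:
            return 0.0                                   # no answer found
        answer = match.group(1).replace(",", "")
        return 1.0 if answer == ground_truth else 0.0

    print(compute_score("... so the result is #### 42", "42"))  # 1.0
    print(compute_score("I am not sure.", "42"))                # 0.0
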
**Training Script**
Now let's run PPO training with the dataset and model above. [2]_
Set ``data.train_files``, ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on your dataset and model names or paths.
You may set ``VERL_USE_MODELSCOPE=True`` to download models from `modelscope <https://www.modelscope.cn>`_ instead of `huggingface <https://huggingface.co>`_.
.. code-block:: bash
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=console \
trainer.val_before_train=False \
trainer.n_gpus_per_node=1 \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
You are expected to see the following logs, indicating training in progress. The key metric ``val/test_score/openai/gsm8k`` is computed every ``trainer.test_freq`` steps:
.. code-block:: bash
step:0 - timing/gen:21.470 - timing/ref:4.360 - timing/values:5.800 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - timing/adv:0.109 - timing/update_critic:15.664 - critic/vf_loss:14.947 - critic/vf_clipfrac:0.000 - critic/vpred_mean:-2.056 - critic/grad_norm:1023.278 - critic/lr(1e-4):0.100 - timing/update_actor:20.314 - actor/entropy_loss:0.433 - actor/pg_loss:-0.005 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:1.992 - actor/lr(1e-4):0.010 - critic/score/mean:0.004 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.004 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:2.360 - critic/advantages/min:-2.280 - critic/returns/mean:0.003 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.045 - critic/values/max:9.500 - critic/values/min:-14.000 - response_length/mean:239.133 - response_length/max:256.000 - response_length/min:77.000 - prompt_length/mean:104.883 - prompt_length/max:175.000 - prompt_length/min:68.000
step:1 - timing/gen:23.020 - timing/ref:4.322 - timing/values:5.953 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty:0.001 - timing/adv:0.118 - timing/update_critic:15.646 - critic/vf_loss:18.472 - critic/vf_clipfrac:0.384 - critic/vpred_mean:1.038 - critic/grad_norm:942.924 - critic/lr(1e-4):0.100 - timing/update_actor:20.526 - actor/entropy_loss:0.440 - actor/pg_loss:0.000 - actor/pg_clipfrac:0.002 - actor/ppo_kl:0.000 - actor/grad_norm:2.060 - actor/lr(1e-4):0.010 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:0.000 - critic/advantages/max:2.702 - critic/advantages/min:-2.616 - critic/returns/mean:0.000 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.280 - critic/values/max:11.000 - critic/values/min:-16.000 - response_length/mean:232.242 - response_length/max:256.000 - response_length/min:91.000 - prompt_length/mean:102.398 - prompt_length/max:185.000 - prompt_length/min:70.000
Check out the ``Algorithm Baselines`` page for full training and validation logs for reference.
The checkpoint is saved at the following directory by default: ``checkpoints/${trainer.project_name}/${trainer.experiment_name}``. You can merge the saved checkpoints into a Hugging Face model using the ``verl.model_merger`` module, for example:
.. code-block:: bash
python3 -m verl.model_merger merge \
--backend fsdp \
--local_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor \
--target_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor/huggingface
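After merging, the resulting directory can be loaded like any Hugging Face checkpoint. The snippet below is illustrative only; substitute the ``<project_name>``/``<experiment_name>`` placeholders with your actual trainer settings.

.. code-block:: python

    # Illustrative only: load the merged checkpoint as a regular Hugging Face model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    merged_path = "checkpoints/<project_name>/<experiment_name>/global_step_1/actor/huggingface"
    model = AutoModelForCausalLM.from_pretrained(merged_path)
    tokenizer = AutoTokenizer.from_pretrained(merged_path)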
For more details about checkpoint and model merging, please refer to :ref:`checkpoint-page`.
To enable ``wandb`` for experiment tracking, set the following configs:
.. code-block:: bash
trainer.logger='["console","wandb"]' \
trainer.project_name=$YOUR_PROJECT_NAME \
trainer.experiment_name=$YOUR_RUN_NAME \
If you encounter out-of-memory issues on GPUs with less than 32GB of HBM, enabling the following configs can help:
.. code-block:: bash
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
critic.ppo_micro_batch_size_per_gpu=1 \
For the full set of configs, please refer to :ref:`config-explain-page` for detailed explanation and performance tuning.
.. [1] The original paper (https://arxiv.org/pdf/2110.14168) mainly focuses on training a verifier (a reward model) to solve math problems via Best-of-N sampling. In this example, we train an RL agent using a rule-based reward model.
.. [2] More training script examples for FSDP and Megatron-LM backend are stored in `examples/ppo_trainer <https://github.com/volcengine/verl/tree/main/examples/ppo_trainer>`_ directory.
Ray Debug Tutorial
==================
Last updated: 04/23/2025
Author: `Ao Shen <https://aoshen524.github.io/>`_.
How to debug?
---------------------
Ray Distributed Debugger VSCode Extension (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Starting with Ray 2.39, Anyscale has introduced the `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_ VSCode extension. Follow the extension’s installation instructions, then add your cluster using the dashboard URL you obtained earlier.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true
:alt: Ray Distributed Debugger VSCode extension screenshot
2. Prerequisites.
Ensure the following are installed (see the extension README for more detail):
- Visual Studio Code
- `ray[default]` >= 2.9.1
- `debugpy` >= 1.8.0
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/readme.png?raw=true
:alt: VSCode with Ray prerequisites
3. Environment Variables.
To enable post‑mortem debugging, set:
.. code-block:: bash
export RAY_DEBUG_POST_MORTEM=1
.. admonition:: Note
:class: important
Be sure to remove any legacy flags before starting Ray:
- `RAY_DEBUG=legacy`
- `--ray-debugger-external`
4. Configuring Breakpoints. Set up ``breakpoint()`` in your code and submit the job to the cluster; the extension will then show the breakpoint information.
1. Insert `breakpoint()` calls into your remote functions.
2. Submit your job to the cluster.
The extension will detect active breakpoints and display them in VSCode.
**Note:** Breakpoints are only supported inside functions decorated with `@ray.remote`.
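For illustration, a minimal hypothetical ``job.py`` (the function and values below are placeholders, not verl code) could look like this:

.. code-block:: python

    # job.py -- minimal sketch of a Ray task with a breakpoint (illustrative only)
    import ray

    ray.init()

    @ray.remote
    def my_task(x):
        y = x * 2
        breakpoint()  # the Ray Distributed Debugger pauses execution here
        return y

    # submit the task and block until the result is ready
    print(ray.get(my_task.remote(21)))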
5. Launching the Debugger.
Run your job directly from the command line (do not use a `launch.json`):
.. code-block:: bash
python job.py
6. Attaching to a Breakpoint.
Once the process hits the first `breakpoint()`, click the Ray Distributed Debugger icon in the VSCode sidebar to attach the debugger.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/launch.png?raw=true
:alt: Attaching VSCode debugger to Ray process
7. Debugging with multiple ``breakpoint()`` calls.
For each subsequent task, first disconnect the current debugger session, then click the extension icon again to attach to the next breakpoint.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/disconnect.png?raw=true
:alt: Disconnecting and reconnecting the debugger
Legacy Ray Debugger
~~~~~~~~~~~~~~~~~~~
1. Ray has a built-in legacy `debugger <https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html>`_ that allows you to debug your distributed applications. To enable it, start the Ray cluster with ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``.
.. code-block:: bash
# start head node
RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external
# start worker node
RAY_DEBUG=legacy ray start --address='10.124.46.192:6379' --ray-debugger-external
2. Set a breakpoint in your code and submit the job to the cluster. Then run ``ray debug`` to wait for the breakpoint:
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/legacy.png?raw=true
PyTorch FSDP Backend
======================
Last updated: 02/12/2025.
We support the PyTorch FSDP backend by implementing various workers for the actor, critic, reference, rollout and reward models. We also implement the ``FSDPVLLMShardingManager`` that reshards weights between FSDP and vLLM in `fsdp_vllm.py <https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/fsdp_vllm.py>`_.
**Pros**
- Readily supports various models.
- Users only need to implement the corresponding ``dtensor_weight_loader`` for weight synchronization between FSDP and vLLM. With ``hf_weight_loader``, users can directly use any model supported by both HF and vLLM without any code change.
- Easy to organize the forward and backward computation for each model.
**Cons**
- Poor scalability for large-scale models (e.g., Llama 70B and 405B).
- The resharding overhead between actor and rollout can be larger than with the Megatron-LM backend.
Due to its simplicity, we recommend using the FSDP backend for algorithm research and prototyping.
FSDP Workers
--------------
ActorRolloutRefWorker
^^^^^^^^^^^^^^^^^^^^^
Actor/Rollout HybridEngine
''''''''''''''''''''''''''
1. HybridEngine, Actor and Rollout initialization API.
.. code:: python
@register(dispatch_mode=Dispatch.ONE_TO_ALL)
def init_model(self):
``ONE_TO_ALL``: when calling the ``init_model`` function from the driver
process, each worker (on a GPU) will execute the following model
initialization process.
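Conceptually, ``ONE_TO_ALL`` replicates a single driver-side call to every worker in the group. The toy sketch below (not verl's actual dispatch code) illustrates the idea:

.. code:: python

    # Toy sketch of ONE_TO_ALL dispatch (not verl's actual implementation):
    # the driver issues one call and every worker executes it with the same arguments.
    class ToyWorkerGroup:
        def __init__(self, workers):
            self.workers = workers

        def call_one_to_all(self, method_name, *args, **kwargs):
            return [getattr(w, method_name)(*args, **kwargs) for w in self.workers]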
The initialization details of HybridEngine, Actor and Rollout are
highlighted below:
1. ``DataParallelPPOActor`` implements the core PPO computation logic when the model is built with FSDP, including log-prob computation and model updates.
2. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM Engine and make it execute under SPMD to fit into our ``WorkerGroup`` design.
3. ``FSDPVLLMShardingManager`` is a context manager that performs the actual resharding between actor and rollout.
See the `source code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_ for more information.
2. Generate sequences and recompute log prob
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def generate_sequences(self, prompts: DataProto):
- ``Dispatch.DP_COMPUTE_PROTO``: The data will be dispatched and
collected along the DP dimension
- In this function, the rollout model will perform auto-regressive
generation and the actor model will recompute the old log prob for the
generated response.
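As a rough illustration (again, not the real decorator internals), ``DP_COMPUTE_PROTO`` can be thought of as chunking the batch along the DP dimension, running each chunk on one worker, and concatenating the outputs:

.. code:: python

    # Toy sketch of DP_COMPUTE_PROTO-style dispatch (not verl's actual implementation).
    def dp_compute(batch, workers, method_name):
        n = len(workers)
        chunk_size = (len(batch) + n - 1) // n
        # one contiguous chunk per DP rank
        chunks = [batch[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]
        # each worker computes on its own chunk
        outputs = [getattr(w, method_name)(chunk) for w, chunk in zip(workers, chunks)]
        # collect: concatenate the per-worker results back together
        return [item for out in outputs for item in out]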
3. Update actor model
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def update_actor(self, data: DataProto):
- Update the actor model weight using PPO & entropy loss.
ReferenceModel
''''''''''''''
1. Reference model initialization
The reference model is initialized using the same function as the actor model, but without initializing the HybridEngine and optimizer. The reference model is then also wrapped by ``DataParallelPPOActor``.
2. Compute reference log prob
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def compute_ref_log_prob(self, data: DataProto):
- In this function, the reference model will call the compute log prob
function in ``DataParallelPPOActor`` to compute the reference log
prob.
CriticWorker and RewardWorker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Model initialization
Quite similar to the reference model. The CriticWorker performs additional initialization for the optimizer.
2. Compute Values for CriticWorker
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def compute_values(self, data: DataProto):
3. Update Critic
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def update_critic(self, data: DataProto):
4. Compute Reward
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def compute_rm_score(self, data: DataProto):
HybridShard
------------
We do not yet support FSDP ``HybridShard``. To support it, we would need to construct a 2D device mesh and test the corresponding ``dtensor_weight_loader`` and ``hf_weight_loader`` for each model.
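The following is a minimal sketch of the kind of 2D device mesh that hybrid sharding would need. It is plain PyTorch, not verl code, and the 2x4 layout assumes 8 GPUs launched with ``torchrun``:

.. code:: python

    # Minimal 2D device-mesh sketch for hybrid sharding (replicate x shard).
    # Run with: torchrun --nproc_per_node=8 mesh_demo.py
    from torch.distributed.device_mesh import init_device_mesh

    # the outer dim replicates the model, the inner dim shards it
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
    print(mesh_2d)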
Megatron-LM Backend
===================
Last updated: 06/24/2025.
We support Megatron Backend by implementing various workers for actor,
critic, reference, rollout and reward models. We also implement the
``3DHybridEngine`` using Megatron-LM and vLLM/SGLang in
`megatron_vllm.py <https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/megatron_vllm.py>`_
and `megatron_sglang.py <https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/megatron_sglang.py>`_.
**Pros**
- Supports 5D parallelism (TP, EP, CP, DP, PP) and sequence parallelism for the best scalability and throughput.
- The 3D HybridEngine can significantly reduce peak memory usage and the weight synchronization overhead between actor and rollout.
**Cons**
- Hugging Face models and Megatron checkpoints require conversion tools.
Development Progress
--------------------
Note that [Deprecated] means the feature is no longer supported in the latest
version of verl.
[To-Optimize] means the feature is implemented but not yet optimized.
[WIP] means the feature is a work in progress.
[In-Release] means the feature is ready and under review, coming soon.
+---------------+-----------------------------------------------------------+
| [Deprecated] | Megatron 3D Parallelism with custom models |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron 0.11.0 ``GPTModel`` support |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron GRPO support |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron with vLLM 0.8.2, with per-tensor weights loading |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron with Context Parallel |
+---------------+-----------------------------------------------------------+
| [Done] | Qwen2MoE model support |
+---------------+-----------------------------------------------------------+
| [To-Optimize] | Megatron dist Checkpoint |
+---------------+-----------------------------------------------------------+
| [To-Optimize] | Huggingface and Megatron Checkpoint Converter |
+---------------+-----------------------------------------------------------+
| [To-Optimize] | Efficient fused linear, entropy and cross entropy |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron offload(param, grad, optimizer) |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron Profiler |
+---------------+-----------------------------------------------------------+
| [In-Release] | Megatron 0.12.0, TE 2.2 with vLLM 0.8.3 and Fused Attn |
+---------------+-----------------------------------------------------------+
| [WIP] | Moonlight/DeepSeek-V3 model support |
+---------------+-----------------------------------------------------------+
| [WIP] | Expert Parallel support |
+---------------+-----------------------------------------------------------+
| [WIP] | Megatron support dynamic batch size |
+---------------+-----------------------------------------------------------+
| [To-Do] | Performance tuning |
+---------------+-----------------------------------------------------------+
| [MileStone] | Runnable with DeepSeek-V3 671B post-training |
+---------------+-----------------------------------------------------------+
Utils of Megatron Workers
-------------------------
MegatronWorker
^^^^^^^^^^^^^^
``MegatronWorker`` is the base class of the different Megatron worker
classes. In this class, the ``get_megatron_global_info`` and
``get_megatron_rank_info`` functions retrieve the 3D parallel world
size and rank of each ``Worker`` running on a specific GPU. This information
is used in the transfer protocol for the Megatron backend.
The following ``Worker`` classes for different models are used to
construct the ``WorkerGroup``.
We implement various APIs for each ``Worker`` class, decorated by
``@register(dispatch_mode=)``. These APIs can be called by the Ray
driver process. The data is correctly collected and dispatched following
the ``dispatch_mode`` of each function. The supported ``dispatch_mode`` values
(i.e., transfer protocols) can be found in `decorator.py <https://github.com/volcengine/verl/blob/main/verl/single_controller/base/decorator.py>`_.
ActorRolloutRefWorker
^^^^^^^^^^^^^^^^^^^^^
This class is implemented for Actor/Rollout HybridEngine or for the
reference model to initialize their model and perform computation.
Actor/Rollout HybridEngine
''''''''''''''''''''''''''
1. HybridEngine, Actor and Rollout initialization API.
.. code:: python
@register(dispatch_mode=Dispatch.ONE_TO_ALL)
def init_model(self):
``ONE_TO_ALL``: when calling the ``init_model`` function from the driver
process, each worker (on a GPU) will execute the following model
initialization process.
The initialization details of HybridEngine, Actor and Rollout are
highlighted below:
1. ``MegatronPPOActor`` implements the core PPO computation logic when the model is built with Megatron, including log-prob computation and model updates.
2. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM Engine and make it execute under SPMD to fit into our ``WorkerGroup`` design.
3. ``MegatronVLLMShardingManager`` is a context manager that performs the actual resharding between actor and rollout.
See the `source code <https://github.com/volcengine/verl/blob/main/verl/workers/megatron_workers.py#L63>`_ for more information.
.. code:: python
# build actor model
self.actor = MegatronPPOActor(config=self.config.actor,
model_config=self.actor_model_config,
megatron_config=megatron_config,
actor_module=self.actor_module,
actor_optimizer=self.actor_optimizer,
actor_optimizer_config=self.actor_optim_config)
# build rollout
# rollout initialization
rollout = vLLMRollout(actor_module=params,
config=self.config.rollout,
tokenizer=self.tokenizer,
model_hf_config=self.actor_model_config,
train_tp=mpu.get_tensor_model_parallel_world_size())
# perform weight resharding between actor and rollout
sharding_manager = MegatronVLLMShardingManager(module=self.hybrid_engine,
inference_engine=rollout.inference_engine,
model_config=self.actor_model_config,
layer_name_mapping=layer_name_mapping)
...
2. Generate sequences and recompute log prob
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_PP_AS_DP_PROTO)
def generate_sequences(self, prompts: DataProto):
- ``Dispatch.MEGATRON_PP_AS_DP_PROTO``: The PP dimension of the actor
model will be regarded as DP dimension. Then the driver process will
dispatch and collect the data according to this reorganization. This
is because, in HybridEngine, the actor weight, which usually applied
larger 3D parallel sizes, will be gathered along the PP dimension and
TP dimension. Therefore, the corresponding data should be dispatched
and collected through the 3D parallel group of the rollout model,
rather than the actor model. However, the world_size and rank
information can only be retrieved from ``get_megatron_global_info`` and
``get_megatron_rank_info``, which records the 3D information for the
actor model. Moreover, the data resharding inside TP dimension will be
processed within the HybridEngine.
- In this function, the rollout model will perform auto-regressive
generation and the actor model will recompute the old log prob for the
generated response.
3. Update actor model
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def update_actor(self, data: DataProto):
- ``Dispatch.MEGATRON_COMPUTE_PROTO``: The user passes data partitioned along the DP dimension. The data is dispatched to all tp/pp ranks within the same dp group, and output data is ultimately collected only from tp=0 and the last pp stage.
- Update the actor model weight using PPO & entropy loss.
- Update the actor model weight using PPO & entropy loss.
.. note::
Currently, training Tensor Parallel Size can be different from inference
Tensor Parallel Size.
ReferenceModel
''''''''''''''
1. Reference model initialization
The reference model is initialized using the same function as the actor model, but without initializing the HybridEngine and optimizer. The reference model is then also wrapped by ``MegatronPPOActor``.
2. Compute reference log prob
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_ref_log_prob(self, data: DataProto):
- In this function, the reference model will call the compute log prob
function in ``MegatronPPOActor`` to compute the reference log prob.
CriticWorker and RewardWorker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Model initialization
Quite similar to the reference model. The CriticWorker performs additional initialization for the optimizer.
2. Compute Values for CriticWorker
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_values(self, data: DataProto):
3. Update Critic
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def update_critic(self, data: DataProto):
4. Compute Reward
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_rm_score(self, data: DataProto):
Utils of Train Optimization
---------------------------
Offload
^^^^^^^
When resources are tight, offloading can lower GPU memory usage, helping the training and inference frameworks work well under verl. It moves parameters, gradients, and optimizer states to CPU memory and only loads them back to the GPU when needed.
To use offloading, add the following parameters for the actor and ref separately.
.. code:: bash
# For the actor
actor_rollout_ref.actor.megatron.param_offload=True \
actor_rollout_ref.actor.megatron.grad_offload=True \
actor_rollout_ref.actor.megatron.optimizer_offload=True \
# For the ref w/o grad and optimizer
actor_rollout_ref.ref.megatron.param_offload=True \
For the critic, you can include these parameters.
.. code:: bash
# For the critic
critic.megatron.param_offload=True \
critic.megatron.grad_offload=True \
critic.megatron.optimizer_offload=True \
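Conceptually, parameter offload just moves tensors between CPU and GPU around each use. The toy sketch below is only an illustration of the idea; verl's actual implementation also handles gradients and optimizer state:

.. code:: python

    # Toy sketch of parameter offload (not verl's actual implementation).
    import torch

    def offload_params_to_cpu(module: torch.nn.Module):
        # free GPU memory while the module is idle
        module.to("cpu")

    def load_params_to_gpu(module: torch.nn.Module, device: str = "cuda"):
        # bring the weights back right before compute
        module.to(device)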
Profiler
^^^^^^^^
The profiler is a tool that helps you understand the performance of your
model. It can be used to profile the time spent on different operations
and identify the bottlenecks. You can get more information from
`torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_.
In verl, the profiler is currently only supported for the actor role in Megatron. You can set
the begin and end steps to profile; note that one step means one gradient update. The
profiling result will be saved in ``save_path``. If you only want to profile specific
ranks, set ``profile_ranks``; by default, it is ``[0]``.
.. code:: bash
actor_rollout_ref.actor.profile.use_profile=True \
actor_rollout_ref.actor.profile.profile_ranks=[0] \
actor_rollout_ref.actor.profile.step_start=0 \
actor_rollout_ref.actor.profile.step_end=1 \
actor_rollout_ref.actor.profile.save_path="./profile"
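If you want to see what such a trace looks like outside verl, the standalone ``torch.profiler`` snippet below roughly mirrors the ``step_start``/``step_end`` idea above (the save path matches the config; everything else is illustrative):

.. code:: python

    # Standalone torch.profiler example (illustrative; not verl's internal wiring).
    import torch
    from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

    prof = profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=0, warmup=1, active=1),  # skip step 0, profile step 1
        on_trace_ready=tensorboard_trace_handler("./profile"),
    )
    with prof:
        for step in range(2):
            # ... one gradient update would go here ...
            torch.randn(1024, 1024) @ torch.randn(1024, 1024)  # dummy work
            prof.step()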
Related MCore Document
----------------------
There is also a detailed document on using MCore to train different
kinds of models; please refer to the `MCore Document <https://github.com/volcengine/verl/blob/main/verl/models/mcore/readme.md>`_.
PPO Ray Trainer
===============
Last updated: 02/12/2025.
We implement the ``RayPPOTrainer``, which is a trainer that runs on the driver
process on a single CPU/GPU node (CPU by default).
The ``RayPPOTrainer`` includes three core functions for data preparation,
WorkerGroup initialization and the PPO training loop.
Data Preparation
----------------
The ``RayPPOTrainer``, as a single process, is responsible for loading a
complete batch of samples (prompts) from the dataset and then dispatching
them to different worker groups running on different GPUs.
To generalize the data loading, we implement the ``RLHFDataset`` class
to load the preprocessed parquet files, apply chat templates to the
prompts, add padding, truncate prompts that exceed max prompt length and
then tokenize.
.. code:: python
self.train_dataset = RLHFDataset(data_files=self.config.data.train_files,
tokenizer=self.tokenizer,
config=self.config.data)
Then, the dataloader iterates over the dataset with the PPO mini-batch size.
WorkerGroup Initialization
--------------------------
We first introduce a basic implementation of initializing the
``WorkerGroup`` of the actor model on a given set of GPUs.
.. code:: python
# max_colocate_count means the number of WorkerGroups (i.e. processes) in each RayResourcePool
# For FSDP backend, we recommend using max_colocate_count=1 that merge all WorkerGroups into one.
# For Megatron backend, we recommend using max_colocate_count>1 that can utilize different WorkerGroup for differnt models
resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes,
use_gpu=True,
max_colocate_count=1)
# define actor rollout cls to be init on remote
actor_rollout_cls = RayClassWithInitArgs(cls=ActorRolloutWorker)
# define actor_rollout worker group
actor_rollout_worker_group = MegatronRayWorkerGroup(resource_pool=resource_pool,
ray_cls_with_init=actor_rollout_cls,
default_megatron_kwargs=config.actor_rollout.megatron)
In the above implementation, different WorkerGroups, such as ``actor_rollout_worker_group``,
``critic_worker_group`` and ``ref_worker_group``, each live in a separate
process.
The driver process can then call the distributed compute function within
the ``actor_rollout_worker_group`` and other roles to construct the RL
training loop.
For models colocated on the same set of GPUs, we further provide a
fine-grained optimization that merges the ``worker_group`` of different roles
into the same process. This optimization saves the redundant
CUDA/distributed contexts of separate processes.
.. code:: python
# initialize WorkerGroup
# NOTE: if you want to use a different resource pool for each role, which can support different parallel size,
# you should not use `create_colocated_worker_cls`. Instead, directly pass different resource pool to different worker groups.
# See TODO(url) for more information.
all_wg = {}
for resource_pool, class_dict in self.resource_pool_to_cls.items():
worker_dict_cls = create_colocated_worker_cls(class_dict=class_dict)
wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
spawn_wg = wg_dict.spawn(prefix_set=class_dict.keys())
all_wg.update(spawn_wg)
if self.use_critic:
self.critic_wg = all_wg['critic']
self.critic_wg.init_model()
if self.use_reference_policy:
self.ref_policy_wg = all_wg['ref']
self.ref_policy_wg.init_model()
if self.use_rm:
self.rm_wg = all_wg['rm']
self.rm_wg.init_model()
# we should create rollout at the end so that vllm can have a better estimation of kv cache memory
self.actor_rollout_wg = all_wg['actor_rollout']
self.actor_rollout_wg.init_model()
.. note:: For the Megatron backend, if we merge the ``worker_groups`` into the same processes, all roles will use the same 3D parallel size. To optimize this, we may need to maintain several 3D process groups for each role within the same distributed context. If you want to use different 3D parallel sizes for different roles, please follow a similar architecture to the first code block to initialize each role's ``worker_group``.
PPO Training Loop
-----------------
We implement the PPO training loop by calling the functions of each role's
worker_group. The input and output data of each function is
a ``DataProto`` object implemented in `protocol.py <https://github.com/volcengine/verl/blob/main/verl/protocol.py>`_. In the training
loop, the trainer dispatches/collects data to/from different GPUs
following the transfer protocols wrapped in the workers' functions. The
computation over PPO micro-batches is handled inside the ``update_actor`` and
``update_critic`` functions.
To extend to other RLHF algorithms, such as DPO, GRPO, please refer to
:doc:`../advance/dpo_extension`.
.. code:: python
def fit(self):
"""
The training loop of PPO.
The driver process only need to call the compute functions of the worker group through RPC to construct the PPO dataflow.
The light-weight advantage computation is done on the driver process.
"""
from verl.utils.tracking import Tracking
from omegaconf import OmegaConf
logger = Tracking(project_name=self.config.trainer.project_name,
experiment_name=self.config.trainer.experiment_name,
default_backend=self.config.trainer.logger,
config=OmegaConf.to_container(self.config, resolve=True))
global_steps = 0
# perform validation before training
# currently, we only support validation using the reward_function.
if self.val_reward_fn is not None:
val_metrics = self._validate()
pprint(f'Initial validation metrics: {val_metrics}')
for epoch in range(self.config.trainer.total_epochs):
for batch_dict in self.train_dataloader:
metrics = {}
batch: DataProto = DataProto.from_single_dict(batch_dict)
# batch = batch.to('cuda')
# pop those keys for generation
gen_batch = batch.pop(batch_keys=['input_ids', 'attention_mask', 'position_ids'])
# generate a batch
with Timer(name='gen', logger=None) as timer:
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
metrics['timing/gen'] = timer.last
batch = batch.union(gen_batch_output)
if self.use_reference_policy:
# compute reference log_prob
with Timer(name='ref', logger=None) as timer:
ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
batch = batch.union(ref_log_prob)
metrics['timing/ref'] = timer.last
# compute values
with Timer(name='values', logger=None) as timer:
values = self.critic_wg.compute_values(batch)
batch = batch.union(values)
metrics['timing/values'] = timer.last
with Timer(name='adv', logger=None) as timer:
# compute scores. Support both model and function-based.
# We first compute the scores using reward model. Then, we call reward_fn to combine
# the results from reward model and rule-based results.
if self.use_rm:
# we first compute reward model score
reward_tensor = self.rm_wg.compute_rm_score(batch)
batch = batch.union(reward_tensor)
# we combine with rule-based rm
reward_tensor = self.reward_fn(batch)
batch.batch['token_level_scores'] = reward_tensor
# compute rewards. apply_kl_penalty if available
batch, kl_metrics = apply_kl_penalty(batch,
kl_ctrl=self.kl_ctrl_in_reward,
kl_penalty=self.config.algorithm.kl_penalty)
metrics.update(kl_metrics)
# compute advantages, executed on the driver process
batch = compute_advantage(batch,
self.config.algorithm.gamma,
self.config.algorithm.lam,
adv_estimator=self.config.algorithm.adv_estimator)
metrics['timing/adv'] = timer.last
# update critic
if self.use_critic:
with Timer(name='update_critic', logger=None) as timer:
critic_output = self.critic_wg.update_critic(batch)
metrics['timing/update_critic'] = timer.last
critic_output_metrics = reduce_metrics(critic_output.meta_info['metrics'])
metrics.update(critic_output_metrics)
# implement critic warmup
if self.config.trainer.critic_warmup <= global_steps:
# update actor
with Timer(name='update_actor', logger=None) as timer:
actor_output = self.actor_rollout_wg.update_actor(batch)
metrics['timing/update_actor'] = timer.last
actor_output_metrics = reduce_metrics(actor_output.meta_info['metrics'])
metrics.update(actor_output_metrics)
# validate
if self.val_reward_fn is not None and (global_steps + 1) % self.config.trainer.test_freq == 0:
with Timer(name='testing', logger=None) as timer:
val_metrics: dict = self._validate()
val_metrics = {f'val/{key}': val for key, val in val_metrics.items()}
metrics['timing/testing'] = timer.last
metrics.update(val_metrics)
# collect metrics
data_metrics = compute_data_metrics(batch=batch)
metrics.update(data_metrics)
# TODO: make a canonical logger that supports various backend
logger.log(data=metrics, step=global_steps)
if self.config.trainer.save_freq > 0 and (global_steps + 1) % self.config.trainer.save_freq == 0:
actor_local_path = os.path.join(self.config.trainer.default_local_dir, 'actor',
f'global_step_{global_steps}')
actor_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'actor')
self.actor_rollout_wg.save_checkpoint(actor_local_path, actor_remote_path)
if self.use_critic:
critic_local_path = os.path.join(self.config.trainer.default_local_dir, 'critic',
f'global_step_{global_steps}')
critic_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'critic')
self.critic_wg.save_checkpoint(critic_local_path, critic_remote_path)
global_steps += 1
# perform validation after training
if self.val_reward_fn is not None:
val_metrics = self._validate()
pprint(f'Final validation metrics: {val_metrics}')
SGLang Backend
==============
Last updated: 05/31/2025.
**Authored By SGLang RL Team and listed alphabetically by last name**
`Jingyi Chen <https://github.com/fzyzcjy>`_, `Yitong Guan <https://github.com/minleminzui>`_, `Zhuobin Huang <https://zobinhuang.github.io/sec_about/>`_, `Jiajun Li <https://github.com/guapisolo>`_, `Ji Li <https://github.com/GeLee-Q>`_, `Shenggui Li <https://franklee.xyz/about>`_, `Junrong Lin <https://github.com/ocss884>`_, `Xiang Long <https://github.com/SwordFaith>`_, `Rui Lu <https://scholar.google.com/citations?user=-MGuqDcAAAAJ>`_, `Jin Pan <https://jhinpan.github.io/>`_, `Shuai Shi <https://github.com/shuaills>`_, `Yushen Su <https://yushengsu-thu.github.io/>`_, `Xinyuan Tong <https://github.com/JustinTong0323>`_, `Chendong Wang <https://github.com/cedricbeta>`_, `Hanchen Zhang <https://scholar.google.com/citations?user=pGcJcagAAAAJ>`_, `Haoran Wang <https://ubecc.github.io/about/>`_, `Yongan Xiang <https://github.com/BearBiscuit05>`_, `Chengxing Xie <https://yitianlian.github.io/>`_, `Yuhao Yang <https://github.com/yhyang201>`_, `Jinwei Yao <https://kivi-yao.github.io/>`_, `Qiaolin Yu <https://github.com/Qiaolin-Yu>`_, `Yuzhen Zhou <https://github.com/zyzshishui>`_, `Chenyang Zhao <https://github.com/zhaochenyang20>`_
Introduction
------------
`SGLang <https://github.com/sgl-project/sglang>`_ is an open-source, state-of-the-art inference serving engine, fully adopted by xAI to support all of Grok's inference needs during research and serving.
Currently, verl fully supports using SGLang as the inference engine during the rollout phase. As a rollout engine, SGLang provides the same feature coverage as vLLM, including memory saving and multi-node rollout. After installing verl and SGLang, simply add ``actor_rollout_ref.rollout.name=sglang`` to the startup script to seamlessly switch between the two inference frameworks.
In addition, the SGLang team is actively working on supporting features such as Multi-Turn Agentic RL, VLM RLHF, Server-Based RLHF, and Partial Rollout. You can track the related development progress in the `Tracking Roadmap <https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/74>`_.
Installation
------------
Please use the following commands to install SGLang with verl.
.. code-block:: bash
pip install --upgrade pip
# Currently 0.4.8, subject to updates at any time, please refer to the latest version specified in `setup.py`
pip install -e ".[sglang]"
You can check that the following dependencies are in your environment:
.. note::
- **PyTorch**: 2.6.0+cu124
- **CUDA**: 12.4
- **flashinfer-python**: 0.2.5+cu124torch2.6
- **SGLang**: 0.4.6.post5
- **sgl-kernel**: 0.1.4
Using SGLang as the Inference Backend for PPO Training on a Single Machine
-------------------------------------------------------------------------
We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.
1. Run the following command to prepare the gsm8k dataset:
.. code-block:: bash
python3 examples/data_preprocess/gsm8k.py
2. Run the following script to conduct a PPO experiment on a single machine with 4 GPUs:
.. code-block:: bash
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=4096 \
data.max_prompt_length=4096 \
data.max_response_length=4096 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2-7B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.model.fsdp_config.param_offload=True \
critic.model.fsdp_config.optimizer_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=console \
trainer.val_before_train=False \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
Why export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
2. ``SGLangRollout`` will initialize ``Engine``, and further initialize a ``torch.distributed.DeviceMesh``, used to support Tensor Parallel (TP).
3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it directly reports an error to avoid initialization failures or deadlocks.
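The check is conceptually similar to the toy function below (this is not SGLang's actual code, only an illustration of why a large free-memory spread aborts initialization):

.. code-block:: python

    # Toy illustration of a free-memory imbalance check (not SGLang's actual code).
    def check_memory_balance(free_bytes_per_rank, tolerance=0.1):
        lo, hi = min(free_bytes_per_rank), max(free_bytes_per_rank)
        if (hi - lo) / hi > tolerance:
            raise RuntimeError(
                f"free GPU memory differs by {(hi - lo) / hi:.0%} across ranks, "
                f"exceeding the {tolerance:.0%} tolerance"
            )

    # rank 1 loaded its model earlier and has much less free memory -> raises
    check_memory_balance([80e9, 40e9])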
Why might there be inconsistent GPU memory?
"""""""""""""""""""""""""""""""""""""""""""
**1. Ray Distributed Actor loads the model at different times**
``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
.. code-block:: python
self.rollout = SGLangRollout(...)
Different workers initialize the model at different times → different memory usage.
**2. Delayed initialization causes memory bias**
Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
Early workers already use up GPU memory → late workers still have empty memory → memory difference appears.
**3. SGLang's TP init uses "all-device broadcast", but there's no uniform release timing**
Although ``SGLangRollout`` may only involve a subset of GPUs, its ``Engine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
- Non-rollout GPUs also join the communication.
- Later on, ``DeviceMesh`` init will fail due to "inconsistent memory".
**4. Different FSDP/TP loading behaviors also lead to mismatch**
If using:
.. code-block:: bash
actor.fsdp_config.param_offload=True
ref.fsdp_config.param_offload=True
Then some workers keep params on CPU while others have already sharded them to GPU → an asymmetric memory layout appears.
Using SGLang as the Inference Backend for PPO Training Across Multiple Machines
------------------------------------------------------------------------------
SGLang also supports running verl's Ray-based cross-machine inference in IPv4 and IPv6 scenarios. In the script below, we use TP=16 for cross-machine inference. Suppose we have two interconnected machines: node0 with IP 10.94.16.4 and node1 with IP 10.94.16.5.
1. Start Ray on node0:
.. code-block:: bash
ray start --head --dashboard-host=0.0.0.0
You will see the following prompt:
.. code-block:: bash
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 10.94.16.4
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='10.94.16.4:6379'
2. Have node1 join the Ray cluster:
Run the following command on node1:
.. code-block:: bash
ray start --address='10.94.16.4:6379'
Run the following command to confirm that the Ray cluster now has two nodes:
.. code-block:: bash
ray status
You can see that the cluster has two nodes with 16 GPUs:
.. code-block:: bash
======== Autoscaler status: 2025-04-09 09:25:37.694016 ========
Node status
---------------------------------------------------------------
Active:
1 node_ef382ffd687d8f6b060c1b68e63ada7341b936fe5b1901dd04de1027
1 node_1eb4d7d07e793114c23a89d1a41f1f76acf6ef5b35af844a4ee8e4ba
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/360.0 CPU
0.0/16.0 GPU
0B/3.39TiB memory
0B/372.53GiB object_store_memory
3. Run the following script to train meta-llama/Llama-3.1-8B-Instruct with TP=16 across 2 machines using 16 GPUs:
.. code-block:: bash
DATA_DIR=$HOME/data/gsm8k
python3 -m verl.trainer.main_ppo \
actor_rollout_ref.rollout.name=sglang \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=4096 \
data.max_prompt_length=4096 \
data.max_response_length=4096 \
actor_rollout_ref.model.path=meta-llama/Llama-3.1-8B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=16 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.rollout.free_cache_engine=True \
actor_rollout_ref.ref.log_prob_micro_batch_size=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=meta-llama/Llama-3.1-8B-Instruct \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size=16 \
critic.model.fsdp_config.param_offload=True \
critic.model.fsdp_config.optimizer_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.val_before_train=True \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
# Copyright 2024 Bytedance Ltd. and/or its affiliates
# Copyright 2023-2024 SGLang Team
# Copyright 2025 ModelBest Inc. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the DAPO-Math-17k dataset to multiturn format
"""
import argparse
import os
import datasets
from verl.utils.hdfs_io import copy, makedirs
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--local_dir", default="~/data/retool_aime2024")
parser.add_argument("--hdfs_dir", default=None)
args = parser.parse_args()
data_path = "BytedTsinghua-SIA/AIME-2024"
dataset = datasets.load_dataset(data_path, "default")
train_dataset = dataset["train"]
# add a row to each data item that represents a unique id
def make_map_fn(split):
def process_fn(example, idx):
orig_extra_info = example.pop("extra_info")
extra_info = orig_extra_info.copy()
extra_info["need_tools_kwargs"] = True
extra_info["tools_kwargs"] = {
"code_interpreter": {
"create_kwargs": {
"ground_truth": example["reward_model"]["ground_truth"],
},
},
}
example["extra_info"] = extra_info
return example
return process_fn
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
local_dir = args.local_dir
hdfs_dir = args.hdfs_dir
train_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))
if hdfs_dir is not None:
makedirs(hdfs_dir)
copy(src=local_dir, dst=hdfs_dir)
# Copyright 2024 Bytedance Ltd. and/or its affiliates
# Copyright 2023-2024 SGLang Team
# Copyright 2025 ModelBest Inc. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the DAPO-Math-17k dataset to multiturn format
"""
import argparse
import os
import datasets
from verl.utils.hdfs_io import copy, makedirs
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--local_dir", default="~/data/retool_dapo")
parser.add_argument("--hdfs_dir", default=None)
args = parser.parse_args()
data_path = "BytedTsinghua-SIA/DAPO-Math-17k"
dataset = datasets.load_dataset(data_path, "default")
train_dataset = dataset["train"]
# add a row to each data item that represents a unique id
def make_map_fn(split):
def process_fn(example, idx):
orig_extra_info = example.pop("extra_info")
extra_info = orig_extra_info.copy()
extra_info["need_tools_kwargs"] = True
extra_info["tools_kwargs"] = {
"code_interpreter": {
"create_kwargs": {
"ground_truth": example["reward_model"]["ground_truth"],
},
},
}
example["extra_info"] = extra_info
return example
return process_fn
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
local_dir = args.local_dir
hdfs_dir = args.hdfs_dir
train_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))
if hdfs_dir is not None:
makedirs(hdfs_dir)
copy(src=local_dir, dst=hdfs_dir)