Prepare Data for Post-Training
========================================
Last updated: 02/09/2025.
Before starting the post-training job, we need to prepare the data for
the policy training. The data should be stored in the parquet format.
We provide several data preprocessing scripts for different datasets,
including GSM8K, MATH, HellaSwag, and Full_hh_rlhf. To prepare other datasets,
follow the steps below. A data preprocessing script can be divided
into two parts:
1. The first part is common across datasets: it loads the dataset with
huggingface's ``datasets`` package, preprocesses it with ``make_map_fn``,
and stores the result in the parquet format.
.. code:: python
import re
import os
import datasets
from verl.utils.hdfs_io import copy, makedirs
import argparse
# To extract the solution for each prompt in the dataset
# def extract_solution(solution_str):
# ...
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--local_dir', default='/opt/tiger/gsm8k')
parser.add_argument('--hdfs_dir', default=None)
args = parser.parse_args()
num_few_shot = 5
data_source = 'openai/gsm8k'
dataset = datasets.load_dataset(data_source, 'main')
train_dataset = dataset['train']
test_dataset = dataset['test']
# Construct a `def make_map_fn(split)` for the corresponding datasets.
# ...
train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)
local_dir = args.local_dir
hdfs_dir = args.hdfs_dir
train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))
if hdfs_dir is not None:
    makedirs(hdfs_dir)
    copy(src=local_dir, dst=hdfs_dir)
2. The second part is dataset-specific: users implement the ``make_map_fn()``
function (as well as ``extract_solution``) on their own to support
different datasets or tasks.
We have already implemented the data preprocessing for the GSM8K, MATH, HellaSwag, and Full_hh_rlhf
datasets, and we take GSM8K as an example below:
**GSM8K**
In ``make_map_fn``, each data item should consist of the following
5 fields:
1. ``data_source``: The name of the dataset, used to index the corresponding
reward function in the ``RewardManager``.
2. ``prompt``: This field should be constructed in the format of
huggingface chat_template. The tokenizer in ``RLHFDataset`` will
apply chat template and tokenize the prompt.
3. ``ability``: Define the task category.
4. ``reward_model``: Currently, we only utilize the ``ground_truth``
field during evaluation. The ``ground_truth`` is computed by the
``extract_solution`` function. **Note** that the implementation of
the corresponding reward function must align with this extracted
``ground_truth``.
5. ``extra_info``: Records additional information about the current prompt. Not
used for now.
.. code:: python
def extract_solution(solution_str):
solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str) # extract the solution after ####
assert solution is not None
final_solution = solution.group(0)
final_solution = final_solution.split('#### ')[1].replace(',', '')
return final_solution
instruction_following = "Let's think step by step and output the final answer after \"####\"."
# add a row to each data item that represents a unique id
def make_map_fn(split):
def process_fn(example, idx):
question = example.pop('question')
question = question + ' ' + instruction_following
answer = example.pop('answer')
solution = extract_solution(answer)
data = {
"data_source": data_source,
"prompt": [{
"role": "user",
"content": question
}],
"ability": "math",
"reward_model": {
"style": "rule",
"ground_truth": solution
},
"extra_info": {
'split': split,
'index': idx
}
}
return data
return process_fn
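For illustration, here is a rough sketch of how ``make_map_fn`` is applied to a single record. The ``question`` and ``answer`` strings below are made-up placeholders rather than real GSM8K rows:

.. code:: python

    # Hypothetical record in the GSM8K schema (question/answer columns).
    example = {
        "question": "Natalia has 5 apples and buys 3 more. How many apples does she have?",
        "answer": "She now has 5 + 3 = 8 apples.\n#### 8",
    }

    row = make_map_fn("train")(example, 0)
    # row["prompt"] is a chat-template-style message list and
    # row["reward_model"]["ground_truth"] == "8"
    print(row["data_source"], row["reward_model"]["ground_truth"])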
Implement Reward Function for Dataset
======================================
Last updated: 06/02/2025.
For each dataset, we need to implement a reward function or utilize a reward model to compute the rewards for the generated responses.
We already pre-implemented some reward functions in `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
You can also use customized reward functions.
Currently, we support reward functions for GSM8k and MATH datasets. For RLHF datasets (e.g.,
full_hh_rlhf) and Code Generation (e.g., APPS), we utilize reward model
and SandBox (will opensource soon) for evaluation respectively.
RewardManager
-------------
In the entrypoint of the PPO Post-Training script `main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py#L33>`_,
we implement a ``RewardManager`` that utilizes the pre-implemented reward functions to compute the scores for each response.
In the ``RewardManager``, we implemented a ``__call__`` function to
compute the score for each response.
All the reward functions are executed by ``compute_score_fn``.
The input is a ``DataProto``, which includes:
- ``input_ids``, ``attention_mask``: ``input_ids`` and ``attention_mask`` after applying
chat_template, including prompt and response
- ``responses``: response tokens
- ``ground_truth``: The ground truth string of the current prompt.
Stored in ``non_tensor_batch`` in the ``DataProto``, which should be
preprocessed in the parquet files.
- ``data_source``: The dataset name of the current prompt. Stored in
``non_tensor_batch`` in the ``DataProto``, which should be
preprocessed in the parquet files.
After the responses are detokenized, the response string and the ground
truth string are passed to ``compute_score_fn`` to compute the
score for each response.
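To make this flow concrete, here is a minimal, illustrative sketch of such a ``__call__`` loop. It is simplified from the actual ``RewardManager`` (batching, tensor shapes, and score placement are omitted); the field names follow the description above:

.. code:: python

    from verl.utils.reward_score import gsm8k

    # Example mapping from data_source to a rule-based scoring function.
    # Extend it with one entry per dataset you preprocess.
    COMPUTE_SCORE_FNS = {
        "openai/gsm8k": gsm8k.compute_score,
    }

    def compute_scores(samples, tokenizer):
        """samples: a list of dicts standing in for one DataProto batch."""
        scores = []
        for sample in samples:
            response_str = tokenizer.decode(sample["responses"], skip_special_tokens=True)
            compute_score_fn = COMPUTE_SCORE_FNS[sample["data_source"]]
            scores.append(compute_score_fn(response_str, sample["ground_truth"]))
        return scores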
Reward Functions
----------------
Pre-implemented
~~~~~~~~~~~~~~~
We already pre-implemented some reward functions in `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
- In the `GSM8k example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py>`_, we
  force the response to output the final answer after four hash signs (``####``), then
  use string matching to compare it with the ground truth. If completely
  correct, the response scores 1 point; if only the format is correct, it scores 0.1 points; if
  the format is incorrect, it scores 0 points. A simplified sketch of this rule follows the list.
- In the `MATH example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/math.py>`_, we follow
the implementation in `lm-evaluation-harness repository <https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hendrycks_math/utils.py>`_.
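The GSM8K rule described above can be approximated by the following simplified sketch (the actual implementation in ``gsm8k.py`` handles answer extraction more carefully):

.. code:: python

    import re

    def gsm8k_rule_score(solution_str, ground_truth, format_score=0.1, score=1.0):
        """Rule-based GSM8K reward: 1.0 if correct, 0.1 if only the format is right, else 0.0."""
        match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
        if match is None:
            return 0.0  # format is incorrect
        answer = match.group(1).replace(",", "")
        return score if answer == ground_truth else format_score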
Customized
~~~~~~~~~~
You can implement customized reward functions in a separate file and specify them using ``custom_reward_function.path`` and ``custom_reward_function.name``. For details on setting them, please refer to :ref:`config-explain-page`.
The parameters of your reward function should be ``data_source``, ``solution_str``, ``ground_truth``, and ``extra_info``.
For example:
.. code:: python
def my_reward_fn(data_source, solution_str, ground_truth, extra_info=None):
return len(solution_str)/100
If you are testing only a single customized reward function, you can simply name it ``compute_score`` and leave ``custom_reward_function.name`` unset.
To run multiple tests with different customized reward functions, you can modify both ``custom_reward_function.path`` and ``custom_reward_function.name`` for each trial.
For instance, you might create a single `my_reward.py` file and implement multiple reward functions within it. This way, for different trials, you only need to adjust ``custom_reward_function.name``, making it more convenient to conduct multiple tests within scripts.
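For example, a hypothetical ``my_reward.py`` could bundle several candidate reward functions (the bodies are placeholders; only the signature convention matters):

.. code:: python

    # my_reward.py -- hypothetical file holding several customized reward functions.
    def length_reward(data_source, solution_str, ground_truth, extra_info=None):
        # Toy reward: longer responses score higher.
        return len(solution_str) / 100

    def exact_match_reward(data_source, solution_str, ground_truth, extra_info=None):
        # Toy reward: 1.0 only when the response matches the ground truth exactly.
        return 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0

You can then keep ``custom_reward_function.path`` pointing at this file and switch only ``custom_reward_function.name`` (e.g., between ``length_reward`` and ``exact_match_reward``) across trials.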
# markdown support
recommonmark
myst_parser
# markdown table support
sphinx-markdown-tables
# theme default rtd
# crate-docs-theme
sphinx-rtd-theme
# pin tokenizers version to avoid env_logger version req
tokenizers==0.21
Interaction System for Multi-turn RL Training
=============================================
Last updated: 06/25/2025.
Overview
--------
The verl interaction system enables dynamic, multi-turn conversational feedback during reinforcement learning training. This system allows models to engage in iterative problem-solving scenarios where interaction agents can provide corrective feedback, guidance, or evaluation based on the model's responses.
**New in Multi-Interaction Support**: The system now supports multiple named interactions within a single training session, enabling sophisticated training scenarios where different samples can use different interaction strategies. This allows for curriculum learning, domain-specific feedback, and flexible agent switching at the sample level.
Key features:
- **Async-based Architecture**: Non-blocking interaction processing for distributed training
- **Instance Management**: Stateful session handling with unique instance IDs for concurrent interactions
- **SGLang Integration**: Seamless integration with SGLang rollout system for multi-turn conversations
- **Configuration-driven**: Dynamic agent loading via YAML configuration files
- **Multi-Interaction Support**: Registry system enabling multiple named interactions per rollout
- **Sample-Level Selection**: Each sample can specify which interaction to use via configuration
- **Reward Integration**: Turn-level scoring mechanism integrated with verl's reward system
Architecture
------------
The interaction system follows a plugin-based architecture with clear separation of concerns:
.. code-block::
Interaction Registry System
BaseInteraction (Abstract Interface)
Multiple Named Interactions (e.g., Gsm8kInteraction, CustomInteraction)
SGLang Rollout Integration (interaction_map)
Sample-Level Interaction Selection
Async Request Lifecycle Management
Core Components
~~~~~~~~~~~~~~~
**Interaction Registry System**
The interaction registry system allows loading and managing multiple named interactions:
.. code-block:: python
from verl.interactions.utils.interaction_registry import initialize_interactions_from_config
# Load multiple interactions from config
interaction_map = initialize_interactions_from_config("config.yaml")
# Access specific interaction by name
gsm8k_interaction = interaction_map["gsm8k"]
custom_interaction = interaction_map["custom_solver"]
**BaseInteraction Interface**
All interaction agents must implement the ``BaseInteraction`` abstract class:
.. code-block:: python
from verl.interactions.base import BaseInteraction
from typing import Dict, Any, List, Tuple, Optional
class BaseInteraction:
def __init__(self, config: Dict[str, Any]):
self.config = config
self.name: str = config.get("name", "interaction_agent")
async def start_interaction(self, instance_id: Optional[str] = None, **kwargs) -> str:
"""Initialize interaction session, return instance_id"""
async def generate_response(self, instance_id: str, messages: List[Dict[str, Any]], **kwargs) -> Tuple[bool, str, float, Dict[str, Any]]:
"""Generate response, return (should_terminate, response, score, metadata)"""
async def calculate_score(self, instance_id: str, **kwargs) -> float:
"""Calculate turn-level score for RL training"""
async def finalize_interaction(self, instance_id: str, **kwargs) -> None:
"""Clean up resources"""
**Request Lifecycle**
The interaction system integrates with SGLang's async rollout via state management:
1. ``PENDING`` → Initialize interaction via ``start_interaction()``
2. ``GENERATING`` → Model generates response
3. ``INTERACTING`` → Process response via ``generate_response()``
4. ``GENERATING`` → Continue if not terminated, otherwise ``COMPLETED``
Configuration
-------------
**Basic Setup**
Enable interaction in your rollout configuration:
.. code-block:: yaml
actor_rollout_ref:
rollout:
multi_turn:
enable: true
interaction_config_path: "path/to/interaction_config.yaml"
max_user_turns: 10
max_assistant_turns: 10
**Interaction Configuration File**
Create an interaction configuration file (e.g., ``interaction_config.yaml``):
**Single Interaction (Legacy Format)**
.. code-block:: yaml
interaction:
- name: "gsm8k"
class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
config: {}
**Multiple Interactions (New Format)**
.. code-block:: yaml
interaction:
- name: "gsm8k"
class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
config: {}
- name: "custom_solver"
class_name: "custom.interactions.CustomInteraction"
config:
solver_type: "advanced"
timeout: 30
- name: "code_verifier"
class_name: "verl.interactions.base.BaseInteraction"
config:
verification_mode: "strict"
**Automatic Name Generation**
If no ``name`` field is provided, the system will automatically generate one from the class name:
.. code-block:: yaml
interaction:
- class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
config: {}
# Automatically generates name: "gsm8k"
The system will dynamically load all specified interaction classes and make them available by name.
Implementation Example: GSM8K
-----------------------------
The GSM8K interaction demonstrates a complete implementation for math problem-solving scenarios:
.. code-block:: python
from verl.interactions.base import BaseInteraction
from verl.utils.reward_score import gsm8k
from uuid import uuid4
class Gsm8kInteraction(BaseInteraction):
def __init__(self, config: dict):
super().__init__(config)
self._instance_dict = {}
async def start_interaction(self, instance_id=None, ground_truth=None, **kwargs):
if instance_id is None:
instance_id = str(uuid4())
self._instance_dict[instance_id] = {
"response": "",
"ground_truth": ground_truth,
"reward": 0.0,
}
return instance_id
async def generate_response(self, instance_id, messages, **kwargs):
# Extract last user message content
content = ""
for item in reversed(messages):
if item.get("role") == "assistant":
content = item.get("content", "")
break
# Ensure GSM8K format (#### prefix)
self._instance_dict[instance_id]["response"] = content
reward = await self.calculate_score(instance_id)
if reward == 1.0:
return True, "Your response is correct!", 1.0, {}
else:
return False, "Your response is incorrect! You need to reflect on your answer and try again.", 0.0, {}
async def calculate_score(self, instance_id, **kwargs):
return gsm8k.compute_score(
self._instance_dict[instance_id]["response"],
self._instance_dict[instance_id]["ground_truth"],
method="strict", format_score=0.0, score=1.0,
)
async def finalize_interaction(self, instance_id, **kwargs):
del self._instance_dict[instance_id]
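As a quick sanity check outside of training, the interaction can be driven manually with a short async script (the ground truth and message contents below are made up):

.. code-block:: python

    import asyncio
    from verl.interactions.gsm8k_interaction import Gsm8kInteraction

    async def main():
        interaction = Gsm8kInteraction({"name": "gsm8k"})
        instance_id = await interaction.start_interaction(ground_truth="4")
        messages = [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "#### 4"},
        ]
        should_terminate, response, reward, _ = await interaction.generate_response(instance_id, messages)
        print(should_terminate, response, reward)
        await interaction.finalize_interaction(instance_id)

    asyncio.run(main())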
Training Integration
--------------------
**Training Script Configuration**
Include interaction configuration in your training command:
.. code-block:: bash
python3 -m verl.trainer.main_ppo \\
--config-path="$CONFIG_PATH" \\
--config-name='gsm8k_multiturn_grpo_w_interaction' \\
algorithm.adv_estimator=grpo \\
data.train_batch_size=512 \\
data.return_raw_chat=True \\
actor_rollout_ref.rollout.name=sglang \\
actor_rollout_ref.rollout.multi_turn.interaction_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/interaction_config/gsm8k_interaction_config.yaml" \\
trainer.total_epochs=15
**Data Requirements**
Ensure your dataset includes interaction parameters with the ``name`` field for interaction selection:
.. code-block:: python
# Dataset should include interaction_kwargs in non_tensor_batch
interaction_kwargs = [
{"name": "gsm8k", "query": "What is 2+2?", "ground_truth": "4"},
{"name": "custom_solver", "query": "Solve: x^2 + 5x + 6 = 0", "ground_truth": "x = -2, -3"},
{"name": "gsm8k", "query": "What is 3+3?", "ground_truth": "6"},
]
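These fields can be produced during data preprocessing, in the same ``make_map_fn`` style used for parquet preparation. The sketch below places ``interaction_kwargs`` under ``extra_info``; treat the exact placement and any fields other than ``name`` as assumptions to be checked against your dataset class:

.. code-block:: python

    # Sketch: emit interaction_kwargs while building each parquet row.
    def make_map_fn(split):
        def process_fn(example, idx):
            question = example["question"]
            ground_truth = example["ground_truth"]
            return {
                "data_source": "openai/gsm8k",
                "prompt": [{"role": "user", "content": question}],
                "ability": "math",
                "reward_model": {"style": "rule", "ground_truth": ground_truth},
                "extra_info": {
                    "split": split,
                    "index": idx,
                    "interaction_kwargs": {
                        "name": "gsm8k",
                        "query": question,
                        "ground_truth": ground_truth,
                    },
                },
            }
        return process_fn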
**Sample-Level Interaction Selection**
Each sample can specify which interaction to use via the ``name`` field. This enables flexible training scenarios where different samples use different interaction strategies:
.. code-block:: python
# Example: Math problems use GSM8K interaction, code problems use code verifier
data_samples = [
{
"prompt": "What is 15% of 200?",
"interaction_kwargs": {
"name": "gsm8k",
"query": "What is 15% of 200?",
"ground_truth": "30"
}
},
{
"prompt": "Write a function to check if a number is prime",
"interaction_kwargs": {
"name": "code_verifier",
"code_type": "python",
"expected_behavior": "return True for prime numbers"
}
}
]
**Backward Compatibility**
If no ``name`` field is provided in ``interaction_kwargs``, the system defaults to ``"gsm8k"`` for backward compatibility.
Best Practices
--------------
**Resource Management**
- Always implement proper cleanup in ``finalize_interaction()``
- Use unique instance IDs to avoid conflicts in concurrent training
- Handle edge cases like empty messages or malformed content
**Performance Optimization**
- Keep interaction logic lightweight to avoid blocking training
- Use async/await properly to maintain non-blocking behavior
- Consider caching expensive computations within interaction instances
**Testing**
Comprehensive testing is essential for interaction systems:
.. code-block:: python
import pytest
from unittest.mock import patch
@pytest.mark.asyncio
async def test_interaction_workflow():
interaction = YourInteraction({})
# Test complete workflow
instance_id = await interaction.start_interaction(ground_truth="expected_answer")
messages = [{"role": "user", "content": "user_content"}, {"role": "assistant", "content": "assistant_response"}]
should_terminate, response, reward, metadata = await interaction.generate_response(instance_id, messages)
assert should_terminate in [True, False]
assert isinstance(reward, float)
await interaction.finalize_interaction(instance_id)
Advanced Usage
--------------
**Multi-Interaction Training Strategies**
You can design sophisticated training scenarios using multiple interactions:
.. code-block:: python
# Example: Progressive difficulty with different interaction agents
class MathTrainingPipeline:
def create_interaction_config(self):
return {
"interaction": [
{
"name": "basic_math",
"class_name": "verl.interactions.gsm8k_interaction.Gsm8kInteraction",
"config": {"difficulty": "easy"}
},
{
"name": "advanced_math",
"class_name": "custom.interactions.AdvancedMathInteraction",
"config": {"difficulty": "hard", "allow_hints": True}
},
{
"name": "competition_math",
"class_name": "custom.interactions.CompetitionMathInteraction",
"config": {"time_limit": 300, "show_steps": False}
}
]
}
def create_curriculum_data(self, epoch):
if epoch < 5:
return [{"name": "basic_math", ...} for _ in samples]
elif epoch < 10:
return [{"name": "advanced_math", ...} for _ in samples]
else:
return [{"name": "competition_math", ...} for _ in samples]
**Custom Scoring Functions**
You can integrate custom reward functions:
.. code-block:: python
async def calculate_score(self, instance_id, **kwargs):
response = self._instance_dict[instance_id]["response"]
ground_truth = self._instance_dict[instance_id]["ground_truth"]
# Custom evaluation logic
if custom_evaluation_function(response, ground_truth):
return 1.0
else:
return 0.0
**Multi-step Interactions**
For complex scenarios requiring multiple feedback rounds:
.. code-block:: python
async def generate_response(self, instance_id, messages, **kwargs):
instance = self._instance_dict[instance_id]
instance["attempts"] += 1
# Evaluate current response
reward = await self.calculate_score(instance_id)
if reward > 0.8:
return True, "Excellent work!", reward, {}
elif instance["attempts"] < 3:
return False, "Good attempt, but try to improve...", reward, {}
else:
return True, "Maximum attempts reached.", reward, {}
Troubleshooting
---------------
**Common Issues**
1. **Instance ID Conflicts**: Ensure unique instance IDs across concurrent sessions
2. **Memory Leaks**: Always call ``finalize_interaction()`` to clean up resources
3. **Blocking Operations**: Keep interaction logic async and non-blocking
4. **Configuration Errors**: Verify interaction config path and class name are correct
5. **Interaction Name Conflicts**: Ensure all interactions have unique names in the configuration
6. **Missing Interaction**: Verify the ``name`` field in ``interaction_kwargs`` matches available interactions
7. **Backward Compatibility**: When migrating from single to multi-interaction, add ``name`` fields to existing data
**Debugging**
Enable debug logging to trace interaction flow:
.. code-block:: bash
export VERL_LOGGING_LEVEL=DEBUG
**Performance Monitoring**
Monitor interaction performance impact on training throughput and adjust accordingly.
Related Documentation
---------------------
- :doc:`multiturn`: Basic multi-turn rollout configuration
- :doc:`sandbox_fusion`: Tool integration with SGLang
- :doc:`search_tool_example`: Search tool implementation example
Multi-turn Rollout Support
==========================
Last updated: 06/27/2025.
Basic Configuration
~~~~~~~~~~~~~~~~~~~
To enable multi-turn rollout, make sure to configure the following fields in your rollout configuration:
.. code-block:: yaml
actor_rollout_ref:
rollout:
multi_turn: True
name: "sglang"
This configuration activates the SGLang engine for multi-turn interaction during rollout.
Custom Tool Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~
For custom environment interaction tools, you can implement your own tools based on ``verl.tools.base_tool.BaseTool``. Then, specify your tool configurations in a YAML file:
.. code-block:: yaml
tools:
- class_name: ""
config:
type: native
tool_schema:
You may refer to GSM8KTool_example_configuration_, which is one example of the tool configurations. Its implementation can be found in gsm8k_tool.py_.
Finally, set the ``tools_config_file`` in your rollout config:
.. code-block:: yaml
actor_rollout_ref:
rollout:
tool_kwargs:
tools_config_file: <path_to_tool_yaml_file>
This allows integration of customized tool behaviors during actor rollout steps.
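As a starting point, a minimal custom tool could look like the sketch below. The method names mirror the ``SandboxFusionTool`` example in the Sandbox Fusion documentation; verify the exact signatures against ``verl.tools.base_tool.BaseTool`` before relying on them:

.. code-block:: python

    from typing import Any, Optional, Tuple
    from uuid import uuid4

    from verl.tools.base_tool import BaseTool

    class EchoTool(BaseTool):
        """Toy tool that simply echoes back its input; for illustration only."""

        def __init__(self, config: dict, tool_schema):
            super().__init__(config, tool_schema)
            self._instance_dict = {}

        async def create(self, instance_id: Optional[str] = None, **kwargs) -> str:
            instance_id = instance_id or str(uuid4())
            self._instance_dict[instance_id] = {}
            return instance_id

        async def execute(self, instance_id: str, parameters: dict[str, Any], **kwargs) -> Tuple[str, float, dict]:
            # Returns (tool_response, step_reward, metrics).
            return f"echo: {parameters.get('text', '')}", 0.0, {}

        async def calc_reward(self, instance_id: str, **kwargs) -> float:
            return 0.0

        async def release(self, instance_id: str, **kwargs) -> None:
            self._instance_dict.pop(instance_id, None)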
If you want rollout with simulated interaction, you can set the ``interaction_config_file`` in your rollout config:
.. code-block:: yaml
interaction:
- class_name: ""
config: {}
.. code-block:: yaml
actor_rollout_ref:
rollout:
interaction_config_file: <path_to_interaction_yaml_file>
If your tool creates multi-modal inputs, you should return a list of multi-modal inputs in your tool.execute() implementation.
Image and video should be processed before returning. For example, if you are using Qwen2.5-VL, you can use the following code to get the representations:
.. code-block:: python
async def execute(self, ...) -> Tuple[str | Dict[str, Any], float, dict]:
...
from verl.utils.dataset.vision_utils import process_image, process_video
img1 = process_image(img1)
video1 = process_video(video1)
# vLLM expects the keys "image" and "video" (not "images"/"videos"), so use "image"/"video" to specify the lists of images/videos
# link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
return {"image": [img1, ...], "video": [video1, ...], "text": "..."}, 0, {}
Remember to set ``return_multi_modal_inputs: False`` in your dataset config so that the multi-modal inputs are processed correctly during rollout.
Refer to the `Handling Multi-Modal Inputs in Datasets`_ section for more details.
MCP Tool Configuration
~~~~~~~~~~~~~~~~~~~~~~
For MCP interaction tools, you can flexibly configure them using a YAML file. The typical setup is as follows:
.. code-block:: yaml
tools:
- class_name: ""
config:
type: mcp
mcp:
mcp_servers_config_path: ./mcp_server.json
tool_selected_list: {}
The ``tool_selected_list`` field is optional and specifies which tools to use from the servers. If you want to enable all available tools, simply omit this attribute. Besides, ``mcp_servers_config_path`` points to a JSON file containing the MCP server configurations. For example:
.. code-block:: json
{
"mcpServers": {
"SSE Server": {
"url": "your_server_url",
"auth_token": "your_server_api_token"
},
"STDIO Server": {
"command": "npx",
"args": ["-y", "server-mcp@0.2.1"],
"env": {
"SERVER_API_KEY": "your_server_api_token"
}
}
}
}
Since the content formats returned by the MCP server may vary, users can inherit from ``MCPBaseTool`` and override the ``_parse_tool_result`` method to implement custom parsing logic.
.. code-block:: python
class MCPYourTool(MCPBaseTool):
def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
super().__init__(config, tool_schema)
def _parse_tool_result(self, content: list) -> Tuple[str, dict]:
...
Overall, you may refer to mcp_search_tool.py_ and mcp_tool_config.yaml_ for custom implementation and configuration.
Multi-turn Tokenization
~~~~~~~~~~~~~~~~~~~~~~~
Tokenizing multi-turn rollouts poses a challenge: after applying the chat template and tokenizing the full message list, it's hard to identify which tokens belong to assistant messages. Since the token list is flat, it lacks direct alignment with the message roles.
To address this, we adopt a **delta-based tokenization** strategy. Each time the LLM generates a new message, we:
1. Apply the chat template to all prior messages (`messages[:i]`).
2. Apply the chat template again including the latest message (`messages[:i+1]`).
3. Tokenize only the *delta* between these two serialized message strings.
This ensures that only tokens generated by the assistant are included in the loss mask.
.. code-block:: python
# When using tokenizer
# Exclude the assistant prompt (e.g., "<|im_start|>assistant") from the loss by setting add_generation_prompt=True
prev = tokenizer.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)  # Mask only the new assistant tokens
.. code-block:: python
# When using processor
# Exclude the assistant prompt (e.g., "<|im_start|>assistant") from the loss by setting add_generation_prompt=True
prev = processor.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
prev_input_ids = processor(text=prev, images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
curr = processor.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
curr_input_ids = processor(text=curr, images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
delta_ids = curr_input_ids[len(prev_input_ids):]
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)  # Mask only the new assistant tokens
While we've validated this produces consistent results with full message tokenization, future models' chat template could break compatibility. To guard against silent inconsistencies, we compare the delta-based tokenization with full-tokenization results by default at the end of each rollout.
If you see the following warning, you can check the mismatched substring in the log:
.. code-block::
Inconsistent training and inference tokenization detected. This may lead to unexpected behavior during training. Please review your chat template to determine if this is intentional. For more information, refer to the multiturn README.md.
The tokenization sanity check mode can be configured using the ``actor_rollout_ref.rollout.multi_turn.tokenization_sanity_check_mode`` parameter, which accepts the following values:
- ``strict`` (default): Performs strict comparison between delta-based and full tokenization results, raising warnings for any differences.
- ``ignore_strippable``: Ignores differences in whitespace characters (``\n``, ``\t``, ``\r``, spaces) while still checking for meaningful text mismatches. This is useful when debugging chat template issues where whitespace variations are expected and acceptable.
- ``disable``: Completely disables the tokenization sanity check. Only use this if you have thoroughly validated that tokenization discrepancies are expected and won't impact training.
Example configuration:
.. code-block:: yaml
actor_rollout_ref:
rollout:
multi_turn:
tokenization_sanity_check_mode: "ignore_strippable" # Choose from: "disable", "ignore_strippable", "strict"
Handling Multi-Modal Inputs in Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If your dataset includes multi-modal inputs (such as images or videos), you can control whether these are pre-processed and included in each sample by setting the ``return_multi_modal_inputs`` flag in your dataset config (used by ``RLHFDataset``).
- ``return_multi_modal_inputs: True`` (default): The dataset will pre-process and include a ``multi_modal_inputs`` dictionary for each sample. This dict contains the model-ready representations (e.g., image tensors, video tensors, etc.) as produced by your processor. This is useful for single-turn or SFT-style training, where the model expects all modalities to be present in the batch.
- ``return_multi_modal_inputs: False``: The dataset will not include the ``multi_modal_inputs`` field. This is recommended for multi-turn RL or tool-augmented rollouts, where the model may generate new multi-modal inputs dynamically during rollout, and you want to avoid conflicts or redundant data in the batch.
Special Cases
^^^^^^^^^^^^^
Some models (e.g., Qwen/QwQ-32B and Qwen3 series) remove internal reasoning content during chat template rendering. As a result, the message content can vary across turns, making the delta-based tokenization inaccurate.
For example, for the following conversation:
.. code-block:: python
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2 + 2?"},
{"role": "assistant", "content": "<think>user asked about a simple math question.</think> 2 + 2 = 4."},
{"role": "user", "content": "Explain why."},
{"role": "assistant", "content": "<think>user wants to know the reasoning behind the answer. Search for a good explanation</think>",
"tool_calls": [{"id": "tool1", "type": "search", "arguments": {"query": "Why is 2 + 2 = 4?"}}]},
{"role": "tool", "content": "The sum of two and two is four because it is a basic arithmetic operation."},
{"role": "assistant", "content": "<think>The tool provided a good explanation.</think>The sum of two and two is four because it is a basic arithmetic operation."}
]
1. Qwen/QwQ-32B will remove all reasoning content except the last assistant message after applying the chat template.
.. code-block:: text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>The tool provided a good explanation.</think> The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>
2. Qwen3 series will remove all reasoning content before the last user message.
.. code-block:: text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<think>
user wants to know the reasoning behind the answer. Search for a good explanation
</think>
<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>
The tool provided a good explanation.
</think>
The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>
To handle this, we fall back to a **fixed base conversation** containing only a single system and user message. Since this base doesn't include assistant messages or reasoning content, it remains consistent across turns.
.. code-block:: python
BASE_CHAT_HISTORY = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "I am a user."}
]
prev = tokenizer.apply_chat_template(BASE_CHAT_HISTORY, add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template([*BASE_CHAT_HISTORY, messages[i]], add_generation_prompt=False, tokenize=False)
delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += delta_ids
loss_mask += [1] * len(delta_ids)
This method works well for Qwen3 series. However, Qwen/QwQ-32B currently has a bug in its chat template. A fix_ has been proposed but not yet adopted. Until then, use the following command to download the fixed model revision:
.. code-block:: bash
pip install huggingface_hub
huggingface-cli download Qwen/QwQ-32B --revision refs/pr/81
.. _fix: https://huggingface.co/Qwen/QwQ-32B/discussions/81
Discrepancy Between Training and Inference Templates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Although the above approach fixes the delta mismatch issue, the removal of reasoning content in the inference-time chat template introduces a new discrepancy: training uses the full reasoning content, while inference does not.
This mismatch can affect model performance in unpredictable ways. To avoid it, we default to using the full response (including reasoning) for both training and rollout.
However, this approach comes with trade-offs:
1. Long reasoning contents can easily exceed the model's context window, especially in multi-turn rollout.
2. There is now a mismatch between rollout and the production environment: models will not have reasoning content from past turns if you use the default chat template in production.
We are still evaluating the impact of these issues. If you experience context length problems or prefer rollouts that match production (i.e., exclude reasoning), you can enable:
``actor_rollout_ref.rollout.multi_turn.use_inference_chat_template = True``
GSM8K Multi-turn Training Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See the training performance of multi-turn rollout on the GSM8K task HERE_.
.. _HERE: https://wandb.ai/zhaochenyang20/gsm8k_async_rl/runs/1ro1r7om?nw=nwuserzhaochenyang20
.. _GSM8KTool_example_configuration: https://github.com/volcengine/verl/blob/main/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml
.. _gsm8k_tool.py: https://github.com/volcengine/verl/blob/main/verl/tools/gsm8k_tool.py
.. _mcp_search_tool.py: https://github.com/volcengine/verl/blob/main/verl/tools/mcp_search_tool.py
.. _mcp_tool_config.yaml: https://github.com/volcengine/verl/blob/main/examples/sglang_multiturn/config/tool_config/mcp_tool_config.yaml
Interaction System
~~~~~~~~~~~~~~~~~~
For dynamic conversational feedback during RL training, see:
.. toctree::
:maxdepth: 1
interaction_system
Search Tool Integration
~~~~~~~~~~~~~~~~~~~~~~~
.. toctree::
:maxdepth: 1
search_tool_example
Code Walkthrough
~~~~~~~~~~~~~~~~~~~~~~~
If you want to learn more in depth about the code execution flow, please read https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/rlhf/verl/multi-turn/code-walk-through
===============================
Sandbox Fusion Tool Integration
===============================
Last updated: 06/10/2025.
Motivations
===========
- As users of verl, we want to allow the model to call certain tools during Actor rollout, incorporating the results into the training process.
- A colleague from ByteDance proposed a paper aimed at enhancing model capability through code execution tools.
- We aim to support tool-calling capabilities of inference engines using `sandbox-fusion` as the code execution system, providing the community with a reimplementation of `retools`.
Reward Compute with Sandbox Fusion + FaaS Integration
=====================================================
- In current datasets and tasks, similar work already exists (e.g., Prime), which uses local processes as runners to execute model-generated code for reward computation.
- On this basis, #1429 has advanced the design by integrating FaaS as the runner for reward computation.
Goals
=====
- Adapt to the `sglang` tool-calling protocol and define tools for sandbox fusion.
- Integrate with the `async-rollout` process, ensuring sandbox fusion tools follow asyncIO conventions.
- Design and implement a basic rate limiter to prevent issues such as 429 errors.
Non-Goals
=========
- Training effectiveness is out of scope.
- Observability metrics are not considered.
- Distributed failover and component fault tolerance are not addressed.
Design Details
==============
Tool Schema Definition
----------------------
- Currently, only code execution is considered, requiring a `code` field in the JSON from the model.
- Only Python code is supported for now, so no `language` parameter is defined.
.. code-block:: python
OpenAIFunctionToolSchema(
type="function",
function=OpenAIFunctionSchema(
name="code_interpreter",
description="A tool for executing code.",
parameters=OpenAIFunctionParametersSchema(
type="object",
properties={
"code": OpenAIFunctionPropertySchema(
type="string",
description="The code to execute.",
enum=None,
)
},
required=["code"],
),
strict=False,
)
)
Configuration Parameters
--------------------------
+----------------------------+--------------------------------------------------------------+
| Parameter Name | Description |
+============================+==============================================================+
| `num_workers` | Number of worker threads/processes per DP to request runner. |
+----------------------------+--------------------------------------------------------------+
| `rate_limit` | Global limit of concurrent code executions. Default: 10 |
+----------------------------+--------------------------------------------------------------+
| `default_timeout` | Timeout (in seconds) for each code execution. Default: 30 |
+----------------------------+--------------------------------------------------------------+
| `default_language` | Default programming language. Default: "python" |
+----------------------------+--------------------------------------------------------------+
| `enable_global_rate_limit` | Whether to enable global rate limiting. Default: True |
+----------------------------+--------------------------------------------------------------+
| `sandbox_fusion_url` | URL for the veFaas sandbox execution service |
+----------------------------+--------------------------------------------------------------+
Rate Limiting Design
-----------------------
Objective:
- Limit the number of inflight requests using a token bucket model.
- Ensure ordered submission to code runners to avoid starvation due to backoff.
Design Highlights:
- Use Ray Global Actor as a singleton distributed counter at cluster level.
- Semaphore used for counting, with `acquire` and `release` in separate thread pools to preserve order.
- Use Ray’s cloud-pickle to serialize functions for decoupled `ExecutionWorker`.
.. code-block:: python
@ray.remote(concurrency_groups={"acquire": 1,"release": 10})
class TokenBucketWorker:
def __init__(self, rate_limit: int):
self.rate_limit = rate_limit
self.current_count = 0
self._semaphore = threading.Semaphore(rate_limit)
@ray.method(concurrency_group="acquire")
def acquire(self):
self._semaphore.acquire()
self.current_count += 1
@ray.method(concurrency_group="release")
def release(self):
self._semaphore.release()
self.current_count -= 1
def get_current_count(self):
return self.current_count
class ExecutionWorker:
def __init__(self, enable_global_rate_limit=True, rate_limit=10):
self.rate_limit_worker = self._init_rate_limit(rate_limit) if enable_global_rate_limit else None
def _init_rate_limit(self, rate_limit):
return TokenBucketWorker.options(name="rate-limiter", get_if_exists=True).remote(rate_limit)
def execute(self, fn: Callable[..., T], *fn_args, **fn_kwargs) -> T:
with ExitStack() as stack:
stack.callback(self.rate_limit_worker.release.remote)
ray.get(self.rate_limit_worker.acquire.remote())
try:
return fn(*fn_args, **fn_kwargs)
except Exception as e:
logger.warning(f"Error when executing code: {e}")
def init_execution_pool(num_workers: int, enable_global_rate_limit=True, rate_limit=10, mode: PoolMode=PoolMode.ThreadMode):
if mode == PoolMode.ThreadMode:
return ray.remote(ExecutionWorker).options(max_concurrency=num_workers).remote(
enable_global_rate_limit=enable_global_rate_limit,
rate_limit=rate_limit
)
else:
raise NotImplementedError("Process mode is not implemented yet")
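A rough usage sketch of the pieces above, assuming Ray is already initialized; ``run_user_code`` is a stand-in for the actual sandbox request:

.. code-block:: python

    import ray

    ray.init(ignore_reinit_error=True)

    def run_user_code(snippet: str) -> str:
        # Stand-in for the real sandbox-fusion call.
        return f"executed: {snippet}"

    pool = init_execution_pool(num_workers=4, enable_global_rate_limit=True,
                               rate_limit=10, mode=PoolMode.ThreadMode)
    futures = [pool.execute.remote(run_user_code, f"print({i})") for i in range(8)]
    print(ray.get(futures))  # at most `rate_limit` snippets run concurrently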
Tool Implementation
-------------------
- Use `instance_id` to identify requests across multiple dialogue rounds.
- Use `execution_pool` to implement async invocation.
- Cleanup state after rollout completion.
.. code-block:: python
class SandboxFusionTool(BaseTool):
def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
...
self.execution_pool = init_execution_pool(...)
...
async def create(self, instance_id: Optional[str] = None, ...):
...
async def execute(self, instance_id: str, parameters: dict[str, Any], **kwargs) -> Tuple[str, float, dict]:
code = parameters.get("code", "")
timeout = parameters.get("timeout", self.default_timeout)
language = parameters.get("language", self.default_language)
if not isinstance(code, str):
code = str(code)
result = await self.execution_pool.execute.remote(self.execute_code,instance_id,code,timeout,language)
self._instance_dict[instance_id]["reward"].append(result.strip())
return result, result, {}
def execute_code(self,instance_id,code,timeout=30,language="python"):
result_status, metadata = _process_single_case(0, None, None,self.sandbox_fusion_url, code, timeout, language)
# we should always expect this since we don't have correct answer
if metadata["run_status"] == "Finished":
actual_output = metadata["stdout"] if metadata["stdout"] is not None else ""
return actual_output
else:
return "no stdout here"
async def calc_reward(self, instance_id: str, ...):
...
async def release(self, instance_id: str, ...):
...
Test Plan
=========
Unit Tests
----------
- **test_tools_registration**: Test tool registration and initialization.
- **test_rollout_req_creation**: Validate that `AsyncRolloutReq` is built correctly.
- **test_over_size_case**: Ensure rollout terminates early when exceeding `max_seq_len`.
- **test_tool_call_basic_case**: Mock `sglang` output, validate tool call and result.
- **test_tool_call_batch_case**: Test batch processing of tool calls.
- **test_basic_multi_process_init**: Validate Ray global actor behaves as singleton.
- **TestSingleNodeRateLimiterCase**: Verify rate limiter works in single-node mode.
- **test_rotten_execution**: Ensure rate limiter recovers from function errors.
- **TestMultiNodeRateLimiterCase**: Verify behavior in multi-node environments.
e2e Tests
----------
We provide an e2e test script in the ``tests/special_e2e`` folder: ``tests/special_e2e/run_gsm8k_fsdp_sgl_multiturn_sf_tool.sh``.
By setting ``trainer.rollout_data_dir`` you can dump the rollout data to local disk. Here is a sample taken from the rollout data:
.. code-block:: python
{
"input": "
system\nYou are a math expert. You are given a question and you need to solve it step by step. Reasoning step by step before any tool call. You should use the `calc_gsm8k_reward` tool after step by step solving the question, before generate final answer at least once and refine your answer if necessary. Put your final answer in the format of `#### <answer>`.\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"code_interpreter\", \"description\": \"A tool for executing code.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"code\": {\"type\": \"string\", \"description\": \"The code to execute.\", \"enum\": null}}, \"required\": [\"code\"]}, \"strict\": false}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n
user\nJohn gets a bonus that's the same percentage every year. Last year he made $100,000 and got a $10,000 bonus. This year he makes $200,000. How much will John make this year when adding both his total pay and bonus together? Let's think step by step and output the final answer after `####`.\n
assistant\n",
"output": "assistant\n<think>\nOkay, let's see. John gets a bonus that's the same percentage every year. Last year, he made $100,000 and got a $10,000 bonus. This year, he makes $200,000. The question is asking how much he'll make this year when adding both his total pay and bonus together.\n\nFirst, I need to figure out the percentage bonus he gets. Since the bonus is the same percentage each year, if I can find that percentage based on last year's details, I can apply it to this year's earnings to find this year's bonus. Then, I'll add that bonus to his total pay this year.\n\nLast year, he earned $100,000 and got a $10,000 bonus. To find the percentage, I can divide the bonus by his total earnings. So, 10,000 divided by 100,000. Let me calculate that.\n\n10,000 / 100,000 equals 0.1, which is 10% as a percentage. So, his bonus is 10% of his pay each year.\n\nThis year, he makes $200,000. So, the bonus this year would be 10% of 200,000. Let me compute that using the code interpreter to ensure accuracy.\n
<tool_call>\n{\"name\": \"code_interpreter\", \"arguments\": {\"code\": \"total_pay_this_year = 200000\\nbonus_percentage = 10 / 100\\nbonus_this_year = total_pay_this_year * bonus_percentage\\ntotal_income_this_year = total_pay_this_year + bonus_this_year\\nprint(total_income_this_year)\", \"executes\": \"True\"}}\n</tool_call>\n
tool\n220000.0\n\n
assistant\n<think>\nSo the code shows that John's total income this year would be $220,000. This is consistent with the 10% bonus on last year's pay of $100,000, which was $10,000. Therefore, this year's bonus of $20,000 added to his total pay of $200,000 results in a total income of $220,000.\n\nI need to verify that there are no other factors at play. The problem states it's a constant percentage bonus each year, so there's no mention of changing percentages or other variables. Therefore, applying a straightforward percentage of 10% on this year's partial salary amount would result in $20,000. Adding that to his total pay gives $220,000. The calculation is correct and aligns with the question's parameters.\n</think>\n\n#### 220000.0",
"score": 0,
"step": 1
}
Here is the same sample in a readable format:
.. code-block:: python
[system]
You are a math expert. You are given a question and you need to solve it step by step. Reasoning step by step before any tool call. You should use the `calc_gsm8k_reward` tool after step by step solving the question, before generate final answer at least once and refine your answer if necessary. Put your final answer in the format of `#### <answer>`.
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "code_interpreter", "description": "A tool for executing code.", "parameters": {"type": "object", "properties": {"code": {"type": "string", "description": "The code to execute.", "enum": null}}, "required": ["code"]}, "strict": false}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
[user]
John gets a bonus that's the same percentage every year. Last year he made $100,000 and got a $10,000 bonus. This year he makes $200,000. How much will John make this year when adding both his total pay and bonus together? Let's think step by step and output the final answer after `####`.
[assistant]
<think>
Okay, let's see. John gets a bonus that's the same percentage every year. Last year, he made $100,000 and got a $10,000 bonus. This year, he makes $200,000. The question is asking how much he'll make this year when adding both his total pay and bonus together.
First, I need to figure out the percentage bonus he gets. Since the bonus is the same percentage each year, if I can find that percentage based on last year's details, I can apply it to this year's earnings to find this year's bonus. Then, I'll add that bonus to his total pay this year.
Last year, he earned $100,000 and got a $10,000 bonus. To find the percentage, I can divide the bonus by his total earnings. So, 10,000 divided by 100,000. Let me calculate that.
10,000 / 100,000 equals 0.1, which is 10% as a percentage. So, his bonus is 10% of his pay each year.
This year, he makes $200,000. So, the bonus this year would be 10% of 200,000. Let me compute that using the code interpreter to ensure accuracy.
<tool_call>
{"name": "code_interpreter", "arguments": {"code": "total_pay_this_year = 200000\nbonus_percentage = 10 / 100\nbonus_this_year = total_pay_this_year * bonus_percentage\ntotal_income_this_year = total_pay_this_year + bonus_this_year\nprint(total_income_this_year)", "executes": "True"}}
</tool_call>
[tool]
220000.0
[assistant]
<think>
So the code shows that John's total income this year would be $220,000. This is consistent with the 10% bonus on last year's pay of $100,000, which was $10,000. Therefore, this year's bonus of $20,000 added to his total pay of $200,000 results in a total income of $220,000.
I need to verify that there are no other factors at play. The problem states it's a constant percentage bonus each year, so there's no mention of changing percentages or other variables. Therefore, applying a straightforward percentage of 10% on this year's partial salary amount would result in $20,000. Adding that to his total pay gives $220,000. The calculation is correct and aligns with the question's parameters.
</think>
#### 220000.0
You can also use the `RolloutViewer` TUI tool to view the dumped rollout data:
.. code-block:: bash
python scripts/rollout_viewer.py ${trainer.rollout_data_dir}
.. image:: https://github.com/user-attachments/assets/e34e5157-2880-4a21-afb2-73885d0dfb11
:alt: RolloutViewer screenshot
=======================
Search Tool Integration
=======================
Last updated: 05/30/2025.
Introduction
------------
- We have added a search tool calling function to Multi-Turn RL, enabling the model to initiate retrieval requests during Actor rollout and directly use retrieval results for training. **We support using a local dense retriever as the retrieval tool, as well as integrating with your own local retrieval engine.**
Quick Reproduction
------------------
Create a New Docker Container
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
docker run \
-it \
--shm-size 32g \
--gpus all \
-v {Huggingface-Cache-Path}:/root/.cache \
--ipc=host \
--network=host \
--privileged \
--name sglang_{your-name} \
lmsysorg/sglang:dev \
/bin/zsh
If you need to restart after exiting the container:
.. code:: bash
docker start -i sglang_{your-name}
Update Python and Configure the Virtual Environment using uv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
apt update
apt install -y python3.10 python3.10-venv
# Create a virtual environment
python3 -m venv ~/.python/verl-multiturn-rollout
# Activate the virtual environment
source ~/.python/verl-multiturn-rollout/bin/activate
# Install uv
python3 -m pip install uv
Install verl Upstream
~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
cd ~
git clone https://github.com/volcengine/verl.git
cd verl
# Install verl
python3 -m uv pip install .
python3 -m uv pip install -r ./requirements_sglang.txt
# Manually install flash-attn
python3 -m uv pip install wheel
python3 -m uv pip install packaging
python3 -m uv pip install flash-attn --no-build-isolation --no-deps
Set Up a Local Retrieval Engine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are using your own local retrieval service, you can skip this
step. We chose the local dense retriever provided in the search-R1
example; detailed instructions are in the `searchR1
docs <https://raw.githubusercontent.com/PeterGriffinJin/Search-R1/refs/heads/main/docs/retriever.md>`__.
In brief:
- The GPU version offers higher accuracy and speed; each GPU uses about
5–7 GB of memory.
- The CPU version can be used for simple testing but has lower
retrieval precision, which will degrade training performance. See the
`retriever
documentation <https://github.com/PeterGriffinJin/Search-R1/blob/main/docs/retriever.md>`__
in search-R1 for details.
- We recommend using Conda to install ``faiss-gpu=1.8.0``; venv may cause errors.
**Note**: To start both the training process and the local retrieval
service, we launch two separate Python environments. The training uses
uv in the verl-multiturn-rollout environment, while the retriever uses
conda to install ``faiss-gpu``.
.. code:: bash
# Download the Miniconda installer script
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
# Install to $HOME/miniconda3 in batch mode
bash ~/miniconda.sh -b -p $HOME/miniconda3
# Activate conda (only in the current shell)
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
# (Optional) Add conda to your default shell startup
conda init
# Reload shell config
source ~/.bashrc
# Create and activate the retriever environment with Python 3.10
conda create -n retriever python=3.10 -y
conda activate retriever
# Install PyTorch (with GPU support) and related libraries
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Install other Python packages
pip install transformers datasets pyserini huggingface_hub
# Install the GPU version of faiss
conda install faiss-gpu=1.8.0 -c pytorch -c nvidia -y
# Install the API service framework
pip install uvicorn fastapi
Download the Indexing and Corpus
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The local retrieval files are large, so prepare sufficient disk space.
The download is about 60–70 GB, and the data takes about 132 GB once uncompressed:
.. code:: bash
conda activate retriever
save_path=/the/path/to/save
python examples/sglang_multiturn/search_r1_like/local_dense_retriever/download.py --save_path $save_path
cat $save_path/part_* > $save_path/e5_Flat.index
gzip -d $save_path/wiki-18.jsonl.gz
Start the Local flat e5 Retrieval Server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. The first startup will download models and load the index.
2. Apart from the download, startup takes about 1–2 minutes.
3. After startup, each GPU uses about 5–7 GB of memory, leaving the rest
for multi-turn RL training.
.. code:: bash
conda activate retriever
index_file=$save_path/e5_Flat.index
corpus_file=$save_path/wiki-18.jsonl
retriever_name=e5
retriever_path=intfloat/e5-base-v2
python examples/sglang_multiturn/search_r1_like/local_dense_retriever/retrieval_server.py \
--index_path $index_file \
--corpus_path $corpus_file \
--topk 3 \
--retriever_name $retriever_name \
--retriever_model $retriever_path \
--faiss_gpu
Set Up WANDB_API_KEY
~~~~~~~~~~~~~~~~~~~~
.. code:: bash
export WANDB_API_KEY={YOUR_WANDB_API_KEY}
# Define a timestamp function
function now() {
date '+%Y-%m-%d-%H-%M'
}
**Preprocess the Dataset**
~~~~~~~~~~~~~~~~~~~~~~~~~~
**Note:** The following data processing and training commands must be
run in the verl-multiturn-rollout environment.
.. code:: bash
python3 examples/data_preprocess/preprocess_search_r1_dataset.py
Testing on 8 x H20
~~~~~~~~~~~~~~~~~~
.. code:: bash
# Ensure the now() function is defined
# Create a logs directory
mkdir -p logs
# Set GPUs and run with a suitable log path
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
nohup bash examples/sglang_multiturn/search_r1_like/run_qwen2.5-3b_instruct_search_multiturn.sh \
trainer.experiment_name=qwen2.5-3b-it_rm-searchR1-like-sgl-multiturn-$(now) \
> logs/searchR1-like$(now).log 2>&1 &
Custom Search Configuration
---------------------------
To enable multi-turn reasoning, set the following fields in your config:
.. code:: yaml
actor_rollout_ref:
rollout:
name: "sglang"
multi_turn:
enable: True
You must specify ``retrieval_service_url`` in ``examples/sglang_multiturn/config/tool_config/search_tool_config.yaml``, and properly configure concurrency. For more details on concurrency, refer to the Sandbox Fusion example:
.. code:: yaml
tools:
- class_name: verl.tools.search_tool.SearchTool
config:
retrieval_service_url: http://127.0.0.1:8000/retrieve
num_workers: 120
rate_limit: 120
timeout: 30
The retriever input/output formats are as follows. If your service
parameters match, only modify ``retrieval_service_url``. You can also
customize in ``search_r1_like_utils.py``.
.. code:: python
Input format:
{
"queries": ["What is Python?", "Tell me about neural networks."],
"topk": 3,
"return_scores": true
}
Output format (when return_scores=True, similarity scores are returned):
{
"result": [
[ # Results for each query
{
"document": doc, "score": score
},
# ... more documents
],
# ... results for other queries
]
}
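
For a quick sanity check of a running retrieval server, a minimal client sketch
like the following can be used. It assumes the server is reachable at
``http://127.0.0.1:8000/retrieve`` (the URL taken from the sample tool config above)
and that the ``requests`` package is installed:

.. code:: python

    import requests

    # Payload mirrors the documented input format above.
    payload = {
        "queries": ["What is Python?", "Tell me about neural networks."],
        "topk": 3,
        "return_scores": True,
    }

    # The endpoint URL is an assumption taken from search_tool_config.yaml.
    resp = requests.post("http://127.0.0.1:8000/retrieve", json=payload, timeout=30)
    resp.raise_for_status()

    # Each query gets back a list of {"document": ..., "score": ...} entries.
    for query, docs in zip(payload["queries"], resp.json()["result"]):
        print(query)
        for item in docs:
            print("  score:", item["score"], "document:", item["document"])
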
Notes
-----
1. The total training time is about 27 hours. The validation
   dataset is very large (51k samples), and each validation run takes about 6000 s,
   so ``val_before_train=False`` is set by default.
The Design of ``verl.single_controller``
==============================================
Last updated: 05/21/2025.
**Author:**\ `Wang Zhang <https://github.com/zw0610>`__
Preface
-------
We prepared this document for developers of ``verl``, particularly those
interested in understanding or contributing to the
``verl.single_controller`` module. It is not intended for end users, but
for contributors seeking to understand the architectural rationale and
internal mechanics.
--------------
Origin
------
The ``single_controller`` module originated from a request I received —
to adapt a toy single-process RLHF script into a distributed system with
minimal changes, while maintaining ease of debugging.
Common practice — such as using PyTorch’s Distributed Data Parallel
(DDP) — typically involves wrapping ``nn.Module`` and launching multiple
processes that execute the same function under different ranks. However,
this approach presents two main limitations in the context of
distributed RLHF:

- difficulty representing the multiple DAGs required by PPO;
- difficulty inspecting intermediate tensors during training.
To maintain debuggability, we opted for a different approach — breaking
the training loop into well-defined stages like ``generate_sequences``,
``compute_advantages``, and so on.
We selected `Ray <https://www.ray.io/>`__ as the initial backend for
``verl`` due to its ability to expose Python class methods as RPC
endpoints. However, Ray’s default model only supports **one method call,
one RPC**, while training LLMs typically requires coordination across
multiple processes.
To hide this fan-out of a single method call into multiple Ray actor
invocations from users, we introduced the following components:
- ``WorkerGroup`` – manages a group of remote workers and provides
a unified interface for multi-process distributed computation;
- ``ResourcePool`` – binds computational resources to worker
processes;
- ``ClassWithArgs`` – enables delayed remote instantiation with
specified initialization arguments.
--------------
A Running Example: ``generate_sequences``
-----------------------------------------
To illustrate the design, we walk through how the ``generate_sequences``
method in the ``ActorRolloutRefWorker`` class is registered and invoked
across distributed workers.
--------------
Step 1: Register with a Decorator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The first step is to define the ``generate_sequences`` method and decorate it
with ``@register``, since it will be called from the driver script.
**Source:**
`fsdp_workers.py <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/workers/fsdp_workers.py#L528>`__
.. code:: python
class ActorRolloutRefWorker(Worker):
...
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def generate_sequences(self, prompts: DataProto):
prompts = prompts.to(torch.cuda.current_device())
...
The ``@register`` decorator adds metadata to the ``generate_sequences``
method. Currently, it doesn’t alter functionality, but attaches
attributes via a magic key (``MAGIC_ATTR``):
**Source:**
`decorator.py <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L411>`__
.. code:: python
def register(dispatch_mode=Dispatch.ALL_TO_ALL, execute_mode=Execute.ALL, blocking=True, materialize_futures=True):
...
def decorator(func):
@wraps(func)
def inner(*args, **kwargs):
if materialize_futures:
args, kwargs = _materialize_futures(*args, **kwargs)
return func(*args, **kwargs)
attrs = {"dispatch_mode": dispatch_mode, "execute_mode": execute_mode, "blocking": blocking}
setattr(inner, MAGIC_ATTR, attrs)
return inner
return decorator
As the code shows, the values of ``dispatch_mode``, ``execute_mode`` and
``blocking`` are attached to the ``generate_sequences`` method.
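
As a self-contained illustration of this pattern (a simplified sketch, not verl's
actual code; the ``MAGIC_ATTR`` value below is hypothetical), the snippet shows how
metadata attached via ``setattr`` can later be read back with ``getattr`` without
changing the method's behavior:

.. code:: python

    from functools import wraps

    MAGIC_ATTR = "magic_attrs"  # hypothetical key; verl defines its own constant

    def register(dispatch_mode="ALL_TO_ALL", execute_mode="ALL", blocking=True):
        def decorator(func):
            @wraps(func)
            def inner(*args, **kwargs):
                return func(*args, **kwargs)
            # Attach metadata; the wrapped method's behavior is unchanged.
            attrs = {"dispatch_mode": dispatch_mode, "execute_mode": execute_mode, "blocking": blocking}
            setattr(inner, MAGIC_ATTR, attrs)
            return inner
        return decorator

    class ToyWorker:
        @register(dispatch_mode="DP_COMPUTE_PROTO")
        def generate_sequences(self, prompts):
            return [p.upper() for p in prompts]

    # Later, a WorkerGroup-like binder can inspect the attached metadata.
    method = getattr(ToyWorker, "generate_sequences")
    print(getattr(method, MAGIC_ATTR))                 # {'dispatch_mode': 'DP_COMPUTE_PROTO', ...}
    print(ToyWorker().generate_sequences(["hello"]))   # behavior unchanged: ['HELLO']
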
--------------
Step 2: Binding During Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These attached attributes are extracted and utilized when
``ActorRolloutRefWorker``, wrapped in a ``RayClassWithArgs``, is passed
into a ``RayWorkerGroup``.
**Source:**
`main_generation.py <https://github.com/volcengine/verl/blob/4ae9a0fdab229f75f080e9478807783ed4c97154/verl/trainer/main_generation.py#L82>`__
.. code:: python
ray_cls_with_init = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker), config=config, role="rollout")
resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes)
wg = RayWorkerGroup(resource_pool=resource_pool, ray_cls_with_init=ray_cls_with_init)
During the
`initialization <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/ray/base.py#L184>`__
of ``RayWorkerGroup``, two key steps occur:
1. Worker instances (Ray actors) are created:
`RayWorkerGroup._init_with_resource_pool <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/ray/base.py#L211>`__
2. Methods decorated with ``@register`` are bound to ``RayWorkerGroup``:
`RayWorkerGroup._bind_worker_method <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/ray/base.py#L214>`__
.. figure:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/worker_group_init.png?raw=true
:alt: initialization_and_binding_of_worker_group
initialization_and_binding_of_worker_group
The binding procedure is the heart of ``verl.single_controller``.
**Key function:**
`WorkerGroup._bind_worker_method <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/worker_group.py#L143>`__
.. code:: python
def _bind_worker_method(self, user_defined_cls, func_generator):
...
for method_name in dir(user_defined_cls):
try:
method = getattr(user_defined_cls, method_name)
assert callable(method)
except Exception:
continue # Skip properties
<<<to be continue 1>>>
When a method has the ``MAGIC_ATTR``, the attributes set by
``@register`` are extracted:
.. code:: python
<<<continue 1>>>
if hasattr(method, MAGIC_ATTR):
attribute = getattr(method, MAGIC_ATTR)
dispatch_mode = attribute["dispatch_mode"]
execute_mode = attribute["execute_mode"]
blocking = attribute["blocking"]
<<<to be continue 2>>>
As shown in the flow chart above, these attributes are fed into
``func_generator``. However, ``func_generator`` takes ``method_name``,
``dispatch_fn``, ``collect_fn``, ``execute_fn``, and ``blocking``. We need
to find the ``dispatch_fn`` and ``collect_fn`` associated
with the ``dispatch_mode`` (``DP_COMPUTE_PROTO``) in
`DISPATCH_MODE_FN_REGISTRY <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L387>`__:
.. code:: python3
DISPATCH_MODE_FN_REGISTRY = {
Dispatch.ONE_TO_ALL: {
"dispatch_fn": dispatch_one_to_all,
"collect_fn": collect_all_to_all,
},
...
Dispatch.DP_COMPUTE_PROTO: {
"dispatch_fn": dispatch_dp_compute_data_proto,
"collect_fn": collect_dp_compute_data_proto,
},
...
}
Similarly, the ``execute_fn`` is selected by ``execute_mode`` and
extracted by:
.. code:: python
<<<continue 2>>>
# get execute_fn_name
execute_mode = get_predefined_execute_fn(execute_mode=execute_mode)
wg_execute_fn_name = execute_mode["execute_fn_name"]
# get execute_fn from string
try:
execute_fn = getattr(self, wg_execute_fn_name)
assert callable(execute_fn), "execute_fn must be callable"
except Exception:
print(f"execute_fn {wg_execute_fn_name} is invalid")
raise
<<<to be continue 3>>>
In the ``generate_sequences`` case:

- ``dispatch_mode = Dispatch.DP_COMPUTE_PROTO``
- ``dispatch_fn = dispatch_dp_compute_data_proto``
- ``collect_fn = collect_dp_compute_data_proto``
- ``execute_fn = RayWorkerGroup.execute_all``

ONE_TO_ALL vs. DP_COMPUTE_PROTO
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``dispatch_mode`` is associated with a ``dispatch_fn`` and a
``collect_fn``. As the name implies, ``dispatch_fn`` processes the input
arguments in the ``WorkerGroup`` and generates a batch (list) of input
arguments, each of which is fed into one of the workers attached to the
``WorkerGroup``.
``dispatch_fn`` of ``ONE_TO_ALL`` is
`dispatch_one_to_all <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L119>`__,
which just duplicates all the input arguments into N replicas, where N
equals the number of Workers attached to the ``worker_group``:
.. code:: python
def dispatch_one_to_all(worker_group, *args, **kwargs):
args = tuple([arg] * worker_group.world_size for arg in args)
kwargs = {k: [v] * worker_group.world_size for k, v in kwargs.items()}
return args, kwargs
``dispatch_fn`` of ``DP_COMPUTE_PROTO`` is
`dispatch_dp_compute_data_proto <https://github.com/volcengine/verl/blob/c59ab2f4788f9a910836a9f2f53dcdb62dfa314e/verl/single_controller/base/decorator.py#L350>`__,
which uses ``DataProto.chunk`` to split a large ``DataProto`` into N
smaller ``DataProto`` objects, where N equals the world_size (number of
workers) of the ``worker_group``:
.. code:: python
def dispatch_dp_compute_data_proto(worker_group, *args, **kwargs):
from verl.single_controller.base.worker_group import WorkerGroup
assert isinstance(worker_group, WorkerGroup)
# Note: enable auto padding for dp compute DatapProto
splitted_args, splitted_kwargs = _split_args_kwargs_data_proto_with_auto_padding(
worker_group.world_size,
*args,
**kwargs,
)
return splitted_args, splitted_kwargs
The ``collect_fn`` follows the same pattern: it processes a batch (list)
of values returned from all workers of a ``WorkerGroup`` and merges them
into a list, as ``collect_all_to_all`` does, or into a large ``DataProto``,
as ``collect_dp_compute_data_proto`` does.
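
As a toy illustration of the dispatch/collect pairing (a simplified sketch, not
verl's actual implementation), the snippet below duplicates inputs ``ONE_TO_ALL``-style
and collects per-worker outputs either as a raw list or by concatenation, which is
roughly what the ``DataProto``-based collector does with real batches:

.. code:: python

    class ToyWorkerGroup:
        """Stands in for a WorkerGroup; only world_size is needed here."""
        def __init__(self, world_size):
            self.world_size = world_size

    def dispatch_one_to_all(worker_group, *args, **kwargs):
        # Duplicate every argument once per worker.
        args = tuple([arg] * worker_group.world_size for arg in args)
        kwargs = {k: [v] * worker_group.world_size for k, v in kwargs.items()}
        return args, kwargs

    def collect_all_to_all(worker_group, outputs):
        # Keep the raw list of per-worker outputs.
        return outputs

    def collect_concat(worker_group, outputs):
        # Roughly what collect_dp_compute_data_proto does, but for plain
        # Python lists instead of DataProto objects.
        merged = []
        for out in outputs:
            merged.extend(out)
        return merged

    wg = ToyWorkerGroup(world_size=2)
    args, kwargs = dispatch_one_to_all(wg, ["prompt-a", "prompt-b"])
    print(args)                                          # the same batch, duplicated per worker
    print(collect_concat(wg, [["resp-0"], ["resp-1"]]))  # ['resp-0', 'resp-1']
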
Finally, a new method is dynamically generated using ``func_generator``
and added to the ``WorkerGroup`` instance:
.. code:: python
<<<continue 3>>>
# bind a new method to the RayWorkerGroup
func = func_generator(
self,
method_name,
dispatch_fn=dispatch_fn,
collect_fn=collect_fn,
execute_fn=execute_fn,
blocking=blocking,
)
try:
setattr(self, method_name, func)
method_names.append(method_name)
except Exception as e:
raise ValueError(f"Fail to set method_name {method_name}") from e
This makes the method invocable via the ``WorkerGroup`` interface.
--------------
Step 3: Call Chain
~~~~~~~~~~~~~~~~~~
All the machinery above ensures that distributed calls feel identical to
single-process ones. In the original single-process script, the code
looks like:
.. code:: python
rollout = Rollout()
rollout.generate_sequences(batch)
With ``verl``, the multiprocess program becomes:
.. code:: python
    rollout = RayWorkerGroup(RayResourcePool([4]), RayClassWithInitArgs(Rollout))
    rollout.generate_sequences(batch)
.. figure:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/call_generate_sequences.png?raw=true
:alt: call_chain_of_generate_sequences
call_chain_of_generate_sequences
Behind this simple call:

- ``dispatch_fn`` splits the input across workers
- ``execute_fn`` performs the actual remote invocation
- ``collect_fn`` gathers the results
All of this is abstracted away, enabling developers to write distributed
code with minimal changes to their existing logic.
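
To make the whole pipeline concrete, here is a minimal, self-contained sketch
(plain Python, no Ray; names are illustrative, not verl's actual code) of how a
``func_generator``-style binder can turn per-worker methods into a single
``WorkerGroup`` method by composing dispatch, execute and collect:

.. code:: python

    class ToyWorkerGroup:
        def __init__(self, workers):
            self.workers = workers
            self.world_size = len(workers)

        def execute_all(self, method_name, batched_args):
            # One "RPC" per worker; here it is just a local call.
            return [getattr(w, method_name)(arg) for w, arg in zip(self.workers, batched_args)]

    def func_generator(wg, method_name, dispatch_fn, collect_fn, execute_fn):
        def bound(*args):
            batched = dispatch_fn(wg, *args)          # split the input across workers
            outputs = execute_fn(method_name, batched)  # invoke every worker
            return collect_fn(wg, outputs)            # merge the per-worker results
        return bound

    class ToyRollout:
        def generate_sequences(self, prompts):
            return [p + " -> generated" for p in prompts]

    wg = ToyWorkerGroup([ToyRollout(), ToyRollout()])

    # Chunk the batch across the two workers, then concatenate the results.
    dispatch_fn = lambda wg, batch: [batch[: len(batch) // 2], batch[len(batch) // 2:]]
    collect_fn = lambda wg, outs: [item for out in outs for item in out]

    setattr(wg, "generate_sequences",
            func_generator(wg, "generate_sequences", dispatch_fn, collect_fn, wg.execute_all))

    print(wg.generate_sequences(["q1", "q2", "q3", "q4"]))
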
--------------
Beyond RL Post-Training: Generalizing ``verl.single_controller``
----------------------------------------------------------------
The ``verl.single_controller`` module generalizes well beyond
reinforcement learning. It provides a clean abstraction to batch-process
remote method calls, with automatic input/output handling.
By minimizing the gap between single-process and multi-process scripts,
``verl.single_controller`` opens the door to distributed computing in
broader domains — not limited to RL post-training.
We hope this design inspires more examples and extensions from the
community.
Agentic RL Training
===================
Last updated: 07/15/2025.
Overview
----------
The goal of Agentic RL is to improve the performance of the backend model that powers an Agent through reinforcement learning. To support this training process, a series of features have been developed:
1. Server-based asynchronous rollout
2. Multi-turn conversations and tool calls
3. LangGraph-based Agent
This document explains the underlying system design and its usage to help users implement Agentic RL.
Server-based Asynchronous Rollout
---------------------------------
Since Agents need to interact with the environment through various tool calls, an asyncio-based coroutine mechanism is used to execute each rollout request asynchronously, which avoids GPU idling while waiting for tool-call results and thereby improves training performance. To support asynchronous rollout, the inference engine (server) and the agent (client) are architecturally separated, implementing a server-based system with the following objectives:

1. Enabling load-balancing mechanisms across multiple GPUs and reducing the impact of long-tail requests on performance. For this purpose, scheduling capabilities in stream mode (recipe/stream_mode) are implemented as a recipe.
2. Preventing agent-specific features, such as tracing, from affecting the inference engine.
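
The following minimal sketch (plain asyncio, independent of verl's actual classes)
illustrates the idea described above: while one rollout request awaits a slow tool
call, other requests keep the inference server busy instead of leaving the GPU idle.

.. code-block:: python

    import asyncio

    async def call_tool(query: str) -> str:
        # Stand-in for a slow external tool (e.g. a retrieval or code-execution service).
        await asyncio.sleep(1.0)
        return f"tool result for {query!r}"

    async def generate(prompt: str) -> str:
        # Stand-in for an asynchronous inference-server call.
        await asyncio.sleep(0.1)
        return f"response to {prompt!r}"

    async def rollout(prompt: str) -> str:
        first_turn = await generate(prompt)
        tool_result = await call_tool(first_turn)   # the engine is free to serve others here
        return await generate(prompt + tool_result)

    async def main():
        # All rollout requests run concurrently as coroutines.
        results = await asyncio.gather(*(rollout(f"question-{i}") for i in range(8)))
        print(len(results), "rollouts finished")

    asyncio.run(main())
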
System Architecture
~~~~~~~~~~~~~~~~~~~
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop.png?raw=true
For more detail on internal design, please refer to :doc:`Agent Loop<../advance/agent_loop>`.
System Components
~~~~~~~~~~~~~~~~~
+--------------------------+----------------------------------------------------------------------------+
| Component | Role |
+==========================+============================================================================+
| AgentLoop | Client, implements Agent functions |
+--------------------------+----------------------------------------------------------------------------+
| AsyncLLMServerManager | Inference gateway, provides generate interface for AgentLoop |
+--------------------------+----------------------------------------------------------------------------+
| AsyncServer | Server, each instance is connected to one DP group of the inference engine |
+--------------------------+----------------------------------------------------------------------------+
**"generate" Interface**
The "generate" function based on ray actor is used between the Client and Server instead of the standard chat completion API. This is because the conversion between tokens and text can be irreversible. For example, the token converted from "<think>" will be different from that generated by the LLM. During the training phase, it is necessary to strictly use the tokens generated by LLM inference to avoid inaccurate in computing advantage, which may affect model performance. Having the Server provide a token-based API helps the Client maintain the relationship between the text generated by tool calls and the tokens returned by the LLM, so as to output correct tokens for training.
**Inference Engine Adaptation**
AsyncServer uniformly provides a generate function to the upper layer, with separate implementations for SGLang and vLLM to hide underlying differences:
1. The SGLang AsyncServer uses the async_generate interface of the SGLang engine, which is located on the first GPU of each TP group. Therefore, AsyncServer needs to remotely call async_generate through ray actor.
2. The vLLM AsyncServer uses the generate interface of the vLLM engine, which can communicate with the GPUs in the TP group through ZMQ and can be directly called in AsyncServer.
Usage Example
~~~~~~~~~~~~~
Follow :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints.
There are two options required to use agent loop:
- `data.return_raw_chat=True`
- `actor_rollout_ref.rollout.mode=async`
This example uses the SGLang inference engine by default; you can also change the rollout name to use vLLM.
.. code-block:: bash
bash examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
Multi-turn Conversations and Tool Calls
---------------------------------------
Follow :doc:`Multi-turn Rollout Support<../sglang_multiturn/multiturn>` to prepare tool and configuration files.
The Tool Agent Loop has an additional requirement: an "agent_name" field must be added to the dataset. During rollout, it chooses between tool_agent_loop and single_turn_agent (the default) based on this field.
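
A minimal preprocessing sketch for adding such a field is shown below. The value
``"tool_agent"`` and the parquet paths are assumptions for illustration only;
check the preprocessing script in the usage example below for the exact value your
configuration expects.

.. code-block:: python

    import os
    import datasets

    # Load an existing preprocessed split (path is an assumption for illustration).
    path = os.path.expanduser("~/data/gsm8k/train.parquet")
    train_dataset = datasets.Dataset.from_parquet(path)

    def add_agent_name(example):
        # The rollout picks tool_agent_loop vs. single_turn_agent based on this field.
        example["agent_name"] = "tool_agent"   # assumed registered agent name
        return example

    train_dataset = train_dataset.map(add_agent_name)
    train_dataset.to_parquet(os.path.expanduser("~/data/gsm8k/train_with_agent_name.parquet"))
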
Usage Example
~~~~~~~~~~~~~
.. code-block:: bash
# install mlflow to view toolcall and llm trace
pip install mlflow
# This will download and preprocess the GSM8K dataset into ~/data/gsm8k/ and add the "agent_name" field.
python examples/data_preprocess/gsm8k_tool_agent_loop.py
    # Start training with tool calls and mlflow-based tracing enabled to help debug the rollout details
bash examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_tool_agent_mlflow.sh
# When training is done, start a mlflow server to view trace
mlflow ui -h 0.0.0.0 -p 5000 --backend-store-uri sqlite:////tmp/mlruns.db
# then you can open http://<your ip address>:5000 from browser to view trace
Note: During training, the model may occasionally fail to generate correct toolcall tags; in that case an error message "Failed to decode tool call" is printed to the console, which does not indicate an abnormality in training.
Follow :doc:`Rollout trace<../advance/rollout_trace>` to learn more about the trace feature.
Agent Framework
---------------
System Architecture
~~~~~~~~~~~~~~~~~~~
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/langgraph_agent.png?raw=true
System Components
~~~~~~~~~~~~~~~~~
+--------------------------+-----------------------------------------------------------------------------------------------+
| Component | Role |
+==========================+===============================================================================================+
| ChatModel | LLM object of LangChain, used to adapt to the “generate” api provided by AsyncLLMServerManager|
+--------------------------+-----------------------------------------------------------------------------------------------+
| RectAgentLoop | Agent adaptation layer, which by default supports a naive LangGraph Agentic. |
| | New classes can be derived to support user-defined Agents, and the run function needs to be |
| | implemented to complete Agent calls. |
+--------------------------+-----------------------------------------------------------------------------------------------+
| AsyncServer | Server, each instance is connected to one DP group of the inference engine. |
+--------------------------+-----------------------------------------------------------------------------------------------+
Follow doc "recipe/langgraph_agent/example/README.md" for more details.
Installation
============
Requirements
------------
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1
verl supports various backends. Currently, the following configurations are available:
- **FSDP** and **Megatron-LM** (optional) for training.
- **SGLang**, **vLLM** and **TGI** for rollout generation.
Choices of Backend Engines
----------------------------
1. Training:
We recommend using **FSDP** backend to investigate, research and prototype different models, datasets and RL algorithms. The guide for using FSDP backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.
For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.12.2 <https://github.com/NVIDIA/Megatron-LM/tree/core_v0.12.2>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
2. Inference:
For inference, vllm 0.8.3 and later versions have been tested for stability. We recommend turning on env var `VLLM_USE_V1=1` for optimal performance.
For SGLang, refer to the :doc:`SGLang Backend<../workers/sglang_worker>` for detailed installation and usage instructions. SGLang rollout is under extensive development and offers many advanced features and optimizations. We encourage users to report any issues or provide feedback via the `SGLang Issue Tracker <https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/106>`_.
The huggingface TGI integration is usually used for debugging and single-GPU exploration.
Install from docker image
-------------------------
We provide pre-built Docker images for quick setup. Starting from this version,
we use a new image release hierarchy for productivity and stability.
The images are divided into three categories:

- **Base Image**: Only basic dependencies are installed, without inference or training frameworks.
  You can install vLLM or SGLang directly on top of it, without reinstalling torch or CUDA.
- **Application Image**: Stable version with inference and training frameworks installed.
- **Community Image**: Unstable version with the latest frameworks and features.
The first two types of images are hosted in the `verlai/verl <https://hub.docker.com/r/verlai/verl>`_ repository on Docker Hub, while the community (preview) images are hosted in community repositories.
.. note::
The image versions are mapped with verl releases, for example, image with tag ``verl0.4`` is built for verl release ``v0.4.x``.
Base Image
::::::::::
The stable base images are ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4`` and ``verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4``, with different PyTorch versions for vLLM and SGLang. The installed package versions can be found in the image tags, and the Dockerfile can be found in ``docker/verl[version]-[packages]/Dockerfile.base``.
The base image is updated infrequently, and application images can be built on top of it without reinstalling the base packages.
Application Image
:::::::::::::::::
From this version, we provide separate images for vLLM and SGLang because of the divergence in dependent packages such as PyTorch and FlashInfer.
The following application images are available:
- **vLLM with FSDP and Megatron**: ``verlai/verl:app-verl0.5-vllm0.9.1-mcore0.12.2-te2.2``
- **SGLang with FSDP and Megatron**: ``verlai/verl:app-verl0.5-sglang0.4.8-mcore0.12.2-te2.2``
Docker images with the Megatron backend can run post-training for large language models such as ``Qwen/Qwen3-235B-A22B`` and ``deepseek-ai/DeepSeek-V3-0324``. Refer to the :doc:`Large Language Model Post-Training documentation<../perf/dpsk>` for more details.
Application images can be updated frequently, and the Dockerfile can be found in ``docker/verl[version]-[packages]/Dockerfile.app.[frameworks]``. Based on the base image, it is easy to build your own application image with the desired inference and training frameworks.
Community Image
:::::::::::::::
Community images are provided by the community and include the latest versions of vLLM and SGLang; they may contain experimental features or configurations. They also cover other hardware and platforms, such as AMD GPUs with ROCm, or AWS EFA and SageMaker.
For latest vLLM with FSDP, please refer to `hiyouga/verl <https://hub.docker.com/r/hiyouga/verl>`_ repository and the latest version is ``hiyouga/verl:ngc-th2.6.0-cu126-vllm0.8.4-flashinfer0.2.2-cxx11abi0``.
For latest SGLang with FSDP, please refer to `ocss884/verl-sglang <https://hub.docker.com/r/ocss884/verl-sglang>`_ repository and the latest version is ``ocss884/verl-sglang:ngc-th2.6.0-cu126-sglang0.4.6.post5`` which is provided by SGLang RL Group.
See files under ``docker/`` for NGC-based image or if you want to build your own.
Note that for AWS instances with an EFA network interface (SageMaker AI Pod),
you need to install the EFA driver as shown in ``docker/Dockerfile.extenstion.awsefa``.
Installation from Docker
::::::::::::::::::::::::
After pulling the desired Docker image and installing the desired inference and training frameworks, you can run it with the following steps:
1. Launch the Docker image and attach to it:
.. code:: bash
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag> sleep infinity
docker start verl
docker exec -it verl bash
2. If you use the images provided, you only need to install verl itself without dependencies:
.. code:: bash
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install --no-deps -e .
[Optional] If you want to switch between different frameworks, you can install verl with the following command:
.. code:: bash
# install the nightly version (recommended)
git clone https://github.com/volcengine/verl && cd verl
pip3 install -e .[vllm]
pip3 install -e .[sglang]
Install from custom environment
---------------------------------------------
We recommend using docker images for convenience. However, if your environment is not compatible with the docker images, you can also install verl in a Python environment.
Pre-requisites
::::::::::::::
For training and inference engines to make use of better and faster hardware support, CUDA/cuDNN and other dependencies are required.
Some of these dependencies are easily overridden when installing other packages,
so we put them in the :ref:`Post-installation` step.
.. note::
    The installation steps below are the recommended configuration for the latest version of verl.
    If you are customizing your own environment, you may relax these strict version constraints.
We need to install the following pre-requisites:
- **CUDA**: Version >= 12.4
- **cuDNN**: Version >= 9.8.0
- **Apex**
CUDA 12.4 or above is recommended, matching the docker image;
please refer to `NVIDIA's official website <https://developer.nvidia.com/cuda-toolkit-archive>`_ for other versions of CUDA.
.. code:: bash
    # change directory to anywhere you like; installing inside the verl source code directory is not recommended
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cuda-toolkit-12-4
update-alternatives --set cuda /usr/local/cuda-12.4
cuDNN can be installed via the following command;
please refer to `NVIDIA's official website <https://developer.nvidia.com/rdp/cudnn-archive>`_ for other versions of cuDNN.
.. code:: bash
    # change directory to anywhere you like; installing inside the verl source code directory is not recommended
wget https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cudnn-cuda-12
NVIDIA Apex is required for Megatron-LM and FSDP training.
You can install it via the following command, but note that this step can take a very long time.
It is recommended to set the ``MAX_JOBS`` environment variable to accelerate the installation,
but do not set it too large, otherwise the memory will be overloaded and your machine may hang.
.. code:: bash
    # change directory to anywhere you like; installing inside the verl source code directory is not recommended
git clone https://github.com/NVIDIA/apex.git && \
cd apex && \
    MAX_JOBS=32 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
Install dependencies
::::::::::::::::::::
.. note::
    We recommend using a fresh conda environment to install verl and its dependencies.
**Notice that the inference frameworks often strictly limit your pytorch version and will directly override your installed pytorch if not paying enough attention.**
As a countermeasure, it is recommended to install inference frameworks first with the pytorch they needed. For vLLM, if you hope to use your existing pytorch,
please follow their official instructions
`Use an existing PyTorch installation <https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#build-wheel-from-source>`_ .
1. First of all, to manage environment, we recommend using conda:
.. code:: bash
conda create -n verl python==3.10
conda activate verl
2. Then, execute the installation script that we provide in verl:
.. code:: bash
# Make sure you have activated verl conda env
# If you need to run with megatron
bash scripts/install_vllm_sglang_mcore.sh
# Or if you simply need to run with FSDP
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
If you encounter errors in this step, please check the script and manually follow the steps in the script.
Install verl
::::::::::::
For installing the latest version of verl, the best way is to clone and
install it from source. Then you can modify our code to customize your
own post-training jobs.
.. code:: bash
git clone https://github.com/volcengine/verl.git
cd verl
pip install --no-deps -e .
Post-installation
:::::::::::::::::
Please make sure that the installed packages are not overridden during the installation of other packages.
The packages worth checking are:
- **torch** and torch series
- **vLLM**
- **SGLang**
- **pyarrow**
- **tensordict**
- **nvidia-cudnn-cu12**: for the Megatron backend
If you encounter package version issues while running verl, please update the outdated ones.
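
A small sketch like the following can be used to print the installed versions of
these packages after setup (the package names below are the usual distribution
names; adjust them for your environment):

.. code:: python

    from importlib.metadata import PackageNotFoundError, version

    # Distribution names as they typically appear on PyPI; adjust as needed.
    packages = ["torch", "vllm", "sglang", "pyarrow", "tensordict", "nvidia-cudnn-cu12"]

    for name in packages:
        try:
            print(f"{name}: {version(name)}")
        except PackageNotFoundError:
            print(f"{name}: not installed")
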
Install with AMD GPUs - ROCM kernel support
------------------------------------------------------------------
When you run on AMD GPUs (MI300) with the ROCm platform, you cannot use the previous quickstart to run verl. You need to follow the steps below to build a docker image and run it.
If you encounter any issues using AMD GPUs to run verl, feel free to contact `Yusheng Su <https://yushengsu-thu.github.io/>`_.
Find the docker for AMD ROCm: `docker/Dockerfile.rocm <https://github.com/volcengine/verl/blob/main/docker/Dockerfile.rocm>`_
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
.. code-block:: bash
# Build the docker in the repo dir:
# docker build -f docker/Dockerfile.rocm -t verl-rocm:03.04.2015 .
# docker images # you can find your built docker
FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
# Set working directory
# WORKDIR $PWD/app
# Set environment variables
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
# Install vllm
RUN pip uninstall -y vllm && \
rm -rf vllm && \
git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
cd vllm && \
MAX_JOBS=$(nproc) python3 setup.py install && \
cd .. && \
rm -rf vllm
# Copy the entire project directory
COPY . .
# Install dependencies
RUN pip install "tensordict<0.6" --no-deps && \
pip install accelerate \
codetiming \
datasets \
dill \
hydra-core \
liger-kernel \
numpy \
pandas \
datasets \
peft \
"pyarrow>=15.0.0" \
pylatexenc \
"ray[data,train,tune,serve]" \
torchdata \
transformers \
wandb \
orjson \
pybind11 && \
pip install -e . --no-deps
Build the image
::::::::::::::::::::::::
.. code-block:: bash
docker build -t verl-rocm .
Launch the container
::::::::::::::::::::::::::::
.. code-block:: bash
docker run --rm -it \
--device /dev/dri \
--device /dev/kfd \
-p 8265:8265 \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME/.ssh:/root/.ssh \
-v $HOME:$HOME \
--shm-size 128G \
-w $PWD \
verl-rocm \
/bin/bash
If you do not want to run in root mode and want to run as your own user,
please add ``-e HOST_UID=$(id -u)`` and ``-e HOST_GID=$(id -g)`` to the docker launch command above.
verl on AMD GPUs currently supports FSDP as the training engine, and vLLM and SGLang as the inference engines. We will support Megatron in the future.
More Resources
==============
Last updated: 06/30/2025.
- Introduction to verl (`Slides <https://tongyx361.github.io/blogs/posts/verl-intro>`_)
- verl Code Walkthrough (`Slides <https://tongyx361.github.io/blogs/posts/verl-tutorial>`_, `Talk in Chinese <https://hcqnc.xetlk.com/sl/3vACOK>`_)
Multinode Training
==================
Last updated: 06/10/2025.
.. _wuxibin89: https://github.com/wuxibin89
Author: `Xibin Wu <https://github.com/wuxibin89>`_, `Yusheng Su <https://yushengsu-thu.github.io/>`_.
Manual
------
Set up multinode ray cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Start the head node with ``ray start --head --dashboard-host=0.0.0.0``; there are two addresses you should care about:
- GCS address: ``ray start --address=<address>``, where worker node should connect to.
- Dashboard address: ``<address>:8265``, where you should submit job to the cluster.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/head.png?raw=true
2. Start the worker nodes with the ``ray start --address=<address>`` command you got above.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/worker.png?raw=true
3. Now you should see that the cluster has 2 nodes when you run ``ray status``.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/status.png?raw=true
4. Additionally, you can access the dashboard in a browser at the address you got above.
   *Firewall rules may need to be configured to access the dashboard; if you run into any trouble, please contact your network administrator.*
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/overview.png?raw=true
Submit job to ray cluster
~~~~~~~~~~~~~~~~~~~~~~~~~
1. Submit a ray job to the cluster with the dashboard address you got above.
.. code-block:: bash
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env=verl/trainer/runtime_env.yaml \
--no-wait \
-- \
python3 -m verl.trainer.main_ppo \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
...
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/submit.png?raw=true
2. Then you can check the job status with the following commands:
   - ``ray job list``: list all jobs submitted to the cluster.
   - ``ray job logs <Submission ID>``: query the logs of the job.
   - ``ray job status <Submission ID>``: query the status of the job.
   - ``ray job stop <Submission ID>``: request the job to be stopped.
3. You can also access driver/task/actor logs in ``/tmp/ray/session_latest/logs/``; the driver log is ``job-driver-raysubmit_<Submission ID>.log``.
4. We strongly recommend viewing job details from the dashboard during multinode training, because it provides a more structured way to inspect job information.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job.png?raw=true
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job_detail.png?raw=true
Slurm
-----
TBD
dstack
------
`dstackai/dstack <https://github.com/dstackai/dstack>`_ is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments
without the need to use K8S or Slurm.
Prerequisite
~~~~~~~~~~~~
Once dstack is `installed <https://dstack.ai/docs/installation>`_, initialize the directory as a repo with ``dstack init``.
.. code-block:: bash
mkdir myproject && cd myproject
dstack init
**Create a fleet**
Before submitting distributed training jobs, create a `dstack` `fleet <https://dstack.ai/docs/concepts/fleets>`_.
Run a Ray cluster task
~~~~~~~~~~~~~~~~~~~~~~
Once the fleet is created, define a Ray cluster task, e.g. in ``ray-cluster.dstack.yml``:
.. code-block:: yaml
type: task
name: ray-verl-cluster
nodes: 2
env:
- WANDB_API_KEY
- PYTHONUNBUFFERED=1
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
image: whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.2
commands:
- git clone https://github.com/volcengine/verl
- cd verl
- pip install --no-deps -e .
- pip install hf_transfer hf_xet
- |
if [ $DSTACK_NODE_RANK = 0 ]; then
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')"
ray start --head --port=6379;
else
ray start --address=$DSTACK_MASTER_NODE_IP:6379
fi
# Expose Ray dashboard port
ports:
- 8265
resources:
gpu: 80GB:8
shm_size: 128GB
# Save checkpoints on the instance
volumes:
- /checkpoints:/checkpoints
Now, if you run this task via `dstack apply`, it will automatically forward Ray's dashboard port to `localhost:8265`.
.. code-block:: bash
dstack apply -f ray-cluster.dstack.yml
As long as `dstack apply` is attached, you can use `localhost:8265` to submit Ray jobs for execution.
Submit Ray jobs
~~~~~~~~~~~~~~~
Before you can submit Ray jobs, make sure `ray` is installed locally:
.. code-block:: shell
pip install ray
Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``:
.. code-block:: shell
    $ export RAY_ADDRESS=http://localhost:8265
$ ray job submit \
-- python3 -m verl.trainer.main_ppo \
data.train_files=/root/data/gsm8k/train.parquet \
data.val_files=/root/data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.project_name=ppo_training \
trainer.experiment_name=qwen-2.5-7B \
trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=10 \
trainer.test_freq=10 \
    trainer.total_epochs=15 \
    trainer.resume_mode=disable 2>&1 | tee verl_demo.log
For more details on how `dstack` works, check out its `documentation <https://dstack.ai/docs>`_.
How to debug?
---------------------
Ray Distributed Debugger VSCode Extension (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Starting with Ray 2.39, Anyscale has introduced the `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_ VSCode extension. Follow the extension’s installation instructions, then add your cluster using the dashboard URL you obtained earlier.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true
:alt: Ray Distributed Debugger VSCode extension screenshot
2. Prerequisites.
Ensure the following are installed (see the extension README for more detail):
- Visual Studio Code
- `ray[default]` >= 2.9.1
- `debugpy` >= 1.8.0
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/c7098b755ff689859837773a916c857.png?raw=true
:alt: VSCode with Ray prerequisites
3. Environment Variables.
To enable post‑mortem debugging, set:
.. code-block:: bash
export RAY_DEBUG_POST_MORTEM=1
.. admonition:: Note
:class: important
Be sure to remove any legacy flags before starting Ray:
- `RAY_DEBUG=legacy`
- `--ray-debugger-external`
4. Configuring Breakpoints. Set up ``breakpoint()`` in your code and submit the job to the cluster; the extension will then show the breakpoint information.
1. Insert `breakpoint()` calls into your remote functions.
2. Submit your job to the cluster.
The extension will detect active breakpoints and display them in VSCode.
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true
:alt: Detected breakpoint in VSCode
**Note:** Breakpoints are only supported inside functions decorated with `@ray.remote`.
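
A minimal ``job.py`` sketch that satisfies this requirement might look as follows
(the task and its contents are illustrative only):

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def my_task(x):
        y = x * 2
        breakpoint()   # the debugger can attach here once the task hits this line
        return y

    print(ray.get(my_task.remote(21)))
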
5. Launching the Debugger.
Run your job directly from the command line (do not use a `launch.json`):
.. code-block:: bash
python job.py
6. Attaching to a Breakpoint.
Once the process hits the first `breakpoint()`, click the Ray Distributed Debugger icon in the VSCode sidebar to attach the debugger.
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true
:alt: Attaching VSCode debugger to Ray process
7. Debugging With Multiple breakpoint().
For each subsequent task, first disconnect the current debugger session, then click the extension icon again to attach to the next breakpoint.
.. image:: https://github.com/aoshen524/verl/blob/main/docs/start/6e83c910a62c82fecb89c6619e001cd.png?raw=true
:alt: Disconnecting and reconnecting the debugger
Legacy Ray Debugger
~~~~~~~~~~~~~~~~~~~
1. Ray has a builtin legacy `debugger <https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html>`_ that allows you to debug your distributed applications. To enable debugger, start ray cluster with ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``.
.. code-block:: bash
# start head node
RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external
# start worker node
RAY_DEBUG=legacy ray start --address='10.124.46.192:6379' --ray-debugger-external
2. Set a breakpoint in your code, and submit the job to the cluster. Then run ``ray debug`` to wait at the breakpoint:
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/legacy.png?raw=true
Multi-node training on AMD clusters
---------------------------------------------------------------------------------------
If you want to run multi-node training with Slurm using a Docker/Podman container on an AMD cluster, you can use the following script.
If you encounter any issues using AMD GPUs to run verl, please contact `Yusheng Su <https://yushengsu-thu.github.io/>`_.
.. note::
1. You need to use ``podman`` or ``docker`` in the following script. We will release the apptainer script later.
2. If you want to use ``podman``, you just replace ``docker`` with ``podman`` in the following script.
The script includes the following steps:
1. SLURM Configuration
2. Environment Setup
3. Docker/Podman Container Setup
4. Ray Cluster Initialization
5. Data Preprocessing
6. Model Setup
7. Training Launch
slurm_script.sh
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=verl-ray-on-slurm
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=200G
#SBATCH --time=30-00:00:00
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=28
#SBATCH --output=../verl_log/slurm-%j.out
#SBATCH --error=../verl_log/slurm-%j.err
#SBATCH --nodelist=gpu-[0,1]
# load necessary modules
### Run this setup
# [Cluster]: Use docker
# docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
##########################################################################
###The following setting should be set in different project and cluster###
##########################################################################
### Project
CONTAINER_NAME="multinode_verl_training"
IMG="verl.rocm"
DOCKERFILE="docker/Dockerfile.rocm"
# echo $PWD
verl_workdir="${HOME}/projects/verl_upstream"
export TRANSFORMERS_CACHE="${HOME}/.cache/huggingface"
export HF_HOME=$TRANSFORMERS_CACHE
### Cluster Network Setting
export NCCL_DEBUG=TRACE
export GPU_MAX_HW_QUEUES=2
export TORCH_NCCL_HIGH_PRIORITY=1
export NCCL_CHECKS_DISABLE=1
# export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9
export NCCL_IB_GID_INDEX=3
export NCCL_CROSS_NIC=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_PROTO=Simple
export RCCL_MSCCL_ENABLE=0
export TOKENIZERS_PARALLELISM=false
export HSA_NO_SCRATCH_RECLAIM=1
##########################################################################
### For rocm and training script
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
# Build and launch the Docker container
srun bash -c "
# Exit on any error
set -e
# Clean up dangling images (images with <none> tag)
docker image prune -f
# Need to pull the docker first
docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
if ! docker images --format "{{.Repository}}:{{.Tag}}" | grep -q "${IMG}"; then
echo \"Building ${IMG} image...\"
docker build -f \"${DOCKERFILE}\" -t \"${IMG}\" .
else
echo \"${IMG} image already exists, skipping build\"
fi
# Removing old container if exists
docker rm \"${CONTAINER_NAME}\" 2>/dev/null || true
# Checking network devices
ibdev2netdev
# Launch the docker
docker run --rm -d \
-e HYDRA_FULL_ERROR=1 \
-e HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES} \
-e ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES} \
-e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
-e NCCL_DEBUG=${NCCL_DEBUG} \
-e GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES} \
-e TORCH_NCCL_HIGH_PRIORITY=${TORCH_NCCL_HIGH_PRIORITY} \
-e NCCL_CHECKS_DISABLE=${NCCL_CHECKS_DISABLE} \
-e NCCL_IB_HCA=${NCCL_IB_HCA} \
-e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX} \
-e NCCL_CROSS_NIC=${NCCL_CROSS_NIC} \
-e CUDA_DEVICE_MAX_CONNECTIONS=${CUDA_DEVICE_MAX_CONNECTIONS} \
-e NCCL_PROTO=${NCCL_PROTO} \
-e RCCL_MSCCL_ENABLE=${RCCL_MSCCL_ENABLE} \
-e TOKENIZERS_PARALLELISM=${TOKENIZERS_PARALLELISM} \
-e HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM} \
-e TRANSFORMERS_CACHE=${TRANSFORMERS_CACHE} \
-e HF_HOME=${HF_HOME} \
--network host \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v \${HOME}:\${HOME} \
-v \${HOME}/.ssh:/root/.ssh \
-w "${verl_workdir}" \
--shm-size 128G \
--name \"${CONTAINER_NAME}\" \
\"${IMG}\" \
tail -f /dev/null
echo \"Container setup completed\"
"
    # (Optional): If you do not want root mode and want to run as your own user,
    # please add `-e HOST_UID=$(id -u)` and `-e HOST_GID=$(id -g)` into the above docker launch script.
### Ray launch the nodes before training
# Getting the node names
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST" | tr '\n' ' '))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
# make sure we set environment variables before Ray initialization
# Print out all env variables
printenv
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--dashboard-port=8266 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10
# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Debug: Starting worker on node_i = ${node_i}"
if [ -z "$node_i" ]; then
echo "Error: Empty node name for worker $i"
continue
fi
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
docker exec "${CONTAINER_NAME}" \
ray start --address "$ip_head" --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
sleep 5
done
    # Ray initialization test (see whether there were any errors in the above execution)
echo "Testing Ray initialization in the slurm nodes..."
docker exec "${CONTAINER_NAME}" python3 -c '
import ray
try:
ray.init(address="auto")
print("\n=== Ray Cluster Status ===")
print(f"Number of nodes: {len(ray.nodes())}")
for node in ray.nodes():
print("Node: {}, Status: {}".format(node["NodeManagerHostname"], node["Alive"]))
# print(f"Node: {node}")
ray.shutdown()
print("Ray initialization successful!")
except Exception as e:
print(f"Ray initialization failed: {str(e)}")
'
echo "=== Ray test completed ==="
######
# Run data preprocessing
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/gsm8k.py" "--local_dir" "../data/gsm8k"
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/math_dataset.py" "--local_dir" "../data/math"
train_files="../data/gsm8k/train.parquet"
val_files="../data/gsm8k/test.parquet"
# Download and test model
echo "Loading model..."
docker exec "${CONTAINER_NAME}" \
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
MODEL_PATH="Qwen/Qwen2-7B-Instruct"
# Set model path after pipeline test
MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct"
echo "== Data and model loading Done =="
echo "Start to train..."
docker exec "${CONTAINER_NAME}" \
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
MODEL_PATH="Qwen/Qwen2-7B-Instruct"
PYTHONUNBUFFERED=1 srun --overlap --nodes=${SLURM_NNODES} --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
python3 -m verl.trainer.main_ppo \
data.train_files=$train_files \
data.val_files=$val_files \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=$MODEL_PATH \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.kl_ctrl.kl_coef=0.0001 \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=${SLURM_GPUS_PER_NODE} \
trainer.val_before_train=False \
trainer.nnodes=${SLURM_NNODES} \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15
Run multi-node training with the above slurm_script.sh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Simply submit your slurm_script.sh with ``sbatch``:
.. code-block:: bash
sbatch slurm_script.sh
.. _quickstart:
=========================================================
Quickstart: PPO training on GSM8K dataset
=========================================================
Post-train an LLM using the GSM8K dataset.
Introduction
------------
.. _hf_dataset_gsm8k: https://huggingface.co/datasets/gsm8k
In this example, we train an LLM to tackle the `GSM8k <hf_dataset_gsm8k_>`_ task with function-based rewards. [1]_
Prerequisite:
- the latest version of ``verl`` and its dependencies installed following the installation guide. Using the docker image is recommended.
- a GPU with at least 24 GB HBM
Dataset Introduction
--------------------
GSM8k is a math problem dataset. The prompt is an elementary school
problem. The LLM model is asked to solve the math problem. Below is an example:
Prompt
Katy makes coffee using teaspoons of sugar and cups of water in the
ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups
of water, calculate the number of teaspoonfuls of sugar she used.
Solution
The total ratio representing the ingredients she used to make the
coffee is 7+13 = <<7+13=20>>20 Since the fraction representing the
number of teaspoons she used is 7/20, she used 7/20\ *120 =
<<7/20*\ 120=42>>42 #### 42
Step 1: Prepare the dataset
----------------------------
We preprocess the dataset in parquet format so that (1) it contains necessary fields for computing RL rewards and (2) is faster to read.
.. code-block:: bash
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
Step 2: Download a model for post-training
-------------------------------------------
In this example, we start with the ``Qwen2.5-0.5B-Instruct`` model.
If you want to perform SFT before RL, refer to the :doc:`Complete GSM8K Example<../examples/gsm8k_example>`, the `sft directory <https://github.com/volcengine/verl/blob/main/examples/sft/gsm8k>`_ and `SFT Trainer <https://github.com/volcengine/verl/blob/main/verl/trainer/fsdp_sft_trainer.py>`_ for further details.
.. code-block:: bash
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')"
Step 3: Perform PPO training with the instruct model
----------------------------------------------------------------------
**Reward Model/Function**
We use a pre-defined rule-based reward model. We force the model to produce its final
answer after four ``#`` characters (``####``), as shown in the solution. We extract the final
answer from both the solution and the model's output using regular
expression matching. We assign a reward of 1 to a correct
answer and 0 to an incorrect or missing answer.
For more details, please refer to `verl/utils/reward_score/gsm8k.py <https://github.com/volcengine/verl/blob/v0.4.1/verl/utils/reward_score/gsm8k.py>`_.
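
A simplified sketch of such a rule-based reward (not the exact implementation in
``gsm8k.py``) could look like this:

.. code-block:: python

    import re

    def compute_score(solution_str: str, ground_truth: str) -> float:
        """Extract the answer after '####' from the model output and compare it to the ground truth."""
        match = re.search(r"####\s*(-?[\d\.,]+)", solution_str)
        if match is None:
            return 0.0                                   # no answer found
        answer = match.group(1).replace(",", "")
        return 1.0 if answer == ground_truth else 0.0

    print(compute_score("... so the result is #### 42", "42"))  # 1.0
    print(compute_score("I am not sure.", "42"))                # 0.0
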
**Training Script**
Now let's run PPO training with the dataset and model above. [2]_
Set ``data.train_files``, ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on your dataset and model names or paths.
You may set ``VERL_USE_MODELSCOPE=True`` to download models from `modelscope <https://www.modelscope.cn>`_ instead of `huggingface <https://huggingface.co>`_.
.. code-block:: bash
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=console \
trainer.val_before_train=False \
trainer.n_gpus_per_node=1 \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
You are expected to see the following logs, indicating training in progress. The key metric ``val/test_score/openai/gsm8k`` is computed every ``trainer.test_freq`` steps:
.. code-block:: bash
step:0 - timing/gen:21.470 - timing/ref:4.360 - timing/values:5.800 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - timing/adv:0.109 - timing/update_critic:15.664 - critic/vf_loss:14.947 - critic/vf_clipfrac:0.000 - critic/vpred_mean:-2.056 - critic/grad_norm:1023.278 - critic/lr(1e-4):0.100 - timing/update_actor:20.314 - actor/entropy_loss:0.433 - actor/pg_loss:-0.005 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:1.992 - actor/lr(1e-4):0.010 - critic/score/mean:0.004 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.004 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:2.360 - critic/advantages/min:-2.280 - critic/returns/mean:0.003 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.045 - critic/values/max:9.500 - critic/values/min:-14.000 - response_length/mean:239.133 - response_length/max:256.000 - response_length/min:77.000 - prompt_length/mean:104.883 - prompt_length/max:175.000 - prompt_length/min:68.000
step:1 - timing/gen:23.020 - timing/ref:4.322 - timing/values:5.953 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty:0.001 - timing/adv:0.118 - timing/update_critic:15.646 - critic/vf_loss:18.472 - critic/vf_clipfrac:0.384 - critic/vpred_mean:1.038 - critic/grad_norm:942.924 - critic/lr(1e-4):0.100 - timing/update_actor:20.526 - actor/entropy_loss:0.440 - actor/pg_loss:0.000 - actor/pg_clipfrac:0.002 - actor/ppo_kl:0.000 - actor/grad_norm:2.060 - actor/lr(1e-4):0.010 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:0.000 - critic/advantages/max:2.702 - critic/advantages/min:-2.616 - critic/returns/mean:0.000 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.280 - critic/values/max:11.000 - critic/values/min:-16.000 - response_length/mean:232.242 - response_length/max:256.000 - response_length/min:91.000 - prompt_length/mean:102.398 - prompt_length/max:185.000 - prompt_length/min:70.000
Check out the ``Algorithm Baselines`` page for full training and validation logs for reference.
The checkpoint is saved at the following directory by default: ``checkpoints/${trainer.project_name}/${trainer.experiment_name}``. You can merge the saved checkpoints into a Hugging Face model using the ``verl.model_merger`` module, for example:
.. code-block:: bash
python3 -m verl.model_merger merge \
--backend fsdp \
--local_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor \
--target_dir checkpoints/${trainer.project_name}/${trainer.experiment_name}/global_step_1/actor/huggingface
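After merging, the resulting directory can be loaded like any Hugging Face checkpoint. The snippet below is illustrative only; substitute the ``<project_name>``/``<experiment_name>`` placeholders with your actual trainer settings.

.. code-block:: python

    # Illustrative only: load the merged checkpoint as a regular Hugging Face model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    merged_path = "checkpoints/<project_name>/<experiment_name>/global_step_1/actor/huggingface"
    model = AutoModelForCausalLM.from_pretrained(merged_path)
    tokenizer = AutoTokenizer.from_pretrained(merged_path)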
For more details about checkpoint and model merging, please refer to :ref:`checkpoint-page`.
To enable ``wandb`` for experiment tracking, set the following configs:
.. code-block:: bash
trainer.logger='["console","wandb"]' \
trainer.project_name=$YOUR_PROJECT_NAME \
trainer.experiment_name=$YOUR_RUN_NAME \
If you encounter out-of-memory issues on GPUs with less than 32GB of HBM, enabling the following configs can help:
.. code-block:: bash
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
critic.ppo_micro_batch_size_per_gpu=1 \
For the full set of configs, please refer to :ref:`config-explain-page` for detailed explanation and performance tuning.
.. [1] The original paper (https://arxiv.org/pdf/2110.14168) mainly focuses on training a verifier (a reward model) to solve math problems via Best-of-N sampling. In this example, we train an RL agent using a rule-based reward model.
.. [2] More training script examples for FSDP and Megatron-LM backend are stored in `examples/ppo_trainer <https://github.com/volcengine/verl/tree/main/examples/ppo_trainer>`_ directory.
Ray Debug Tutorial
==================
Last updated: 04/23/2025
Author: `Ao Shen <https://aoshen524.github.io/>`_.
How to debug?
---------------------
Ray Distributed Debugger VSCode Extension (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Starting with Ray 2.39, Anyscale has introduced the `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_ VSCode extension. Follow the extension’s installation instructions, then add your cluster using the dashboard URL you obtained earlier.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true
:alt: Ray Distributed Debugger VSCode extension screenshot
2. Prerequisites.
Ensure the following are installed (see the extension README for more detail):
- Visual Studio Code
- `ray[default]` >= 2.9.1
- `debugpy` >= 1.8.0
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/readme.png?raw=true
:alt: VSCode with Ray prerequisites
3. Environment Variables.
To enable post‑mortem debugging, set:
.. code-block:: bash
export RAY_DEBUG_POST_MORTEM=1
.. admonition:: Note
:class: important
Be sure to remove any legacy flags before starting Ray:
- `RAY_DEBUG=legacy`
- `--ray-debugger-external`
4. Configuring Breakpoints. Set up ``breakpoint()`` in your code and submit the job to the cluster; the extension will then show the breakpoint information.
1. Insert `breakpoint()` calls into your remote functions.
2. Submit your job to the cluster.
The extension will detect active breakpoints and display them in VSCode.
**Note:** Breakpoints are only supported inside functions decorated with `@ray.remote`.
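For illustration, a minimal hypothetical ``job.py`` (the function and values below are placeholders, not verl code) could look like this:

.. code-block:: python

    # job.py -- minimal sketch of a Ray task with a breakpoint (illustrative only)
    import ray

    ray.init()

    @ray.remote
    def my_task(x):
        y = x * 2
        breakpoint()  # the Ray Distributed Debugger pauses execution here
        return y

    # submit the task and block until the result is ready
    print(ray.get(my_task.remote(21)))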
5. Launching the Debugger.
Run your job directly from the command line (do not use a `launch.json`):
.. code-block:: bash
python job.py
6. Attaching to a Breakpoint.
Once the process hits the first `breakpoint()`, click the Ray Distributed Debugger icon in the VSCode sidebar to attach the debugger.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/launch.png?raw=true
:alt: Attaching VSCode debugger to Ray process
7. Debugging with multiple ``breakpoint()`` calls.
For each subsequent task, first disconnect the current debugger session, then click the extension icon again to attach to the next breakpoint.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/disconnect.png?raw=true
:alt: Disconnecting and reconnecting the debugger
Legacy Ray Debugger
~~~~~~~~~~~~~~~~~~~
1. Ray has a built-in legacy `debugger <https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html>`_ that allows you to debug your distributed applications. To enable it, start the Ray cluster with ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``.
.. code-block:: bash
# start head node
RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external
# start worker node
RAY_DEBUG=legacy ray start --address='10.124.46.192:6379' --ray-debugger-external
2. Set a breakpoint in your code and submit the job to the cluster. Then run ``ray debug`` to wait for the breakpoint:
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/legacy.png?raw=true
PyTorch FSDP Backend
======================
Last updated: 02/12/2025.
We support the PyTorch FSDP backend by implementing various workers for the actor, critic, reference, rollout and reward models. We also implement the ``FSDPVLLMShardingManager`` that reshards weights between FSDP and vLLM in `fsdp_vllm.py <https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/fsdp_vllm.py>`_.
**Pros**
- Readily supports various models.
- Users only need to implement the corresponding ``dtensor_weight_loader`` for weight synchronization between FSDP and vLLM. With ``hf_weight_loader``, users can directly use any model supported by both HF and vLLM without any code change.
- Easy to organize the forward and backward computation for each model.
**Cons**
- Poor scalability for large-scale models (e.g., Llama 70B and 405B).
- The resharding overhead between actor and rollout can be larger than with the Megatron-LM backend.
Due to its simplicity, we recommend using the FSDP backend for algorithm research and prototyping.
FSDP Workers
--------------
ActorRolloutRefWorker
^^^^^^^^^^^^^^^^^^^^^
Actor/Rollout HybridEngine
''''''''''''''''''''''''''
1. HybridEngine, Actor and Rollout initialization API.
.. code:: python
@register(dispatch_mode=Dispatch.ONE_TO_ALL)
def init_model(self):
``ONE_TO_ALL``: when calling the ``init_model`` function from the driver
process, each worker (on a GPU) will execute the following model
initialization process.
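Conceptually, ``ONE_TO_ALL`` replicates a single driver-side call to every worker in the group. The toy sketch below (not verl's actual dispatch code) illustrates the idea:

.. code:: python

    # Toy sketch of ONE_TO_ALL dispatch (not verl's actual implementation):
    # the driver issues one call and every worker executes it with the same arguments.
    class ToyWorkerGroup:
        def __init__(self, workers):
            self.workers = workers

        def call_one_to_all(self, method_name, *args, **kwargs):
            return [getattr(w, method_name)(*args, **kwargs) for w in self.workers]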
The initialization details of HybridEngine, Actor and Rollout are
highlighted below:
1. ``DataParallelPPOActor`` implements the core PPO computation logic when the model is built with FSDP, including log-prob computation and model updates.
2. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM Engine and make it execute under SPMD to fit into our ``WorkerGroup`` design.
3. ``FSDPVLLMShardingManager`` is a context manager that performs the actual resharding between actor and rollout.
See the `source code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_ for more information.
2. Generate sequences and recompute log prob
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def generate_sequences(self, prompts: DataProto):
- ``Dispatch.DP_COMPUTE_PROTO``: The data will be dispatched and
collected along the DP dimension
- In this function, the rollout model will perform auto-regressive
generation and the actor model will recompute the old log prob for the
generated response.
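As a rough illustration (again, not the real decorator internals), ``DP_COMPUTE_PROTO`` can be thought of as chunking the batch along the DP dimension, running each chunk on one worker, and concatenating the outputs:

.. code:: python

    # Toy sketch of DP_COMPUTE_PROTO-style dispatch (not verl's actual implementation).
    def dp_compute(batch, workers, method_name):
        n = len(workers)
        chunk_size = (len(batch) + n - 1) // n
        # one contiguous chunk per DP rank
        chunks = [batch[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]
        # each worker computes on its own chunk
        outputs = [getattr(w, method_name)(chunk) for w, chunk in zip(workers, chunks)]
        # collect: concatenate the per-worker results back together
        return [item for out in outputs for item in out]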
3. Update actor model
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def update_actor(self, data: DataProto):
- Update the actor model weight using PPO & entropy loss.
ReferenceModel
''''''''''''''
1. Reference model initialization
The reference model is initialized using the same function as the actor model, but without initializing the HybridEngine and optimizer. The reference model is then also wrapped by ``DataParallelPPOActor``.
2. Compute reference log prob
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def compute_ref_log_prob(self, data: DataProto):
- In this function, the reference model will call the compute log prob
function in ``DataParallelPPOActor`` to compute the reference log
prob.
CriticWorker and RewardWorker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Model initialization
Quite similar to the reference model. The CriticWorker performs additional initialization for the optimizer.
2. Compute Values for CriticWorker
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def compute_values(self, data: DataProto):
3. Update Critic
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def update_critic(self, data: DataProto):
4. Compute Reward
.. code:: python
@register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
def compute_rm_score(self, data: DataProto):
HybridShard
------------
We do not yet support FSDP ``HybridShard``. To support it, we would need to construct a 2D device mesh and test the corresponding ``dtensor_weight_loader`` and ``hf_weight_loader`` for each model.
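The following is a minimal sketch of the kind of 2D device mesh that hybrid sharding would need. It is plain PyTorch, not verl code, and the 2x4 layout assumes 8 GPUs launched with ``torchrun``:

.. code:: python

    # Minimal 2D device-mesh sketch for hybrid sharding (replicate x shard).
    # Run with: torchrun --nproc_per_node=8 mesh_demo.py
    from torch.distributed.device_mesh import init_device_mesh

    # the outer dim replicates the model, the inner dim shards it
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
    print(mesh_2d)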
Megatron-LM Backend
===================
Last updated: 06/24/2025.
We support Megatron Backend by implementing various workers for actor,
critic, reference, rollout and reward models. We also implement the
``3DHybridEngine`` using Megatron-LM and vLLM/SGLang in
`megatron_vllm.py <https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/megatron_vllm.py>`_
and `megatron_sglang.py <https://github.com/volcengine/verl/blob/main/verl/workers/sharding_manager/megatron_sglang.py>`_.
**Pros**
- Supports 5D parallelism (TP, EP, CP, DP, PP) and sequence parallelism for the best scalability and throughput.
- The 3D HybridEngine can significantly reduce peak memory usage and the weight synchronization overhead between actor and rollout.
**Cons**
- Hugging Face models and Megatron checkpoints require conversion tools.
Development Progress
--------------------
Note that [Deprecated] means the feature is no longer supported in the latest
version of verl.
[To-Optimize] means the feature is implemented but not yet optimized.
[WIP] means the feature is a work in progress.
[In-Release] means the feature is ready and under review, coming soon.
+---------------+-----------------------------------------------------------+
| [Deprecated] | Megatron 3D Parallelism with custom models |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron 0.11.0 ``GPTModel`` support |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron GRPO support |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron with vLLM 0.8.2, with per-tensor weights loading |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron with Context Parallel |
+---------------+-----------------------------------------------------------+
| [Done] | Qwen2MoE model support |
+---------------+-----------------------------------------------------------+
| [To-Optimize] | Megatron dist Checkpoint |
+---------------+-----------------------------------------------------------+
| [To-Optimize] | Huggingface and Megatron Checkpoint Converter |
+---------------+-----------------------------------------------------------+
| [To-Optimize] | Efficient fused linear, entropy and cross entropy |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron offload(param, grad, optimizer) |
+---------------+-----------------------------------------------------------+
| [Done] | Megatron Profiler |
+---------------+-----------------------------------------------------------+
| [In-Release] | Megatron 0.12.0, TE 2.2 with vLLM 0.8.3 and Fused Attn |
+---------------+-----------------------------------------------------------+
| [WIP] | Moonlight/DeepSeek-V3 model support |
+---------------+-----------------------------------------------------------+
| [WIP] | Expert Parallel support |
+---------------+-----------------------------------------------------------+
| [WIP] | Megatron support dynamic batch size |
+---------------+-----------------------------------------------------------+
| [To-Do] | Performance tuning |
+---------------+-----------------------------------------------------------+
| [MileStone] | Runnable with DeepSeek-V3 671B post-training |
+---------------+-----------------------------------------------------------+
Utils of Megatron Workers
-------------------------
MegatronWorker
^^^^^^^^^^^^^^
``MegatronWorker`` is the base class of the different Megatron worker
classes. In this class, the ``get_megatron_global_info`` and
``get_megatron_rank_info`` functions retrieve the 3D parallel world
size and rank of each ``Worker`` running on a specific GPU. This information
is used in the transfer protocol for the Megatron backend.
The following ``Worker`` classes for different models are used to
construct the ``WorkerGroup``.
We implement various APIs for each ``Worker`` class, decorated by
``@register(dispatch_mode=)``. These APIs can be called by the Ray
driver process. The data is correctly collected and dispatched following
the ``dispatch_mode`` of each function. The supported ``dispatch_mode`` values
(i.e., transfer protocols) can be found in `decorator.py <https://github.com/volcengine/verl/blob/main/verl/single_controller/base/decorator.py>`_.
ActorRolloutRefWorker
^^^^^^^^^^^^^^^^^^^^^
This class is implemented for Actor/Rollout HybridEngine or for the
reference model to initialize their model and perform computation.
Actor/Rollout HybridEngine
''''''''''''''''''''''''''
1. HybridEngine, Actor and Rollout initialization API.
.. code:: python
@register(dispatch_mode=Dispatch.ONE_TO_ALL)
def init_model(self):
``ONE_TO_ALL``: when calling the ``init_model`` function from the driver
process, each worker (on a GPU) will execute the following model
initialization process.
The initialization details of HybridEngine, Actor and Rollout are
highlighted below:
1. ``MegatronPPOActor`` implements the core PPO computation logic when the model is built with Megatron, including log-prob computation and model updates.
2. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM Engine and make it execute under SPMD to fit into our ``WorkerGroup`` design.
3. ``MegatronVLLMShardingManager`` is a context manager that performs the actual resharding between actor and rollout.
See the `source code <https://github.com/volcengine/verl/blob/main/verl/workers/megatron_workers.py#L63>`_ for more information.
.. code:: python
# build actor model
self.actor = MegatronPPOActor(config=self.config.actor,
model_config=self.actor_model_config,
megatron_config=megatron_config,
actor_module=self.actor_module,
actor_optimizer=self.actor_optimizer,
actor_optimizer_config=self.actor_optim_config)
# build rollout
# rollout initialization
rollout = vLLMRollout(actor_module=params,
config=self.config.rollout,
tokenizer=self.tokenizer,
model_hf_config=self.actor_model_config,
train_tp=mpu.get_tensor_model_parallel_world_size())
# perform weight resharding between actor and rollout
sharding_manager = MegatronVLLMShardingManager(module=self.hybrid_engine,
inference_engine=rollout.inference_engine,
model_config=self.actor_model_config,
layer_name_mapping=layer_name_mapping)
...
2. Generate sequences and recompute log prob
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_PP_AS_DP_PROTO)
def generate_sequences(self, prompts: DataProto):
- ``Dispatch.MEGATRON_PP_AS_DP_PROTO``: The PP dimension of the actor
model will be regarded as DP dimension. Then the driver process will
dispatch and collect the data according to this reorganization. This
is because, in HybridEngine, the actor weight, which usually applied
larger 3D parallel sizes, will be gathered along the PP dimension and
TP dimension. Therefore, the corresponding data should be dispatched
and collected through the 3D parallel group of the rollout model,
rather than the actor model. However, the world_size and rank
information can only be retrieved from ``get_megatron_global_info`` and
``get_megatron_rank_info``, which records the 3D information for the
actor model. Moreover, the data resharding inside TP dimension will be
processed within the HybridEngine.
- In this function, the rollout model will perform auto-regressive
generation and the actor model will recompute the old log prob for the
generated response.
3. Update actor model
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def update_actor(self, data: DataProto):
- ``Dispatch.MEGATRON_COMPUTE_PROTO``: The user passes data partitioned along the DP dimension. The data is dispatched to all tp/pp ranks within the same dp group, and output data is ultimately collected only from tp=0 and the last pp stage.
- Update the actor model weight using PPO & entropy loss.
- Update the actor model weight using PPO & entropy loss.
.. note::
Currently, training Tensor Parallel Size can be different from inference
Tensor Parallel Size.
ReferenceModel
''''''''''''''
1. Reference model initialization
The reference model is initialized using the same function as the actor model, but without initializing the HybridEngine and optimizer. The reference model is then also wrapped by ``MegatronPPOActor``.
2. Compute reference log prob
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_ref_log_prob(self, data: DataProto):
- In this function, the reference model will call the compute log prob
function in ``MegatronPPOActor`` to compute the reference log prob.
CriticWorker and RewardWorker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Model initialization
Quite similar to the reference model. The CriticWorker performs additional initialization for the optimizer.
2. Compute Values for CriticWorker
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_values(self, data: DataProto):
3. Update Critic
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def update_critic(self, data: DataProto):
4. Compute Reward
.. code:: python
@register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
def compute_rm_score(self, data: DataProto):
Utils of Train Optimization
---------------------------
Offload
^^^^^^^
When resources are tight, offloading can lower GPU memory usage, helping the training and inference frameworks work well under verl. It moves parameters, gradients, and optimizer states to CPU memory and only loads them back to the GPU when needed.
To use offloading, add the following parameters for the actor and ref separately.
.. code:: bash
# For the actor
actor_rollout_ref.actor.megatron.param_offload=True \
actor_rollout_ref.actor.megatron.grad_offload=True \
actor_rollout_ref.actor.megatron.optimizer_offload=True \
# For the ref w/o grad and optimizer
actor_rollout_ref.ref.megatron.param_offload=True \
For the critic, you can include these parameters.
.. code:: bash
# For the critic
critic.megatron.param_offload=True \
critic.megatron.grad_offload=True \
critic.megatron.optimizer_offload=True \
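Conceptually, parameter offload just moves tensors between CPU and GPU around each use. The toy sketch below is only an illustration of the idea; verl's actual implementation also handles gradients and optimizer state:

.. code:: python

    # Toy sketch of parameter offload (not verl's actual implementation).
    import torch

    def offload_params_to_cpu(module: torch.nn.Module):
        # free GPU memory while the module is idle
        module.to("cpu")

    def load_params_to_gpu(module: torch.nn.Module, device: str = "cuda"):
        # bring the weights back right before compute
        module.to(device)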
Profiler
^^^^^^^^
The profiler is a tool that helps you understand the performance of your
model. It can be used to profile the time spent on different operations
and identify the bottlenecks. You can get more information from
`torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_.
In verl, the profiler is currently only supported for the actor role in Megatron. You can set
the begin and end steps to profile; note that one step means one gradient update. The
profiling result will be saved in ``save_path``. If you only want to profile specific
ranks, set ``profile_ranks``; by default, it is ``[0]``.
.. code:: bash
actor_rollout_ref.actor.profile.use_profile=True \
actor_rollout_ref.actor.profile.profile_ranks=[0] \
actor_rollout_ref.actor.profile.step_start=0 \
actor_rollout_ref.actor.profile.step_end=1 \
actor_rollout_ref.actor.profile.save_path="./profile"
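If you want to see what such a trace looks like outside verl, the standalone ``torch.profiler`` snippet below roughly mirrors the ``step_start``/``step_end`` idea above (the save path matches the config; everything else is illustrative):

.. code:: python

    # Standalone torch.profiler example (illustrative; not verl's internal wiring).
    import torch
    from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

    prof = profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=0, warmup=1, active=1),  # skip step 0, profile step 1
        on_trace_ready=tensorboard_trace_handler("./profile"),
    )
    with prof:
        for step in range(2):
            # ... one gradient update would go here ...
            torch.randn(1024, 1024) @ torch.randn(1024, 1024)  # dummy work
            prof.step()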
Related MCore Document
----------------------
There is also a detailed document on using MCore to train different
kinds of models; please refer to the `MCore Document <https://github.com/volcengine/verl/blob/main/verl/models/mcore/readme.md>`_.
PPO Ray Trainer
===============
Last updated: 02/12/2025.
We implement the ``RayPPOTrainer``, which is a trainer that runs on the driver
process on a single CPU/GPU node (CPU by default).
The ``RayPPOTrainer`` includes three core functions for data preparation,
WorkerGroup initialization and the PPO training loop.
Data Preparation
----------------
The ``RayPPOTrainer``, as a single process, is responsible for loading a
complete batch of samples (prompts) from the dataset and then dispatching
them to different worker groups running on different GPUs.
To generalize the data loading, we implement the ``RLHFDataset`` class
to load the preprocessed parquet files, apply chat templates to the
prompts, add padding, truncate prompts that exceed max prompt length and
then tokenize.
.. code:: python
self.train_dataset = RLHFDataset(data_files=self.config.data.train_files,
tokenizer=self.tokenizer,
config=self.config.data)
Then, the dataloader iterates over the dataset with the PPO mini-batch size.
WorkerGroup Initialization
--------------------------
We first introduce a basic implementation of initializing the
``WorkerGroup`` of the actor model on a given set of GPUs.
.. code:: python
# max_colocate_count means the number of WorkerGroups (i.e. processes) in each RayResourcePool
# For FSDP backend, we recommend using max_colocate_count=1 that merge all WorkerGroups into one.
# For Megatron backend, we recommend using max_colocate_count>1 that can utilize different WorkerGroup for differnt models
resource_pool = RayResourcePool(process_on_nodes=[config.trainer.n_gpus_per_node] * config.trainer.nnodes,
use_gpu=True,
max_colocate_count=1)
# define actor rollout cls to be init on remote
actor_rollout_cls = RayClassWithInitArgs(cls=ActorRolloutWorker)
# define actor_rollout worker group
actor_rollout_worker_group = MegatronRayWorkerGroup(resource_pool=resource_pool,
ray_cls_with_init=actor_rollout_cls,
default_megatron_kwargs=config.actor_rollout.megatron)
In the above implementation, different WorkerGroups, such as ``actor_rollout_worker_group``,
``critic_worker_group`` and ``ref_worker_group``, each live in a separate
process.
The driver process can then call the distributed compute function within
the ``actor_rollout_worker_group`` and other roles to construct the RL
training loop.
For models colocated on the same set of GPUs, we further provide a
fine-grained optimization that merges the ``worker_group`` of different roles
into the same process. This optimization saves the redundant
CUDA/distributed contexts of separate processes.
.. code:: python
# initialize WorkerGroup
# NOTE: if you want to use a different resource pool for each role, which can support different parallel size,
# you should not use `create_colocated_worker_cls`. Instead, directly pass different resource pool to different worker groups.
# See TODO(url) for more information.
all_wg = {}
for resource_pool, class_dict in self.resource_pool_to_cls.items():
worker_dict_cls = create_colocated_worker_cls(class_dict=class_dict)
wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
spawn_wg = wg_dict.spawn(prefix_set=class_dict.keys())
all_wg.update(spawn_wg)
if self.use_critic:
self.critic_wg = all_wg['critic']
self.critic_wg.init_model()
if self.use_reference_policy:
self.ref_policy_wg = all_wg['ref']
self.ref_policy_wg.init_model()
if self.use_rm:
self.rm_wg = all_wg['rm']
self.rm_wg.init_model()
# we should create rollout at the end so that vllm can have a better estimation of kv cache memory
self.actor_rollout_wg = all_wg['actor_rollout']
self.actor_rollout_wg.init_model()
.. note:: For the Megatron backend, if we merge the ``worker_groups`` into the same processes, all roles will use the same 3D parallel size. To optimize this, we may need to maintain several 3D process groups for each role within the same distributed context. If you want to use different 3D parallel sizes for different roles, please follow a similar architecture to the first code block to initialize each role's ``worker_group``.
PPO Training Loop
-----------------
We implement the PPO training loop by calling the functions of each role's
worker_group. The input and output data of each function is
a ``DataProto`` object implemented in `protocol.py <https://github.com/volcengine/verl/blob/main/verl/protocol.py>`_. In the training
loop, the trainer dispatches/collects data to/from different GPUs
following the transfer protocols wrapped in the workers' functions. The
computation over PPO micro-batches is handled inside the ``update_actor`` and
``update_critic`` functions.
To extend to other RLHF algorithms, such as DPO, GRPO, please refer to
:doc:`../advance/dpo_extension`.
.. code:: python
def fit(self):
"""
The training loop of PPO.
The driver process only need to call the compute functions of the worker group through RPC to construct the PPO dataflow.
The light-weight advantage computation is done on the driver process.
"""
from verl.utils.tracking import Tracking
from omegaconf import OmegaConf
logger = Tracking(project_name=self.config.trainer.project_name,
experiment_name=self.config.trainer.experiment_name,
default_backend=self.config.trainer.logger,
config=OmegaConf.to_container(self.config, resolve=True))
global_steps = 0
# perform validation before training
# currently, we only support validation using the reward_function.
if self.val_reward_fn is not None:
val_metrics = self._validate()
pprint(f'Initial validation metrics: {val_metrics}')
for epoch in range(self.config.trainer.total_epochs):
for batch_dict in self.train_dataloader:
metrics = {}
batch: DataProto = DataProto.from_single_dict(batch_dict)
# batch = batch.to('cuda')
# pop those keys for generation
gen_batch = batch.pop(batch_keys=['input_ids', 'attention_mask', 'position_ids'])
# generate a batch
with Timer(name='gen', logger=None) as timer:
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
metrics['timing/gen'] = timer.last
batch = batch.union(gen_batch_output)
if self.use_reference_policy:
# compute reference log_prob
with Timer(name='ref', logger=None) as timer:
ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
batch = batch.union(ref_log_prob)
metrics['timing/ref'] = timer.last
# compute values
with Timer(name='values', logger=None) as timer:
values = self.critic_wg.compute_values(batch)
batch = batch.union(values)
metrics['timing/values'] = timer.last
with Timer(name='adv', logger=None) as timer:
# compute scores. Support both model and function-based.
# We first compute the scores using reward model. Then, we call reward_fn to combine
# the results from reward model and rule-based results.
if self.use_rm:
# we first compute reward model score
reward_tensor = self.rm_wg.compute_rm_score(batch)
batch = batch.union(reward_tensor)
# we combine with rule-based rm
reward_tensor = self.reward_fn(batch)
batch.batch['token_level_scores'] = reward_tensor
# compute rewards. apply_kl_penalty if available
batch, kl_metrics = apply_kl_penalty(batch,
kl_ctrl=self.kl_ctrl_in_reward,
kl_penalty=self.config.algorithm.kl_penalty)
metrics.update(kl_metrics)
# compute advantages, executed on the driver process
batch = compute_advantage(batch,
self.config.algorithm.gamma,
self.config.algorithm.lam,
adv_estimator=self.config.algorithm.adv_estimator)
metrics['timing/adv'] = timer.last
# update critic
if self.use_critic:
with Timer(name='update_critic', logger=None) as timer:
critic_output = self.critic_wg.update_critic(batch)
metrics['timing/update_critic'] = timer.last
critic_output_metrics = reduce_metrics(critic_output.meta_info['metrics'])
metrics.update(critic_output_metrics)
# implement critic warmup
if self.config.trainer.critic_warmup <= global_steps:
# update actor
with Timer(name='update_actor', logger=None) as timer:
actor_output = self.actor_rollout_wg.update_actor(batch)
metrics['timing/update_actor'] = timer.last
actor_output_metrics = reduce_metrics(actor_output.meta_info['metrics'])
metrics.update(actor_output_metrics)
# validate
if self.val_reward_fn is not None and (global_steps + 1) % self.config.trainer.test_freq == 0:
with Timer(name='testing', logger=None) as timer:
val_metrics: dict = self._validate()
val_metrics = {f'val/{key}': val for key, val in val_metrics.items()}
metrics['timing/testing'] = timer.last
metrics.update(val_metrics)
# collect metrics
data_metrics = compute_data_metrics(batch=batch)
metrics.update(data_metrics)
# TODO: make a canonical logger that supports various backend
logger.log(data=metrics, step=global_steps)
if self.config.trainer.save_freq > 0 and (global_steps + 1) % self.config.trainer.save_freq == 0:
actor_local_path = os.path.join(self.config.trainer.default_local_dir, 'actor',
f'global_step_{global_steps}')
actor_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'actor')
self.actor_rollout_wg.save_checkpoint(actor_local_path, actor_remote_path)
if self.use_critic:
critic_local_path = os.path.join(self.config.trainer.default_local_dir, 'critic',
f'global_step_{global_steps}')
critic_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'critic')
self.critic_wg.save_checkpoint(critic_local_path, critic_remote_path)
global_steps += 1
# perform validation after training
if self.val_reward_fn is not None:
val_metrics = self._validate()
pprint(f'Final validation metrics: {val_metrics}')
SGLang Backend
==============
Last updated: 05/31/2025.
**Authored By SGLang RL Team and listed alphabetically by last name**
`Jingyi Chen <https://github.com/fzyzcjy>`_, `Yitong Guan <https://github.com/minleminzui>`_, `Zhuobin Huang <https://zobinhuang.github.io/sec_about/>`_, `Jiajun Li <https://github.com/guapisolo>`_, `Ji Li <https://github.com/GeLee-Q>`_, `Shenggui Li <https://franklee.xyz/about>`_, `Junrong Lin <https://github.com/ocss884>`_, `Xiang Long <https://github.com/SwordFaith>`_, `Rui Lu <https://scholar.google.com/citations?user=-MGuqDcAAAAJ>`_, `Jin Pan <https://jhinpan.github.io/>`_, `Shuai Shi <https://github.com/shuaills>`_, `Yushen Su <https://yushengsu-thu.github.io/>`_, `Xinyuan Tong <https://github.com/JustinTong0323>`_, `Chendong Wang <https://github.com/cedricbeta>`_, `Hanchen Zhang <https://scholar.google.com/citations?user=pGcJcagAAAAJ>`_, `Haoran Wang <https://ubecc.github.io/about/>`_, `Yongan Xiang <https://github.com/BearBiscuit05>`_, `Chengxing Xie <https://yitianlian.github.io/>`_, `Yuhao Yang <https://github.com/yhyang201>`_, `Jinwei Yao <https://kivi-yao.github.io/>`_, `Qiaolin Yu <https://github.com/Qiaolin-Yu>`_, `Yuzhen Zhou <https://github.com/zyzshishui>`_, `Chenyang Zhao <https://github.com/zhaochenyang20>`_
Introduction
------------
`SGLang <https://github.com/sgl-project/sglang>`_ is an open-source, state-of-the-art inference serving engine, fully adopted by xAI to support all of Grok's inference needs during research and serving.
Currently, verl fully supports using SGLang as the inference engine during the rollout phase. As a rollout engine, SGLang provides the same feature coverage as vLLM, including memory saving and multi-node rollout. After installing verl and SGLang, simply add ``actor_rollout_ref.rollout.name=sglang`` to the startup script to seamlessly switch between the two inference frameworks.
In addition, the SGLang team is actively working on supporting features such as Multi-Turn Agentic RL, VLM RLHF, Server-Based RLHF, and Partial Rollout. You can track the related development progress in the `Tracking Roadmap <https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/74>`_.
Installation
------------
Please use the following commands to install SGLang with verl.
.. code-block:: bash
pip install --upgrade pip
# Currently 0.4.8, subject to updates at any time, please refer to the latest version specified in `setup.py`
pip install -e ".[sglang]"
You can check that the following dependencies are in your environment:
.. note::
- **PyTorch**: 2.6.0+cu124
- **CUDA**: 12.4
- **flashinfer-python**: 0.2.5+cu124torch2.6
- **SGLang**: 0.4.6.post5
- **sgl-kernel**: 0.1.4
Using SGLang as the Inference Backend for PPO Training on a Single Machine
-------------------------------------------------------------------------
We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.
1. Run the following command to prepare the gsm8k dataset:
.. code-block:: bash
python3 examples/data_preprocess/gsm8k.py
2. Run the following script to conduct a PPO experiment on a single machine with 4 GPUs:
.. code-block:: bash
export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=4096 \
data.max_prompt_length=4096 \
data.max_response_length=4096 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2-7B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.model.fsdp_config.param_offload=True \
critic.model.fsdp_config.optimizer_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=console \
trainer.val_before_train=False \
trainer.n_gpus_per_node=4 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
Why export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. ``verl`` initializes a ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.
2. ``SGLangRollout`` will initialize ``Engine``, and further initialize a ``torch.distributed.DeviceMesh``, used to support Tensor Parallel (TP).
3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it directly reports an error to avoid initialization failures or deadlocks.
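The check is conceptually similar to the toy function below (this is not SGLang's actual code, only an illustration of why a large free-memory spread aborts initialization):

.. code-block:: python

    # Toy illustration of a free-memory imbalance check (not SGLang's actual code).
    def check_memory_balance(free_bytes_per_rank, tolerance=0.1):
        lo, hi = min(free_bytes_per_rank), max(free_bytes_per_rank)
        if (hi - lo) / hi > tolerance:
            raise RuntimeError(
                f"free GPU memory differs by {(hi - lo) / hi:.0%} across ranks, "
                f"exceeding the {tolerance:.0%} tolerance"
            )

    # rank 1 loaded its model earlier and has much less free memory -> raises
    check_memory_balance([80e9, 40e9])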
Why might there be inconsistent GPU memory?
"""""""""""""""""""""""""""""""""""""""""""
**1. Ray Distributed Actor loads the model at different times**
``verl`` uses Ray-based multi-process, multi-GPU concurrent training. Each ``WorkerDict`` may be called at different times:
.. code-block:: python
self.rollout = SGLangRollout(...)
Different workers initialize the model at different times → different memory usage.
**2. Delayed initialization causes memory bias**
Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
Early workers already use up GPU memory → late workers still have empty memory → memory difference appears.
**3. SGLang's TP init uses "all-device broadcast", but there's no uniform release timing**
Although ``SGLangRollout`` may only involve a subset of GPUs, its ``Engine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so:
- Non-rollout GPUs also join the communication.
- Later on, ``DeviceMesh`` init will fail due to "inconsistent memory".
**4. Different FSDP/TP loading behaviors also lead to mismatch**
If using:
.. code-block:: bash
actor.fsdp_config.param_offload=True
ref.fsdp_config.param_offload=True
Then some workers keep params on CPU while others have already sharded them to GPU → an asymmetric memory layout appears.
Using SGLang as the Inference Backend for PPO Training Across Multiple Machines
------------------------------------------------------------------------------
SGLang also supports running verl's Ray-based cross-machine inference in IPv4 and IPv6 scenarios. In the script below, we use TP=16 for cross-machine inference. Suppose we have two interconnected machines: node0 with IP 10.94.16.4 and node1 with IP 10.94.16.5.
1. Start Ray on node0:
.. code-block:: bash
ray start --head --dashboard-host=0.0.0.0
You will see the following prompt:
.. code-block:: bash
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 10.94.16.4
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='10.94.16.4:6379'
2. Have node1 join the Ray cluster:
Run the following command on node1:
.. code-block:: bash
ray start --address='10.94.16.4:6379'
Run the following command to confirm that the Ray cluster now has two nodes:
.. code-block:: bash
ray status
You can see that the cluster has two nodes with 16 GPUs:
.. code-block:: bash
======== Autoscaler status: 2025-04-09 09:25:37.694016 ========
Node status
---------------------------------------------------------------
Active:
1 node_ef382ffd687d8f6b060c1b68e63ada7341b936fe5b1901dd04de1027
1 node_1eb4d7d07e793114c23a89d1a41f1f76acf6ef5b35af844a4ee8e4ba
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/360.0 CPU
0.0/16.0 GPU
0B/3.39TiB memory
0B/372.53GiB object_store_memory
3. Run the following script to train meta-llama/Llama-3.1-8B-Instruct with TP=16 across 2 machines using 16 GPUs:
.. code-block:: bash
DATA_DIR=$HOME/data/gsm8k
python3 -m verl.trainer.main_ppo \
actor_rollout_ref.rollout.name=sglang \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=4096 \
data.max_prompt_length=4096 \
data.max_response_length=4096 \
actor_rollout_ref.model.path=meta-llama/Llama-3.1-8B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=16 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.rollout.free_cache_engine=True \
actor_rollout_ref.ref.log_prob_micro_batch_size=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=meta-llama/Llama-3.1-8B-Instruct \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size=16 \
critic.model.fsdp_config.param_offload=True \
critic.model.fsdp_config.optimizer_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.val_before_train=True \
trainer.n_gpus_per_node=8 \
trainer.nnodes=2 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
# Copyright 2024 Bytedance Ltd. and/or its affiliates
# Copyright 2023-2024 SGLang Team
# Copyright 2025 ModelBest Inc. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the DAPO-Math-17k dataset to multiturn format
"""
import argparse
import os
import datasets
from verl.utils.hdfs_io import copy, makedirs
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--local_dir", default="~/data/retool_aime2024")
parser.add_argument("--hdfs_dir", default=None)
args = parser.parse_args()
data_path = "BytedTsinghua-SIA/AIME-2024"
dataset = datasets.load_dataset(data_path, "default")
train_dataset = dataset["train"]
# add a row to each data item that represents a unique id
def make_map_fn(split):
def process_fn(example, idx):
orig_extra_info = example.pop("extra_info")
extra_info = orig_extra_info.copy()
extra_info["need_tools_kwargs"] = True
extra_info["tools_kwargs"] = {
"code_interpreter": {
"create_kwargs": {
"ground_truth": example["reward_model"]["ground_truth"],
},
},
}
example["extra_info"] = extra_info
return example
return process_fn
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
local_dir = args.local_dir
hdfs_dir = args.hdfs_dir
train_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))
if hdfs_dir is not None:
makedirs(hdfs_dir)
copy(src=local_dir, dst=hdfs_dir)
# Copyright 2024 Bytedance Ltd. and/or its affiliates
# Copyright 2023-2024 SGLang Team
# Copyright 2025 ModelBest Inc. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Preprocess the DAPO-Math-17k dataset to multiturn format
"""
import argparse
import os
import datasets
from verl.utils.hdfs_io import copy, makedirs
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--local_dir", default="~/data/retool_dapo")
parser.add_argument("--hdfs_dir", default=None)
args = parser.parse_args()
data_path = "BytedTsinghua-SIA/DAPO-Math-17k"
dataset = datasets.load_dataset(data_path, "default")
train_dataset = dataset["train"]
# add a row to each data item that represents a unique id
def make_map_fn(split):
def process_fn(example, idx):
orig_extra_info = example.pop("extra_info")
extra_info = orig_extra_info.copy()
extra_info["need_tools_kwargs"] = True
extra_info["tools_kwargs"] = {
"code_interpreter": {
"create_kwargs": {
"ground_truth": example["reward_model"]["ground_truth"],
},
},
}
example["extra_info"] = extra_info
return example
return process_fn
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
local_dir = args.local_dir
hdfs_dir = args.hdfs_dir
train_dataset.to_parquet(os.path.join(local_dir, "train.parquet"))
if hdfs_dir is not None:
makedirs(hdfs_dir)
copy(src=local_dir, dst=hdfs_dir)