Ray API Design Tutorial
=======================================
We provide a tutorial for our Ray API design, including:
- Ray basic concepts
- Resource Pool and RayWorkerGroup
- Data Dispatch, Execution and Collection
- Initialize the RayWorkerGroup and execute the distributed computation in the given Resource Pool
See details in `tutorial.ipynb <https://github.com/volcengine/verl/blob/main/examples/ray/tutorial.ipynb>`_.
Getting started with AMD (ROCM Kernel)
=====================================================
Author: `Yusheng Su <https://yushengsu-thu.github.io/>`_
Setup
-----
If you run on AMD GPUs (MI300) with the ROCm platform, you cannot use the previous quickstart to run VeRL. Instead, follow the steps below to build a Docker image, and assign ``HIP_VISIBLE_DEVICES`` and ``ROCR_VISIBLE_DEVICES`` when starting RLHF training.
docker/Dockerfile.rocm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
# Build the docker in the repo dir:
# docker build -f docker/Dockerfile.rocm -t verl-rocm:03.04.2015 .
# docker images # you can find your built docker
FROM rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
# Set working directory
# WORKDIR $PWD/app
# Set environment variables
ENV PYTORCH_ROCM_ARCH="gfx90a;gfx942"
# Install vllm
RUN pip uninstall -y vllm && \
rm -rf vllm && \
git clone -b v0.6.3 https://github.com/vllm-project/vllm.git && \
cd vllm && \
MAX_JOBS=$(nproc) python3 setup.py install && \
cd .. && \
rm -rf vllm
# Copy the entire project directory
COPY . .
# Install dependencies
RUN pip install "tensordict<0.6" --no-deps && \
pip install accelerate \
codetiming \
datasets \
dill \
hydra-core \
liger-kernel \
numpy \
pandas \
datasets \
peft \
"pyarrow>=15.0.0" \
pylatexenc \
"ray[data,train,tune,serve]" \
torchdata \
transformers \
wandb \
orjson \
pybind11 && \
pip install -e . --no-deps
Build the image:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
docker build -t verl-rocm .
Run the container
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Optional: Running without root and with user permissions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
docker run --rm -it \
--device /dev/dri \
--device /dev/kfd \
-p 8265:8265 \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME/.ssh:/root/.ssh \
-v $HOME:$HOME \
--shm-size 128G \
-w $PWD \
verl-rocm \
/bin/bash
(Optional): If you do not want to run the container as root and would rather run it as your own user, add ``-e HOST_UID=$(id -u)`` and ``-e HOST_GID=$(id -g)`` to the above docker launch command.
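For reference, a non-root launch could look like the following sketch (this assumes the image entrypoint honors ``HOST_UID``/``HOST_GID`` and switches to that user):

.. code-block:: bash

    docker run --rm -it \
        -e HOST_UID=$(id -u) \
        -e HOST_GID=$(id -g) \
        --device /dev/dri \
        --device /dev/kfd \
        -p 8265:8265 \
        --group-add video \
        --cap-add SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --privileged \
        -v $HOME/.ssh:/root/.ssh \
        -v $HOME:$HOME \
        --shm-size 128G \
        -w $PWD \
        verl-rocm \
        /bin/bash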
Example
-------
Due to a special setting of the AMD (ROCm) build of torch, you need to assign ``HIP_VISIBLE_DEVICES`` and ``ROCR_VISIBLE_DEVICES`` when starting Ray for VeRL's RLHF training.
PPO
~~~
.. code-block:: bash
YOUR_PROJECT_NAME=r1-verl-ppo-upstream
YOUR_RUN_NAME=r1-training_ppo-upstream
# export HYDRA_FULL_ERROR=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
GPUS_PER_NODE=8
MODEL_PATH=Qwen/Qwen2.5-0.5B-Instruct
python3 examples/data_preprocess/gsm8k.py --local_dir data/gsm8k
python3 -c "import transformers; transformers.pipeline('text-generation', model='$MODEL_PATH')"
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=data/gsm8k/train.parquet \
data.val_files=data/gsm8k/test.parquet \
data.train_batch_size=256 \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=1e-5 \
critic.model.path=$MODEL_PATH \
critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['console'] \
trainer.project_name=$YOUR_PROJECT_NAME \
trainer.experiment_name=$YOUR_RUN_NAME \
trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$GPUS_PER_NODE \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.total_epochs=15 #2>&1 | tee verl_demo.log
GRPO
~~~~
.. code-block:: bash
YOUR_PROJECT_NAME=r1-verl-grpo-upstream
YOUR_RUN_NAME=r1-training_grpo-upstream
# export HYDRA_FULL_ERROR=1
# export FSDP_VERBOSE=1
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
GPUS_PER_NODE=8
MODEL_PATH=Qwen/Qwen2.5-0.5B-Instruct
# MODEL_PATH=Qwen/Qwen2-7B-Instruct
python3 examples/data_preprocess/gsm8k.py --local_dir data/gsm8k
python3 -c "import transformers; transformers.pipeline('text-generation', model='$MODEL_PATH')"
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=data/gsm8k/train.parquet \
data.val_files=data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.fsdp_config.param_offload=False \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name=$YOUR_PROJECT_NAME \
trainer.experiment_name=$YOUR_RUN_NAME \
trainer.n_gpus_per_node=$GPUS_PER_NODE \
trainer.val_before_train=False \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15
Multi-node training: slurm with Docker/Podman container
---------------------------------------------------------------------------------------
If you want to run multi-node training with slurm, you can use the following script.
.. note::
1. You need to use ``podman`` or ``docker`` in the following script. We will release the apptainer script later.
2. If you want to use ``podman``, you just replace ``docker`` with ``podman`` in the following script.
The script includes the following steps:
1. SLURM Configuration
2. Environment Setup
3. Docker/Podman Container Setup
4. Ray Cluster Initialization
5. Data Preprocessing
6. Model Setup
7. Training Launch
slurm_script.sh
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=verl-ray-on-slurm
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=200G
#SBATCH --time=30-00:00:00
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=28
#SBATCH --output=../verl_log/slurm-%j.out
#SBATCH --error=../verl_log/slurm-%j.err
#SBATCH --nodelist=gpu-[0,1]
# load necessary modules
### Run this setup
# [Cluster]: Use docker
# docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
##########################################################################
### The following settings should be adjusted for your own project and cluster ###
##########################################################################
### Project
CONTAINER_NAME="multinode_verl_training"
IMG="verl.rocm"
DOCKERFILE="docker/Dockerfile.rocm"
# echo $PWD
verl_workdir="${HOME}/projects/verl_upstream"
export TRANSFORMERS_CACHE="${HOME}/.cache/huggingface"
export HF_HOME=$TRANSFORMERS_CACHE
### Cluster Network Setting
export NCCL_DEBUG=TRACE
export GPU_MAX_HW_QUEUES=2
export TORCH_NCCL_HIGH_PRIORITY=1
export NCCL_CHECKS_DISABLE=1
# export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9
export NCCL_IB_GID_INDEX=3
export NCCL_CROSS_NIC=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_PROTO=Simple
export RCCL_MSCCL_ENABLE=0
export TOKENIZERS_PARALLELISM=false
export HSA_NO_SCRATCH_RECLAIM=1
##########################################################################
### For rocm and training script
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
# Build and launch the Docker container
srun bash -c "
# Exit on any error
set -e
# Clean up dangling images (images with <none> tag)
docker image prune -f
# Need to pull the docker first
docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
if ! docker images --format "{{.Repository}}:{{.Tag}}" | grep -q "${IMG}"; then
echo \"Building ${IMG} image...\"
docker build -f \"${DOCKERFILE}\" -t \"${IMG}\" .
else
echo \"${IMG} image already exists, skipping build\"
fi
# Removing old container if exists
docker rm \"${CONTAINER_NAME}\" 2>/dev/null || true
# Checking network devices
ibdev2netdev
# Launch the docker
docker run --rm -d \
-e HYDRA_FULL_ERROR=1 \
-e HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES} \
-e ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES} \
-e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \
-e NCCL_DEBUG=${NCCL_DEBUG} \
-e GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES} \
-e TORCH_NCCL_HIGH_PRIORITY=${TORCH_NCCL_HIGH_PRIORITY} \
-e NCCL_CHECKS_DISABLE=${NCCL_CHECKS_DISABLE} \
-e NCCL_IB_HCA=${NCCL_IB_HCA} \
-e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX} \
-e NCCL_CROSS_NIC=${NCCL_CROSS_NIC} \
-e CUDA_DEVICE_MAX_CONNECTIONS=${CUDA_DEVICE_MAX_CONNECTIONS} \
-e NCCL_PROTO=${NCCL_PROTO} \
-e RCCL_MSCCL_ENABLE=${RCCL_MSCCL_ENABLE} \
-e TOKENIZERS_PARALLELISM=${TOKENIZERS_PARALLELISM} \
-e HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM} \
-e TRANSFORMERS_CACHE=${TRANSFORMERS_CACHE} \
-e HF_HOME=${HF_HOME} \
--network host \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v \${HOME}:\${HOME} \
-v \${HOME}/.ssh:/root/.ssh \
-w "${verl_workdir}" \
--shm-size 128G \
--name \"${CONTAINER_NAME}\" \
\"${IMG}\" \
tail -f /dev/null
echo \"Container setup completed\"
"
# (Optional): If you do not want to run the container as root and would rather run it as your own user,
# please add `-e HOST_UID=$(id -u)` and `-e HOST_GID=$(id -g)` to the above docker launch script.
### Ray launch the nodes before training
# Getting the node names
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST" | tr '\n' ' '))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
# make sure we set environment variables before Ray initialization
export VLLM_ATTENTION_BACKEND=XFORMERS
# Print out all env variables
printenv
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--dashboard-port=8266 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10
# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Debug: Starting worker on node_i = ${node_i}"
if [ -z "$node_i" ]; then
echo "Error: Empty node name for worker $i"
continue
fi
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
docker exec "${CONTAINER_NAME}" \
ray start --address "$ip_head" --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
sleep 5
done
# Ray initialization test (check whether the above execution raised any errors)
echo "Testing Ray initialization in the slurm nodes..."
docker exec "${CONTAINER_NAME}" python3 -c '
import ray
try:
ray.init(address="auto")
print("\n=== Ray Cluster Status ===")
print(f"Number of nodes: {len(ray.nodes())}")
for node in ray.nodes():
print("Node: {}, Status: {}".format(node["NodeManagerHostname"], node["Alive"]))
# print(f"Node: {node}")
ray.shutdown()
print("Ray initialization successful!")
except Exception as e:
print(f"Ray initialization failed: {str(e)}")
'
echo "=== Ray test completed ==="
######
# Run data preprocessing
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/gsm8k.py" "--local_dir" "../data/gsm8k"
echo "Starting data preprocessing..."
docker exec "${CONTAINER_NAME}" \
python3 "examples/data_preprocess/math_dataset.py" "--local_dir" "../data/math"
train_files="../data/gsm8k/train.parquet"
val_files="../data/gsm8k/test.parquet"
# Download and test model
echo "Loading model..."
docker exec "${CONTAINER_NAME}" \
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
MODEL_PATH="Qwen/Qwen2-7B-Instruct"
# Set model path after pipeline test
MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct"
echo "== Data and model loading Done =="
echo "Start to train..."
docker exec "${CONTAINER_NAME}" \
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')"
MODEL_PATH="Qwen/Qwen2-7B-Instruct"
PYTHONUNBUFFERED=1 srun --overlap --nodes=${SLURM_NNODES} --ntasks=1 -w "$head_node" \
docker exec "${CONTAINER_NAME}" \
python3 -m verl.trainer.main_ppo \
data.train_files=$train_files \
data.val_files=$val_files \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=$MODEL_PATH \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.kl_ctrl.kl_coef=0.0001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=${SLURM_GPUS_PER_NODE} \
trainer.val_before_train=False \
trainer.nnodes=${SLURM_NNODES} \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15
Run slurm_script.sh
~~~~~~~~~~~~~~~~~~~~
Simply submit ``slurm_script.sh`` with ``sbatch``:
.. code-block:: bash
sbatch slurm_script.sh
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- Project information -----------------------------------------------------
project = u'verl'
# pylint: disable=W0622
copyright = u'2024 ByteDance Seed Foundation MLSys Team'
author = u'Guangming Sheng, Chi Zhang, Yanghua Peng, Haibin Lin'
# -- General configuration ---------------------------------------------------
# The master toctree document.
master_doc = 'index'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['recommonmark',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.autosectionlabel',
]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
source_suffix = ['.rst', '.rest', '.md']
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = u'en'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
Data interface
=========================
DataProto is the interface for data exchange.
The :class:`verl.DataProto` class contains two key members:
- batch: a :class:`tensordict.TensorDict` object for the actual data
- meta_info: a :class:`Dict` with additional meta information
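For illustration, a minimal sketch of constructing a ``DataProto`` (assuming the default constructor accepts ``batch`` and ``meta_info`` keyword arguments) looks like this:

.. code-block:: python

    import torch
    from tensordict import TensorDict
    from verl import DataProto

    # the tensor part of the batch, keyed by name, with an explicit batch size
    batch = TensorDict(
        {"input_ids": torch.randint(0, 100, (4, 16)),
         "attention_mask": torch.ones(4, 16, dtype=torch.long)},
        batch_size=[4],
    )
    # meta_info carries non-tensor metadata shared by the whole batch
    data = DataProto(batch=batch, meta_info={"temperature": 1.0})
    print(data.batch["input_ids"].shape)  # torch.Size([4, 16])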
TensorDict
~~~~~~~~~~~~
:attr:`DataProto.batch` is built on top of :class:`tensordict`, a project in the PyTorch ecosystem.
A TensorDict is a dict-like container for tensors. To instantiate a TensorDict, you must specify key-value pairs as well as the batch size.
.. code-block:: python
>>> import torch
>>> from tensordict import TensorDict
>>> tensordict = TensorDict({"zeros": torch.zeros(2, 3, 4), "ones": torch.ones(2, 3, 5)}, batch_size=[2,])
>>> tensordict["twos"] = 2 * torch.ones(2, 5, 6)
>>> zeros = tensordict["zeros"]
>>> tensordict
TensorDict(
fields={
ones: Tensor(shape=torch.Size([2, 3, 5]), device=cpu, dtype=torch.float32, is_shared=False),
twos: Tensor(shape=torch.Size([2, 5, 6]), device=cpu, dtype=torch.float32, is_shared=False),
zeros: Tensor(shape=torch.Size([2, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([2]),
device=None,
is_shared=False)
One can also index a tensordict along its batch_size. The contents of the TensorDict can be manipulated collectively as well.
.. code-block:: python
>>> tensordict[..., :1]
TensorDict(
fields={
ones: Tensor(shape=torch.Size([1, 3, 5]), device=cpu, dtype=torch.float32, is_shared=False),
twos: Tensor(shape=torch.Size([1, 5, 6]), device=cpu, dtype=torch.float32, is_shared=False),
zeros: Tensor(shape=torch.Size([1, 3, 4]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([1]),
device=None,
is_shared=False)
>>> tensordict = tensordict.to("cuda:0")
>>> tensordict = tensordict.reshape(6)
For more about :class:`tensordict.TensorDict` usage, see the official tensordict_ documentation.
.. _tensordict: https://pytorch.org/tensordict/overview.html
Core APIs
~~~~~~~~~~~~~~~~~
.. autoclass:: verl.DataProto
:members: to, select, union, make_iterator, concat
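The following is a rough usage sketch of a few of these methods; keyword names such as ``batch_keys`` are indicative, so check the generated API reference above for the exact signatures.

.. code-block:: python

    import torch
    from tensordict import TensorDict
    from verl import DataProto

    a = DataProto(batch=TensorDict({"obs": torch.zeros(4, 8)}, batch_size=[4]))
    b = DataProto(batch=TensorDict({"act": torch.ones(4, 2)}, batch_size=[4]))

    merged = a.union(b)                            # merge the two key sets; batch sizes must match
    obs_only = merged.select(batch_keys=["obs"])   # keep only a subset of keys
    stacked = DataProto.concat([merged, merged])   # concatenate along the batch dimension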
GSM8K Example
=============
Introduction
------------
In this example, we train an LLM to tackle the GSM8k task.
Paper: https://arxiv.org/pdf/2110.14168
Dataset: https://huggingface.co/datasets/gsm8k
Note that the original paper mainly focuses on training a verifier (a
reward model) to solve math problems via Best-of-N sampling. In this
example, we train an RLHF agent using a rule-based reward model.
Dataset Introduction
--------------------
GSM8k is a math problem dataset. The prompt is an elementary school
problem. The LLM is required to answer the math problem.
The training set contains 7473 samples and the test set contains 1319
samples.
**An example**
Prompt
Katy makes coffee using teaspoons of sugar and cups of water in the
ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups
of water, calculate the number of teaspoonfuls of sugar she used.
Solution
The total ratio representing the ingredients she used to make the
coffee is 7+13 = <<7+13=20>>20 Since the fraction representing the
number of teaspoons she used is 7/20, she used 7/20\ *120 =
<<7/20*\ 120=42>>42 #### 42
Step 1: Prepare dataset
-----------------------
.. code:: bash
cd examples/data_preprocess
python3 gsm8k.py --local_dir ~/data/gsm8k
Step 2: Download Model
----------------------
There are three ways to prepare the model checkpoints for post-training:
- Download the required models from huggingface or modelscope
.. code:: bash
huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct --local-dir-use-symlinks False
# or
modelscope download --model deepseek-ai/deepseek-math-7b-instruct --local_dir ~/models/deepseek-math-7b-instruct
- Use a model already stored in your local directory or HDFS path.
- Also, you can directly use the model name in huggingface (e.g.,
deepseek-ai/deepseek-math-7b-instruct) in
``actor_rollout_ref.model.path`` and ``critic.model.path`` field in
the run script. You can also download models from modelscope by setting environmental variable ``VERL_USE_MODELSCOPE=True``.
See examples/ppo_trainer/run_deepseek7b_llm_modelscope.sh for example.
Note that users should prepare checkpoints for the actor, critic and reward
model.
[Optional] Step 3: SFT your Model
---------------------------------
We provide a SFT Trainer using PyTorch FSDP in
`fsdp_sft_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/fsdp_sft_trainer.py>`_.
Users can customize their own SFT
script using our FSDP SFT Trainer.
We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft directory <https://github.com/volcengine/verl/blob/main/examples/sft/gsm8k/>`_.
.. code:: shell
set -x
torchrun -m verl.trainer.fsdp_sft_trainer \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.prompt_key=question \
data.response_key=answer \
data.micro_batch_size_per_gpu=8 \
model.partial_pretrain=deepseek-ai/deepseek-coder-6.7b-instruct \
trainer.default_hdfs_dir=hdfs://user/verl/experiments/gsm8k/deepseek-coder-6.7b-instruct/ \
trainer.project_name=gsm8k-sft \
trainer.experiment_name=gsm8k-sft-deepseek-coder-6.7b-instruct \
trainer.total_epochs=4 \
trainer.logger=['console','wandb']
If you use AMD GPUs (ROCm kernel), you need to add the following environment variables into the run script:
.. code-block:: bash
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
Step 4: Perform PPO training with your model on GSM8K Dataset
-------------------------------------------------------------
- Prepare your own run.sh script. Here's an example for GSM8k dataset
and deepseek-llm-7b-chat model.
- Users could replace the ``data.train_files``, ``data.val_files``,
``actor_rollout_ref.model.path`` and ``critic.model.path`` based on
their environment.
- See :doc:`config` for detailed explanation of each config field.
**Reward Model/Function**
We use a rule-based reward model. We force the model to produce a final
answer following four "#" characters, as shown in the solution. We extract the final
answer from both the solution and the model's output using regular
expression matching. We compare them and assign a reward of 1 for a correct
answer, 0.1 for an incorrect answer, and 0 when no answer is found.
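A minimal sketch of such a rule-based scorer is shown below. It is illustrative only; the actual implementation lives in ``verl/utils/reward_score/gsm8k.py`` and may differ in details.

.. code:: python

    import re

    def compute_score(solution_str: str, ground_truth: str) -> float:
        """Extract the answer after '####' and compare it with the ground truth."""
        match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
        if match is None:
            return 0.0                                  # no final answer found
        answer = match.group(1).replace(",", "")
        return 1.0 if answer == ground_truth else 0.1   # correct vs. incorrect answer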
**Training Script**
The training script examples for the FSDP and Megatron-LM backends are stored in the examples/ppo_trainer directory.
.. code:: bash
cd ../ppo_trainer
bash run_deepseek7b_llm.sh
The content of run_deepseek7b_llm.sh:
.. code:: bash
set -x
python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size_per_gpu=32 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_example_gsm8k' \
trainer.experiment_name='deepseek_llm_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=1 \
trainer.total_epochs=15 $@
If you use AMD GPUs (ROCm kernel), you need to add the following environment variables into the run script:
.. code-block:: bash
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES
If you encounter any issues running VeRL on AMD GPUs, feel free to contact me - `Yusheng Su <https://yushengsu-thu.github.io/>`_.
PPO Example Architecture
========================
Let's start with the Proximal Policy Optimization algorithm, which is
the most widely used algorithm in LLM post-training.
The main entry point of the PPO algorithm example is:
`main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py>`_.
In this tutorial, we will go through the code architecture in `main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py>`_.
Define the data
---------------
Users need to preprocess and store the dataset in parquet files,
and we implement ``RLHFDataset`` to load and tokenize the parquet files.
For ``RLHFDataset`` (the default), at least one field is required:
- ``prompt``: contains the string prompt (a minimal example follows below)
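A minimal parquet file satisfying this interface could, for instance, be produced as follows (the extra ``data_source`` column is optional here and is only used later for reward-function selection):

.. code:: python

    import pandas as pd

    # a toy dataset with the single required field plus an optional data_source tag
    df = pd.DataFrame({
        "prompt": [
            "Natalia sold clips to 48 of her friends in April... How many clips did she sell altogether?",
            "What is 3 + 5?",
        ],
        "data_source": ["openai/gsm8k", "openai/gsm8k"],
    })
    df.to_parquet("train.parquet")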
We already provide some examples of processing the datasets to parquet
files in `data_preprocess directory <https://github.com/volcengine/verl/blob/main/examples/data_preprocess>`_. Currently, we support
preprocessing of the GSM8k, MATH, HellaSwag and full_hh_rlhf datasets. See :doc:`../preparation/prepare_data` for
more information.
Define the reward functions for different datasets
--------------------------------------------------
In this main entry point, the users only need to define their own reward
function based on the datasets (or applications) utilized in PPO
training.
For example, we already provide reward functions for `GSM8k <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py>`_
and `MATH <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/math.py>`_
datasets in the ``_select_rm_score_fn``. In the ``RewardManager``, we
will compute the reward score based on the data_source to select
corresponding reward functions. For some RLHF datasets (e.g.,
full_hh_rlhf), the reward model is utilized to assess the responses
without any reward functions. In this case, the ``RewardManager`` will
return the ``rm_score`` computed by the reward model directly.
See `reward functions <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_ for detailed implementation.
Define worker classes
---------------------
.. code:: python
if config.actor_rollout_ref.actor.strategy == 'fsdp': # for FSDP backend
assert config.actor_rollout_ref.actor.strategy == config.critic.strategy
from verl.workers.fsdp_workers import ActorRolloutRefWorker, CriticWorker
from verl.single_controller.ray import RayWorkerGroup
ray_worker_group_cls = RayWorkerGroup
elif config.actor_rollout_ref.actor.strategy == 'megatron': # for Megatron backend
assert config.actor_rollout_ref.actor.strategy == config.critic.strategy
from verl.workers.megatron_workers import ActorRolloutRefWorker, CriticWorker
from verl.single_controller.ray.megatron import NVMegatronRayWorkerGroup
ray_worker_group_cls = NVMegatronRayWorkerGroup # Ray worker class for Megatron-LM
else:
raise NotImplementedError
from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role
role_worker_mapping = {
Role.ActorRollout: ActorRolloutRefWorker,
Role.Critic: CriticWorker,
Role.RefPolicy: ActorRolloutRefWorker
}
global_pool_id = 'global_pool'
resource_pool_spec = {
global_pool_id: [config.trainer.n_gpus_per_node] * config.trainer.nnodes,
}
mapping = {
Role.ActorRollout: global_pool_id,
Role.Critic: global_pool_id,
Role.RefPolicy: global_pool_id,
}
Step 1: Construct the mapping between roles and workers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A role represents a group of workers in the same process. We have
pre-defined several roles in `ray_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py#L38>`_.
.. code:: python
class Role(Enum):
"""
To create more roles dynamically, you can subclass Role and add new members
"""
Actor = 0 # This worker only has Actor
Rollout = 1 # This worker only has Rollout
ActorRollout = 2 # This worker has both actor and rollout, it's a HybridEngine
Critic = 3 # This worker only has critic
RefPolicy = 4 # This worker only has reference policy
RewardModel = 5 # This worker only has reward model
ActorRolloutRef = 6 # This worker contains actor, rollout and reference policy simultaneously
Step 2: Define the worker class corresponding to this role
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- We have pre-implemented the ``ActorRolloutRefWorker``. Through
different configs, it can be a standalone actor, a standalone rollout,
an ActorRollout HybridEngine, or an ActorRolloutRef HybridEngine
- We also pre-implemented workers for ``Actor``, ``Rollout``,
``Critic``, ``Reward Model`` and ``Reference model`` on two different
backends: PyTorch FSDP
and Megatron-LM.
See `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_
and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/workers/megatron_workers.py>`_
for more information.
Step 3: Define resource pool id and resource pool spec
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Resource pool is a division of global GPU resources,
``resource_pool_spec`` is a dict, mapping from id to # of GPUs
- In the above example, we defined a global resource pool:
global_pool_id, and then put all roles on this one resource pool
with all the GPUs in this post-training task. This refers to
*co-locate* placement where all the models share the same set of
GPUs.
- See resource pool and placement for advanced usage; a sketch of a split placement follows below.
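For illustration, a hypothetical split placement (pool names and GPU counts are made up) that gives the actor/rollout and the critic disjoint GPU pools could look like:

.. code:: python

    actor_pool_id = 'actor_pool'
    critic_pool_id = 'critic_pool'

    # 4 GPUs per node for each pool, instead of one shared global pool
    resource_pool_spec = {
        actor_pool_id: [4] * config.trainer.nnodes,
        critic_pool_id: [4] * config.trainer.nnodes,
    }
    mapping = {
        Role.ActorRollout: actor_pool_id,
        Role.RefPolicy: actor_pool_id,
        Role.Critic: critic_pool_id,
    }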
Defining reward model/function
------------------------------
.. code:: python
# we should adopt a multi-source reward function here
# - for rule-based rm, we directly call a reward score
# - for model-based rm, we call a model
# - for code related prompt, we send to a sandbox if there are test cases
# - finally, we combine all the rewards together
# - The reward type depends on the tag of the data
if config.reward_model.enable:
from verl.workers.fsdp_workers import RewardModelWorker
role_worker_mapping[Role.RewardModel] = RewardModelWorker
mapping[Role.RewardModel] = global_pool_id
reward_fn = RewardManager(tokenizer=tokenizer, num_examine=0)
# Note that we always use function-based RM for validation
val_reward_fn = RewardManager(tokenizer=tokenizer, num_examine=1)
resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)
Since not all tasks use model-based RM, users need to define here
whether it is a model-based RM or a function-based RM:
- If it's a model-based RM, directly add the ``RewardModel`` role in the
resource mapping and add it to the resource pool mapping.
- Note that the pre-defined ``RewardModelWorker`` only supports models
with the structure of huggingface
``AutoModelForSequenceClassification``. If it's not this model, you
need to define your own RewardModelWorker in `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_
and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/workers/megatron_workers.py>`_.
- If it's a function-based RM, the users are required to specify the
reward function for each dataset.
.. code:: python
def _select_rm_score_fn(data_source):
if data_source == 'openai/gsm8k':
return gsm8k.compute_score
elif data_source == 'lighteval/MATH':
return math.compute_score
else:
raise NotImplementedError
See reward functions implemented in `directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/>`_
for more information.
Define, init and run the PPO Trainer
------------------------------------
.. code:: python
trainer = RayPPOTrainer(config=config,
tokenizer=tokenizer,
role_worker_mapping=role_worker_mapping,
resource_pool_manager=resource_pool_manager,
ray_worker_group_cls=ray_worker_group_cls,
reward_fn=reward_fn,
val_reward_fn=val_reward_fn)
trainer.init_workers()
trainer.fit()
- We first initialize the ``RayPPOTrainer`` with user config, tokenizer
and all the above worker mapping, resource pool, worker group and
reward functions
- We then call ``trainer.init_workers()`` to initialize the models
on the allocated GPUs (in the resource pool)
- The actual PPO training will be executed in ``trainer.fit()``
verl can be easily extended to other RL algorithms by reusing the Ray
model workers, resource pool and reward functions. See :doc:`extension<../advance/dpo_extension>` for
more information.
Details of the ``RayPPOTrainer`` are discussed in :doc:`Ray Trainer<../workers/ray_trainer>`.
.. _algo-baseline-page:
Algorithm Baselines
===================
Datasets
------------------
We assume the GSM8k and MATH datasets have been preprocessed via ``python3 examples/data_preprocess/*.py``.
Refer to the table below to reproduce RL training from different pre-trained models.
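For example, following the preprocessing scripts shown earlier in these docs:

.. code-block:: bash

    python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
    python3 examples/data_preprocess/math_dataset.py --local_dir ~/data/math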
NVIDIA GPUs
--------------------------------
.. _Huggingface: https://huggingface.co/google/gemma-2-2b-it#benchmark-results
.. _SFT Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _wandb: https://api.wandb.ai/links/verl-team/h7ux8602
.. _Qwen Blog: https://qwenlm.github.io/blog/qwen2.5-llm/
.. _PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log
.. _Megatron PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/deepseek-llm-7b-chat-megatron-bsz256_4-prompt512-resp512-0.695.log
.. _Qwen7b GRPO Script: https://github.com/volcengine/verl/blob/a65c9157bc0b85b64cd753de19f94e80a11bd871/examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
.. _Megatron wandb: https://wandb.ai/verl-team/verl_megatron_gsm8k_examples/runs/10fetyr3
.. _Qwen7b ReMax Script: https://github.com/eric-haibin-lin/verl/blob/main/examples/remax_trainer/run_qwen2.5-3b_seq_balance.sh
.. _Qwen7b ReMax Wandb: https://wandb.ai/liziniu1997/verl_remax_example_gsm8k/runs/vxl10pln
.. _Qwen0.5b PRIME Script: https://github.com/volcengine/verl/blob/main/recipe/prime/run_prime_qwen.sh
.. _Qwen0.5b PRIME Wandb: https://api.wandb.ai/links/zefan-wang-thu-tsinghua-university/rxd1btvb
.. _Megatron Qwen2 7b GRPO Script with Math and GSM8k: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/qwen2-7b_math_megatron.log
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+==================================+========================+============+===============================================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and Logs`_, `wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | `Qwen0.5b PRIME Script`_, `Qwen0.5b PRIME Wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1]_ | `Megatron PPO Command and Logs`_, `Megatron wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2-7B-Instruct | GRPO | 89 | `Qwen7b GRPO Script`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | `Megatron Qwen2 7b GRPO Script with Math and GSM8k`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | `Qwen7b ReMax Script`_, `Qwen7b ReMax Wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
AMD GPUs (MI300)
--------------------------------
.. _ppo_run_deepseek7b_llm.sh: https://github.com/yushengsu-thu/verl_training_log/blob/main/gsm8k/ppo_run_deepseek7b_llm.log
.. _grpo_run_deepseek7b_llm.sh: https://github.com/yushengsu-thu/verl_training_log/blob/main/gsm8k/grpo_run_deepseek7b_llm.log
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+==================================+========================+============+===============================================================================================+
| deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1]_ | `ppo_run_deepseek7b_llm.sh`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1]_ | `grpo_run_deepseek7b_llm.sh`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
.. [1] During the evaluation, we only extracted answers following the "####" format. A more flexible answer extraction, a longer response length and better prompt engineering may lead to a higher score.
Frequently Asked Questions
====================================
Ray related
------------
How to add a breakpoint for debugging with distributed Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please check out the official debugging guide from Ray: https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html
Distributed training
------------------------
How to run multi-node post-training with Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can start a ray cluster and submit a ray job, following the official guide from Ray: https://docs.ray.io/en/latest/ray-core/starting-ray.html
Then in the configuration, set the ``trainer.nnodes`` config to the number of machines for your job.
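A minimal sketch (addresses, ports and GPU counts below are placeholders): start Ray on the head node, attach the worker nodes, then launch the trainer from a node attached to the cluster.

.. code:: bash

    # on the head node
    ray start --head --port=6379

    # on each worker node
    ray start --address=<head_node_ip>:6379

    # from any node attached to the cluster, add your usual data/model overrides as well
    python3 -m verl.trainer.main_ppo \
        trainer.nnodes=2 \
        trainer.n_gpus_per_node=8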
How to use verl on a Slurm-managed cluster?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ray provides users with `this <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ official
tutorial to start a Ray cluster on top of Slurm. We have verified the :doc:`GSM8K example<../examples/gsm8k_example>`
on a Slurm cluster under a multi-node setting with the following steps.
1. [Optional] If your cluster supports `Apptainer or Singularity <https://apptainer.org/docs/user/main/>`_ and you wish
to use it, convert verl's Docker image to an Apptainer image. Alternatively, set up the environment with the package
manager available on your cluster or use other container runtimes (e.g. through `Slurm's OCI support <https://slurm.schedmd.com/containers.html>`_) available to you.
.. code:: bash
apptainer pull /your/dest/dir/vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3.sif docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
2. Follow :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints.
3. Modify `examples/slurm/ray_on_slurm.slurm <https://github.com/volcengine/verl/blob/main/examples/slurm/ray_on_slurm.slurm>`_ with your cluster's own information.
4. Submit the job script to the Slurm cluster with `sbatch`.
Please note that Slurm cluster setup may vary. If you encounter any issues, please refer to Ray's
`Slurm user guide <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ for common caveats.
If you changed Slurm resource specifications, please make sure to update the environment variables in the job script if necessary.
Illegal memory access
---------------------------------
If you encounter an error message like ``CUDA error: an illegal memory access was encountered`` during rollout, it is most likely due to a known issue in vLLM.
Please set the following environment variable. If you use ``ray start``, the environment variable must be set before that command.
.. code:: bash
export VLLM_ATTENTION_BACKEND=XFORMERS
If in doubt, print this env var in each rank to make sure it is properly set.
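One quick way to do this is a throwaway Ray task (a sketch assuming the cluster is already running):

.. code:: python

    import os
    import ray

    ray.init(address="auto")

    @ray.remote
    def get_backend():
        return os.environ.get("VLLM_ATTENTION_BACKEND")

    # launch a few tasks and print the value each worker process sees
    print(ray.get([get_backend.remote() for _ in ray.nodes()]))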
Checkpoints
------------------------
If you want to convert the model checkpoint into the huggingface safetensors format, please refer to ``scripts/model_merger.py``.
Triton ``compile_module_from_src`` error
------------------------------------------------
If you encounter triton compilation error similar to the stacktrace below, please set the ``use_torch_compile`` flag according to
https://verl.readthedocs.io/en/latest/examples/config.html to disable just-in-time compilation for fused kernels.
.. code:: bash
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 338, in run
return self.fn.run(*args, **kwargs)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
device = driver.active.get_current_device()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
File "/data/lbh/conda_envs/verl/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
=========================================================
HybridFlow Programming Guide
=========================================================
.. _vermouth: https://github.com/vermouth1992
Author: `Chi Zhang <https://github.com/vermouth1992>`_
verl is an open source implementation of the paper `HybridFlow <https://arxiv.org/abs/2409.19256v2>`_ [1]_. In this section, we will introduce the basic concepts of HybridFlow, the motivation and how to program with verl APIs.
Motivation and Design
------------------------
We use dataflow to represent RL systems [4]_.
DataFlow
~~~~~~~~~~~~~~~~~~~~
Dataflow is an abstraction of computations. Neural network training is a typical dataflow. It can be represented by a computational graph.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/dataflow.jpeg?raw=true
:alt: The dataflow graph from CS231n 2024 lecture 4
This figure [2]_ represents the computation graph of a polynomial function followed by a sigmoid function. In the data flow of neural network computation, each node represents an operator, and each edge represents the direction of forward/backward propagation. The computation graph determines the architecture of the neural network.
RL as a dataflow problem
++++++++++++++++++++++++++++++++++++++++++++++
Reinforcement learning (RL) training can also be represented as a dataflow. Below is the dataflow graph that represents the PPO algorithm used in RLHF [3]_:
.. image:: https://picx.zhimg.com/70/v2-cb8ab5ee946a105aab6a563e92682ffa_1440w.avis?source=172ae18b&biz_tag=Post
:alt: PPO dataflow graph, credit to Zhihu 低级炼丹师
However, the dataflow of RL has fundamental differences compared with the dataflow of neural network training, as follows:
+--------------------------+--------------------------------------------------+---------------------+
| Workload | Node | Edge |
+--------------------------+--------------------------------------------------+---------------------+
| Neural Network Training | Operator (+/-/matmul/softmax) | Tensor movement |
+--------------------------+--------------------------------------------------+---------------------+
| Reinforcement Learning | High-level operators (rollout/model forward) | Data Movement |
+--------------------------+--------------------------------------------------+---------------------+
In the case of tabular reinforcement learning, each operator is a simple scalar math operation (e.g., a Bellman update). In deep reinforcement learning (DRL), each operator is a high-level neural network computation such as model inference/update. This makes RL a two-level dataflow problem:
- Control flow: defines how the high-level operators are executed (e.g., In PPO, we first perform rollout. Then, we perform advantage computation. Finally, we perform training). It expresses the **core logics of RL algorithms**.
- Computation flow: defines the dataflow of **neural network computation** (e.g., model forward/backward/optimizer).
Design Choices
~~~~~~~~~~~~~~~~~~~~
The model size used in DRL before the LLM era is typically small. Thus, the high-level neural network computation can be done in a single process. This enables embedding the computation flow inside the control flow as a single process.
However, in the LLM era, the computation flow (e.g., training neural network) becomes a multi-process program. This naturally leads to two design choices:
1. Convert the control flow into a multi-process program as well, and then colocate it with the computation flow (unified multi-controller)
- Advantages:
- Achieves the **optimal performance** under fixed computation flow and control flow as the communication overhead in both training and data transfer is minimized.
- Disadvantages:
- The computation and/or control flow is **hard to reuse** from a software perspective, as the computation code is coupled with specific controller code. For example, the training loop of PPO is generic. Say we have a PPO training flow implemented with a specific computation flow such as FSDP. Neither the control flow nor the computation flow can be reused if we want to switch the computation flow from FSDP to Megatron, due to the coupling of control and computation flows.
- Requires more effort from the user under flexible and dynamic control flows, due to the multi-process nature of the program.
2. Separate the flows: single process for the control flow and multi-process for computation flow
- Advantages:
- The computation flow defined elsewhere can be **easily reused** after the decoupling.
- The controller runs on a single process. Implementing a new RL algorithm with a **different control flow is simple and easy**.
- Disadvantages:
- Additional **data communication overhead** each time the controller process and computation processes interact. The data has to be sent back and forth.
In verl, the latter strategy with separate control flow and computation flow is adopted. verl is designed to decouple the control flow of RL algorithms from the implementation of computation engines.
Overall Execution Diagram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Below is a simplified diagram denoting the execution of a reinforcement learning job. In the diagram, the controller runs on a single process, while the generator/actor workers and critic workers run on multiple processes, placed on specific resource groups. For rollout, the controller passes the data to the generator to perform sample generation. When the rollout is done, the data is passed back to the controller for the next step of the algorithm. Similar execution is done for other workers. With the hybrid controller design, the data flow and computation are decoupled to provide both efficiency in computation and flexibility in defining algorithm training loops.
.. figure:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/driver_worker.png?raw=true
:alt: The execution diagram
Codebase walkthrough (PPO)
------------------------------------------------
Entry function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Code: https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py
In this file, we define a remote function ``main_task`` that serves as the controller (driver) process, as shown in the figure above. We also define a ``RewardManager``, where users can customize their reward function based on the data source in the dataset. Note that ``RewardManager`` should return the final token-level reward that is optimized by RL algorithms, and that users can combine model-based rewards and rule-based rewards.
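As a sketch of what "token-level reward" means here (shapes and names are illustrative, not the exact ``RewardManager`` code): the scalar score of each sample is typically placed at its last response token, with zeros elsewhere.

.. code-block:: python

    import torch

    # a batch of 4 responses padded to length 8, with one scalar score per sample
    valid_response_length = torch.tensor([5, 8, 3, 6])
    scores = torch.tensor([1.0, 0.1, 0.0, 1.0])

    token_level_reward = torch.zeros(4, 8)
    token_level_reward[torch.arange(4), valid_response_length - 1] = scores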
The ``main_task`` constructs a RayPPOTrainer instance and launches the fit. Note that ``main_task`` **runs as a single process**.
We highly recommend that ``main_task`` is NOT scheduled on the head node of the Ray cluster, because ``main_task`` consumes a lot of memory while the head node usually has very few resources.
Ray trainer
~~~~~~~~~~~~~~~~~~~~
Code: https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py
The RayPPOTrainer manages:
- Worker and WorkerGroup construction
- The main loop of the PPO algorithm
Note that the fit function of RayPPOTrainer **runs as a single process**.
Worker and WorkerGroup construction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each WorkerGroup manages a list of workers that run remotely. Note that the worker group runs in the process of its constructor.
Each worker inside the WorkerGroup runs on a GPU. The worker group serves as a proxy for the controller process to interact with a list of workers, in order to perform certain computations. **In order to do so, we have to bind the methods of the worker into the method of the WorkerGroup and define the data dispatch and data collection**. This is done via simple decoration that will be introduced in the Worker definition section.
For example, in PPO, we define 3 worker groups:
- ActorRolloutRef: manages the actor, rollout and reference policy. ActorRolloutRefWorker can be instantiated as a single actor, a single rollout, a single reference policy, a combined actor/rollout or a combined actor/rollout/ref. This design aims for maximum code reuse in various scenarios. The reason for colocating the actor and rollout is fast weight transfer using NCCL. The reason for colocating the actor and reference is to implement an efficient LoRA PPO, as the reference policy is simply the base model of PPO in LoRA.
- Critic: manages the critic model
- Reward: manages the reward model
The worker group will be constructed on the resource pool it designates. The resource pool is a set of GPUs in the ray cluster.
Worker definition
~~~~~~~~~~~~~~~~~~~~
.. _ActorRolloutRefWorker: https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py
We take `ActorRolloutRefWorker <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py>`_ for an example.
The APIs it should expose to the controller process are:
- ``init_model``: build the underlying model
- ``generate_sequences``: given prompts, generate responses
- ``compute_log_prob``: compute the log-probability of a generated sequence using the actor
- ``compute_ref_log_prob``: compute the log-probability of a generated sequence using the reference policy
- ``save_checkpoint``: save the checkpoint
Note that these methods are defined in the worker and can only be invoked via remote calls. For example, if the controller process wants to initialize the model, it has to call
.. code-block:: python

   for worker in actor_rollout_ref_wg:
       worker.init_model.remote()
If the controller process wants to generate sequences, it has to call
.. code-block:: python

   import ray
   import torch

   data = xxx  # a batch of prompts prepared on the controller
   # split the data into dp chunks
   data_dp_lst = data.split(dp_size)
   output_dp_lst = []
   for i, worker in enumerate(actor_rollout_ref_wg):
       output_future = worker.generate_sequences.remote(data_dp_lst[i])
       output_dp_lst.append(output_future)
   # wait for all workers, then concatenate along the batch dimension
   output = torch.cat(ray.get(output_dp_lst), dim=0)
We observe that a call from the controller process into the worker group can generally be divided into 3 steps:

- Split the data into data-parallel chunks
- Dispatch the corresponding chunk to each worker
- Collect and concatenate the outputs when the computation finishes
In verl, we provide syntactic sugar that encapsulates these 3 steps into a single call from the controller process.
.. code-block:: python

   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
   def generate_sequences(self, data):
       ...

   # on the driver
   output = actor_rollout_ref_wg.generate_sequences(data)
We decorate the worker method with ``register``, which explicitly defines how the input data should be split and dispatched to each worker, and how the output data should be collected and concatenated by the controller. For example, ``Dispatch.DP_COMPUTE_PROTO`` splits the input data into dp chunks, dispatches one chunk to each worker, then collects and concatenates the outputs. Note that this mode requires the input and output to be a ``DataProto``, defined in `protocol.py <https://github.com/volcengine/verl/blob/main/verl/protocol.py>`_.
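To build intuition for what this syntactic sugar does, below is a heavily simplified, self-contained sketch of a ``register``-style decorator and a toy worker-group proxy. It is illustrative only and not verl's actual implementation; the real ``register``/``Dispatch`` machinery also handles method binding, multiple dispatch modes and ``DataProto`` handling.

.. code-block:: python

   import ray

   def register(dispatch_mode):
       """Attach dispatch metadata to a worker method (simplified sketch)."""
       def decorator(func):
           func.dispatch_mode = dispatch_mode  # read later when the WorkerGroup binds methods
           return func
       return decorator

   class ToyWorkerGroup:
       """Toy proxy: split the input, call every worker remotely, then merge the outputs."""

       def __init__(self, workers, split_fn, collect_fn):
           self.workers = workers        # list of Ray actor handles
           self.split_fn = split_fn      # e.g. splits a batch into len(workers) chunks
           self.collect_fn = collect_fn  # e.g. concatenates outputs along the batch dimension

       def call_dp(self, method_name, data):
           # in verl, split_fn/collect_fn are chosen from the decorated method's dispatch_mode
           chunks = self.split_fn(data, len(self.workers))       # 1. split into dp chunks
           futures = [getattr(w, method_name).remote(chunk)      # 2. dispatch one chunk per worker
                      for w, chunk in zip(self.workers, chunks)]
           return self.collect_fn(ray.get(futures))              # 3. collect and merge the results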
PPO main loop
~~~~~~~~~~~~~~~~~~~~
With the aforementioned APIs, we can implement the main loop of PPO as if it were a single-process program:
.. code-block:: python

   for prompt in dataloader:
       output = actor_rollout_ref_wg.generate_sequences(prompt)
       old_log_prob = actor_rollout_ref_wg.compute_log_prob(output)
       ref_log_prob = actor_rollout_ref_wg.compute_ref_log_prob(output)
       values = critic_wg.compute_values(output)
       rewards = reward_wg.compute_scores(output)
       # compute_advantages runs directly on the controller process
       advantages = compute_advantages(values, rewards)
       output = output.union(old_log_prob)
       output = output.union(ref_log_prob)
       output = output.union(values)
       output = output.union(rewards)
       output = output.union(advantages)
       # update actor and critic
       actor_rollout_ref_wg.update_actor(output)
       critic_wg.update_critic(output)
Takeaways
~~~~~~~~~~~~~~~~~~~~
- This programming paradigm enables users to switch to a different computation backend without modifying the control process.
- This programming paradigm enables flexible placement (by changing the mapping between WorkerGroups and ResourcePools) without modifying the control process.
Repository organization
------------------------------------------------
Important code files in the repository are organized as follows:
.. code-block:: bash

   verl  # the verl package
     trainer
       main_ppo.py  # the entrypoint for RL training
       ppo
         ray_trainer.py  # the training loop for RL algorithms such as PPO
       fsdp_sft_trainer.py  # the SFT trainer with FSDP backend
     config
       generation.yaml  # configuration template for rollout
       ppo_trainer.yaml  # configuration template for the RL trainer
     workers
       protocol.py  # the interface of DataProto
       fsdp_workers.py  # the FSDP worker interfaces: ActorRolloutRefWorker, CriticWorker, RewardModelWorker
       megatron_workers.py  # the Megatron worker interfaces: ActorRolloutRefWorker, CriticWorker, RewardModelWorker
       actor
         dp_actor.py  # data parallel actor with FSDP backend
         megatron_actor.py  # nD parallel actor with Megatron backend
       critic
         dp_critic.py  # data parallel critic with FSDP backend
         megatron_critic.py  # nD parallel critic with Megatron backend
       reward_model
         megatron
           reward_model.py  # reward model with Megatron backend
       rollout
         vllm
           vllm_rollout.py  # rollout with vllm backend
         hf_rollout.py  # rollout with huggingface TGI backend
       sharding_manager
         fsdp_ulysses.py  # data and model resharding when using FSDP + ulysses
         fsdp_vllm.py  # data and model resharding when using FSDP + ulysses + vllm
         megatron_vllm.py  # data and model resharding when using Megatron + vllm
     utils
       dataset  # datasets for SFT/RM/RL
       reward_score  # function-based rewards
         gsm8k.py  # reward function for the gsm8k dataset
         math.py  # reward function for the MATH dataset
       seqlen_balancing.py  # the sequence balance optimization
     models
       llama  # Megatron implementation for llama, deepseek, mistral, etc.
       transformers  # ulysses integration with transformer models such as llama, qwen, etc.
       weight_loader_registery.py  # registry of weight loaders for loading hf checkpoints into Megatron
     third_party
       vllm  # adaptor for vllm's usage in RL
         vllm_v_0_6_3  # vllm v0.6.3 adaptor
           llm.py  # entrypoints for generate, sync_model_weight, offload_model_weights
           parallel_state.py  # vllm related device mesh and process groups
           dtensor_weight_loaders.py  # weight loader for huggingface models with FSDP
           megatron_weight_loaders.py  # weight loader for Megatron models
         vllm_spmd  # vllm >= v0.7 adaptor (coming soon)
   examples  # example scripts
   tests  # integration and unit tests
   .github  # the configuration of continuous integration tests
Welcome to verl's documentation!
================================================
.. _hf_arxiv: https://arxiv.org/pdf/2409.19256
verl is a flexible, efficient and production-ready RL training framework designed for post-training large language models (LLMs). It is an open-source implementation of the `HybridFlow <hf_arxiv_>`_ paper.
verl is flexible and easy to use with:
- **Easy extension of diverse RL algorithms**: The hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in a few lines of code.
- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.
- **Flexible device mapping and parallelism**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.
- Ready integration with popular HuggingFace models
verl is fast with:
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, verl achieves high generation and training throughput.
- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
--------------------------------------------
.. _Contents:
.. toctree::
:maxdepth: 5
:caption: Quickstart
start/install
start/quickstart
start/multinode
.. toctree::
:maxdepth: 4
:caption: Programming guide
hybrid_flow
.. toctree::
:maxdepth: 5
:caption: Data Preparation
preparation/prepare_data
preparation/reward_function
.. toctree::
:maxdepth: 5
:caption: Configurations
examples/config
.. toctree::
:maxdepth: 2
:caption: PPO Example
examples/ppo_code_architecture
examples/gsm8k_example
.. toctree::
:maxdepth: 1
:caption: PPO Trainer and Workers
workers/ray_trainer
workers/fsdp_workers
workers/megatron_workers
workers/sglang_worker
.. toctree::
:maxdepth: 1
:caption: Performance Tuning Guide
perf/perf_tuning
README_vllm0.8.md
perf/device_tuning
.. toctree::
:maxdepth: 1
:caption: Experimental Results
experiment/ppo
.. toctree::
:maxdepth: 1
:caption: Advance Usage and Extension
advance/placement
advance/dpo_extension
advance/fsdp_extension
advance/megatron_extension
advance/checkpoint
.. toctree::
:maxdepth: 1
:caption: API References
data.rst
.. toctree::
:maxdepth: 1
:caption: FAQ
faq/faq
Contribution
-------------
verl is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub <https://github.com/volcengine/verl>`_, `Slack <https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA>`_ and `Wechat <https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG>`_ for discussions.
Code formatting
^^^^^^^^^^^^^^^^^^^^^^^^
We use yapf (Google style) to enforce strict code formatting when reviewing PRs. Run yapf at the top level of the verl repo:
.. code-block:: bash

   pip3 install yapf
   yapf -ir -vv --style ./.style.yapf verl examples tests
Adding CI tests
^^^^^^^^^^^^^^^^^^^^^^^^
If possible, please add CI test(s) for your new feature:
1. Find the most relevant workflow yml file, which usually corresponds to a ``hydra`` default config (e.g. ``ppo_trainer``, ``ppo_megatron_trainer``, ``sft_trainer``, etc).
2. Add related path patterns to the ``paths`` section if not already included.
3. Minimize the workload of the test script(s) (see existing scripts for examples).
Implement Reward Function for Dataset
======================================
For each dataset, we need to implement a reward function or utilize a reward model to compute the rewards for the generated responses.
We already pre-implemented some reward functions in `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
You can also use customized reward functions.
Currently, we support reward functions for the GSM8k and MATH datasets. For RLHF datasets (e.g.,
full_hh_rlhf) and code generation datasets (e.g., APPS), we utilize a reward model
and a sandbox (to be open-sourced soon) for evaluation, respectively.
RewardManager
-------------
In the entrypoint of the PPO post-training script `main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py#L33>`_,
we implement a ``RewardManager`` that utilizes the pre-implemented reward functions to compute the score for each response.
Specifically, the ``RewardManager`` implements a ``__call__`` function, and all reward functions are executed through ``compute_score_fn``.
The input is a ``DataProto``, which includes:
- ``input_ids``, ``attention_mask``: the ``input_ids`` and ``attention_mask`` after applying the
  chat template, covering both prompt and response
- ``responses``: the response tokens
- ``ground_truth``: the ground-truth string of the current prompt, stored in the
  ``non_tensor_batch`` of the ``DataProto``; it should be preprocessed into the parquet files
- ``data_source``: the dataset name of the current prompt, stored in the
  ``non_tensor_batch`` of the ``DataProto``; it should be preprocessed into the parquet files
After detokenizing the responses, the response string and the ground-truth string are passed to
``compute_score_fn`` to compute the score for each response.
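To make this flow concrete, the following is a minimal, illustrative sketch of such a ``__call__``; it is not the actual ``RewardManager`` code. The tokenizer attribute and the placement of the scalar score on the last response position are simplifications (the real implementation places the score on the last non-padded response token).

.. code-block:: python

   import torch

   class SketchRewardManager:
       def __init__(self, tokenizer, compute_score_fn):
           self.tokenizer = tokenizer
           self.compute_score_fn = compute_score_fn  # e.g. a gsm8k- or math-style scorer

       def __call__(self, data):
           responses = data.batch['responses']
           # token-level reward tensor with the same shape as the responses
           reward_tensor = torch.zeros_like(responses, dtype=torch.float32)
           for i in range(responses.shape[0]):
               solution_str = self.tokenizer.decode(responses[i], skip_special_tokens=True)
               ground_truth = data.non_tensor_batch['ground_truth'][i]
               data_source = data.non_tensor_batch['data_source'][i]
               score = self.compute_score_fn(data_source, solution_str, ground_truth)
               reward_tensor[i, -1] = score  # simplification: last position instead of last valid token
           return reward_tensor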
Reward Functions
----------------
Pre-implemented
~~~~~~~~~~~~~~~
We already pre-implemented some reward functions in `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
- In the `GSM8k example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py>`_, we
  force the response to output the final answer after four ``#`` (i.e. ``####``), then
  use string matching to compare it with the ground truth. A completely correct answer
  scores 1 point; a correct format with a wrong answer scores 0.1 points; an incorrect
  format scores 0 points. A simplified sketch of this scoring rule is shown after this list.
- In the `MATH example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/math.py>`_, we follow
the implementation in `lm-evaluation-harness repository <https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hendrycks_math/utils.py>`_.
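As a rough illustration of the GSM8k rule described above, a simplified scorer could look like the sketch below. The regex, the scoring constants and the function name are illustrative; the actual ``gsm8k.py`` differs in its details.

.. code-block:: python

   import re

   def gsm8k_score_sketch(solution_str, ground_truth, format_score=0.1, full_score=1.0):
       """Extract the answer after '####' and string-match it against the ground truth."""
       match = re.search(r"####\s*([\-0-9\.,]+)", solution_str)
       if match is None:
           return 0.0  # the response does not follow the required format
       answer = match.group(1).replace(",", "").strip()
       if answer == str(ground_truth).strip():
           return full_score   # exactly correct
       return format_score     # correct format but wrong answer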
Customized
~~~~~~~~~~
You can implement customized reward functions in a separate file and specify them using ``custom_reward_function.path`` and ``custom_reward_function.name``. For details of these settings, please refer to :ref:`config-explain-page`.
The parameters of your reward function should be ``data_source``, ``solution_str``, ``ground_truth``, and ``extra_info``.
For example:
.. code:: python

   def my_reward_fn(data_source, solution_str, ground_truth, extra_info=None):
       return len(solution_str) / 100
If you are testing only a single customized reward function, you can simply name it ``compute_score`` and leave ``custom_reward_function.name`` unset.
To run multiple tests with different customized reward functions, you can modify both ``custom_reward_function.path`` and ``custom_reward_function.name`` for each trial.
For instance, you might create a single ``my_reward.py`` file and implement multiple reward functions within it. This way, for different trials you only need to adjust ``custom_reward_function.name``, making it more convenient to conduct multiple tests within scripts.
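For illustration, a hypothetical ``my_reward.py`` with two selectable reward functions might look like this; you would point ``custom_reward_function.path`` at the file and switch ``custom_reward_function.name`` between ``length_reward`` and ``keyword_reward`` across trials.

.. code-block:: python

   # my_reward.py -- hypothetical example with two selectable reward functions

   def length_reward(data_source, solution_str, ground_truth, extra_info=None):
       """Reward longer responses, capped at 1.0."""
       return min(len(solution_str) / 100.0, 1.0)

   def keyword_reward(data_source, solution_str, ground_truth, extra_info=None):
       """Reward responses that contain the ground-truth string."""
       return 1.0 if str(ground_truth) in solution_str else 0.0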
# markdown support
recommonmark
# markdown table support
sphinx-markdown-tables
# theme default rtd
# crate-docs-theme
sphinx-rtd-theme
# pin tokenizers version to avoid env_logger version req
tokenizers==0.19.1