Unverified commit 4ee847e4, authored by Aaron Hao and committed by GitHub

Comment fix for async rl example (#35244)


Signed-off-by: hao-aaron <ahao@anyscale.com>
parent 040a505f
@@ -2,25 +2,38 @@
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
 Demonstrates async reinforcement learning using vLLM and Ray,
-with native weight syncing APIs at engine instance.
+with native weight syncing APIs and batch-invariant generation.
 
 The script separates training and inference workloads onto distinct GPUs
 so that Ray can manage process placement and inter-process communication.
-A Hugging Face Transformer model occupies one GPU for training, whereas a
-2x tensor-parallel vLLM inference engine occupies two GPUs.
+A Hugging Face Transformer model occupies one GPU for training, and a
+vLLM AsyncLLMEngine occupies another GPU for inference.
+
+Batch invariance is enabled so that generation output is deterministic
+regardless of how many requests are batched together. This is required
+for the validation phase to succeed. Batch invariance currently requires
+NVIDIA GPUs with compute capability 9.0 or higher:
+- H-series: H100, H200
+- B-series: B100, B200
 
 The example performs the following steps:
-* Load the training model on one gpu (scheduled via ray)
-* Initialize the inference model with dummy weights across
-two gpus using vLLM's tensor parallelism and Ray placement groups.
-* Generate gibberish from a list of prompts using the randomly initialized
-inference engine.
-* Pause generation once generation completes for one sequence
-* Update the weights of the training model and broadcast the updated weights
-to the inference engine by using a Ray collective RPC group.
-* Resume generation and print out the results
+* Load the training model (Qwen3-1.7B) on one GPU via a Ray actor.
+* Initialize the inference engine with a base model (Qwen3-1.7B-Base)
+on a separate GPU using vLLM's AsyncLLMEngine with Ray as the
+distributed executor backend.
+* Set up an NCCL-based weight transfer channel between the trainer
+and the inference engine.
+* Submit generation requests for a batch of prompts.
+* Pause generation once any request reaches a token threshold.
+* Broadcast the training model's weights to the inference engine
+via the NCCL weight transfer engine, replacing the base weights.
+* Resume generation and collect results, noting which tokens were
+generated before vs. after the weight swap.
+* Validate correctness by launching a fresh vLLM instance loaded
+directly with the training model and comparing its output to the
+post-swap tokens from the weight-synced engine.
 
-This example assumes a single-node cluster with three GPUs, but Ray
+This example assumes a single-node cluster with two GPUs, but Ray
 supports multi-node clusters. vLLM expects the GPUs are only used for vLLM
 workloads. Residual GPU activity interferes with vLLM memory profiling and
 causes unexpected behavior.
...
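The pause, swap, and resume steps listed in the updated docstring can be illustrated with a toy asyncio simulation. This is only a sketch of the control flow, not the example script's code and not the vLLM API: `ToyEngine`, `run_with_weight_swap`, `split_at_swap`, `TOKEN_THRESHOLD`, and `MAX_TOKENS` are hypothetical names, and each "token" is simply the label of the weights that produced it, standing in for real model output and NCCL transfer.

```python
import asyncio

TOKEN_THRESHOLD = 3  # hypothetical: pause once any request has this many tokens
MAX_TOKENS = 6       # hypothetical: tokens generated per request


class ToyEngine:
    """Stand-in for an inference engine. NOT the vLLM API."""

    def __init__(self):
        self.weights_version = "base"
        self._running = asyncio.Event()
        self._running.set()

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    def load_weights(self, version):
        # Stands in for receiving weights broadcast by the trainer.
        self.weights_version = version

    async def generate(self, max_tokens, on_token):
        tokens = []
        for _ in range(max_tokens):
            await self._running.wait()           # blocks while paused
            tokens.append(self.weights_version)  # record which weights made this token
            on_token(len(tokens))
            await asyncio.sleep(0)               # let other requests make progress
        return tokens


async def run_with_weight_swap(num_requests):
    engine = ToyEngine()
    threshold_hit = asyncio.Event()

    def on_token(count):
        # Pause only once, the first time any request reaches the threshold.
        if count >= TOKEN_THRESHOLD and not threshold_hit.is_set():
            engine.pause()
            threshold_hit.set()

    tasks = [
        asyncio.create_task(engine.generate(MAX_TOKENS, on_token))
        for _ in range(num_requests)
    ]
    await threshold_hit.wait()
    engine.load_weights("trained")  # swap weights while generation is paused
    engine.resume()
    return await asyncio.gather(*tasks)


def split_at_swap(tokens):
    """Split one request's tokens into (pre-swap, post-swap) parts."""
    n_pre = sum(1 for t in tokens if t == "base")
    return tokens[:n_pre], tokens[n_pre:]


if __name__ == "__main__":
    for tokens in asyncio.run(run_with_weight_swap(2)):
        print(split_at_swap(tokens))
```

Because the weights are replaced before `resume()` is called, no request can emit a base-weight token after the swap. That is what makes the post-swap part of each sequence comparable to output from a fresh engine loaded with the training model, provided generation itself is batch-invariant as the docstring requires.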