Comment fix for async rl example (#35244)

Signed-off-by: hao-aaron <ahao@anyscale.com>

Commit 4ee847e4, authored Mar 19, 2026 by Aaron Hao, committed by GitHub Mar 19, 2026
Showing with 27 additions and 14 deletions

examples/rl/rlhf_async_new_apis.py +27 -14

--- a/examples/rl/rlhf_async_new_apis.py
+++ b/examples/rl/rlhf_async_new_apis.py
@@ -2,25 +2,38 @@
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
 Demonstrates async reinforcement learning using vLLM and Ray,
-with native weight syncing APIs at engine instance.
+with native weight syncing APIs and batch-invariant generation.
 The script separates training and inference workloads onto distinct GPUs
 so that Ray can manage process placement and inter-process communication.
-A Hugging Face Transformer model occupies one GPU for training, whereas a
-2x tensor-parallel vLLM inference engine occupies two GPUs.
+A Hugging Face Transformer model occupies one GPU for training, and a
+vLLM AsyncLLMEngine occupies another GPU for inference.
+Batch invariance is enabled so that generation output is deterministic
+regardless of how many requests are batched together. This is required
+for the validation phase to succeed. Batch invariance currently requires
+NVIDIA GPUs with compute capability 9.0 or higher:
+  - H-series: H100, H200
+  - B-series: B100, B200
 The example performs the following steps:
-* Load the training model on one gpu (scheduled via ray)
-* Initialize the inference model with dummy weights across
-  two gpus using vLLM's tensor parallelism and Ray placement groups.
-* Generate gibberish from a list of prompts using the randomly initialized
-  inference engine.
-* Pause generation once generation completes for one sequence
-* Update the weights of the training model and broadcast the updated weights
-  to the inference engine by using a Ray collective RPC group.
-* Resume generation and print out the results
+* Load the training model (Qwen3-1.7B) on one GPU via a Ray actor.
+* Initialize the inference engine with a base model (Qwen3-1.7B-Base)
+  on a separate GPU using vLLM's AsyncLLMEngine with Ray as the
+  distributed executor backend.
+* Set up an NCCL-based weight transfer channel between the trainer
+  and the inference engine.
+* Submit generation requests for a batch of prompts.
+* Pause generation once any request reaches a token threshold.
+* Broadcast the training model's weights to the inference engine
+  via the NCCL weight transfer engine, replacing the base weights.
+* Resume generation and collect results, noting which tokens were
+  generated before vs. after the weight swap.
+* Validate correctness by launching a fresh vLLM instance loaded
+  directly with the training model and comparing its output to the
+  post-swap tokens from the weight-synced engine.
-This example assumes a single-node cluster with three GPUs, but Ray
+This example assumes a single-node cluster with two GPUs, but Ray
 supports multi-node clusters. vLLM expects the GPUs are only used for vLLM
 workloads. Residual GPU activity interferes with vLLM memory profiling and
 causes unexpected behavior.
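
The validation step described in the new docstring (splitting each request's tokens into pre-swap and post-swap parts, then comparing the post-swap part with a fresh engine's output) can be illustrated with a small, engine-free sketch. This is not the code from `rlhf_async_new_apis.py`: `GenerationRecord`, `post_swap_tokens`, and `matches_reference` are hypothetical names, and the sketch assumes the reference engine was asked to continue from the prompt plus the pre-swap tokens, so that its output lines up with the post-swap tokens.

```python
from dataclasses import dataclass


@dataclass
class GenerationRecord:
    """Hypothetical record of one request on the weight-synced engine."""

    token_ids: list[int]      # every generated token, in order
    tokens_before_swap: int   # tokens already generated when generation paused


def post_swap_tokens(record: GenerationRecord) -> list[int]:
    """Return only the tokens generated after the weight swap."""
    return record.token_ids[record.tokens_before_swap:]


def matches_reference(record: GenerationRecord, reference: list[int]) -> bool:
    """Check that post-swap tokens equal the reference engine's output.

    Assumption: `reference` is the continuation produced by a fresh engine
    loaded with the training model. An exact match is only expected when
    generation is deterministic, which is why the docstring calls for
    batch invariance.
    """
    return post_swap_tokens(record) == reference


if __name__ == "__main__":
    record = GenerationRecord(token_ids=[11, 12, 13, 14, 15], tokens_before_swap=2)
    print(post_swap_tokens(record))                    # tokens after the swap
    print(matches_reference(record, [13, 14, 15]))     # exact-match check
```

Comparing token IDs rather than decoded text keeps the check strict: two different token sequences can decode to the same string, which would hide a mismatch.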