OpenDAS / vllm_cscc
Commit 4ee847e4 (Unverified)
Authored Mar 19, 2026 by Aaron Hao
Committed by GitHub, Mar 19, 2026
Parent: 040a505f

Comment fix for async rl example (#35244)

Signed-off-by: hao-aaron <ahao@anyscale.com>

Showing 1 changed file with 27 additions and 14 deletions
examples/rl/rlhf_async_new_apis.py (+27 -14)
examples/rl/rlhf_async_new_apis.py @ 4ee847e4
@@ -2,25 +2,38 @@
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
 Demonstrates async reinforcement learning using vLLM and Ray,
-with native weight syncing APIs at engine instance.
+with native weight syncing APIs and batch-invariant generation.

 The script separates training and inference workloads onto distinct GPUs
 so that Ray can manage process placement and inter-process communication.
-A Hugging Face Transformer model occupies one GPU for training, whereas a
-2x tensor-parallel vLLM inference engine occupies two GPUs.
+A Hugging Face Transformer model occupies one GPU for training, and a
+vLLM AsyncLLMEngine occupies another GPU for inference.
+
+Batch invariance is enabled so that generation output is deterministic
+regardless of how many requests are batched together. This is required
+for the validation phase to succeed. Batch invariance currently requires
+NVIDIA GPUs with compute capability 9.0 or higher:
+- H-series: H100, H200
+- B-series: B100, B200

 The example performs the following steps:
-* Load the training model on one gpu (scheduled via ray)
-* Initialize the inference model with dummy weights across
-two gpus using vLLM's tensor parallelism and Ray placement groups.
-* Generate gibberish from a list of prompts using the randomly initialized
-inference engine.
-* Pause generation once generation completes for one sequence
-* Update the weights of the training model and broadcast the updated weights
-to the inference engine by using a Ray collective RPC group.
-* Resume generation and print out the results
-
-This example assumes a single-node cluster with three GPUs, but Ray
+* Load the training model (Qwen3-1.7B) on one GPU via a Ray actor.
+* Initialize the inference engine with a base model (Qwen3-1.7B-Base)
+on a separate GPU using vLLM's AsyncLLMEngine with Ray as the
+distributed executor backend.
+* Set up an NCCL-based weight transfer channel between the trainer
+and the inference engine.
+* Submit generation requests for a batch of prompts.
+* Pause generation once any request reaches a token threshold.
+* Broadcast the training model's weights to the inference engine
+via the NCCL weight transfer engine, replacing the base weights.
+* Resume generation and collect results, noting which tokens were
+generated before vs. after the weight swap.
+* Validate correctness by launching a fresh vLLM instance loaded
+directly with the training model and comparing its output to the
+post-swap tokens from the weight-synced engine.
+
+This example assumes a single-node cluster with two GPUs, but Ray
 supports multi-node clusters. vLLM expects the GPUs are only used for vLLM
 workloads. Residual GPU activity interferes with vLLM memory profiling and
 causes unexpected behavior.
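
The control flow that the new docstring describes (pause at a token threshold, swap weights, resume, then validate the post-swap tokens against a fresh instance) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `ToyEngine`, `run_with_weight_swap`, `validate`, `TOKEN_THRESHOLD`, and `MAX_TOKENS` are hypothetical names invented here, not vLLM or Ray APIs, and a "token" is just a deterministic function of the weights version, the prompt, and the position, which stands in for batch-invariant generation.

```python
import asyncio

TOKEN_THRESHOLD = 4  # hypothetical: pause once any request has this many tokens
MAX_TOKENS = 8       # hypothetical: tokens generated per request


class ToyEngine:
    """Stand-in for an inference engine. This is NOT vLLM's API."""

    def __init__(self, weights_version):
        self.weights_version = weights_version
        self._running = asyncio.Event()
        self._running.set()

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    def load_weights(self, weights_version):
        self.weights_version = weights_version

    async def generate(self, prompt, max_tokens, on_token=None):
        tokens = []
        while len(tokens) < max_tokens:
            await self._running.wait()  # blocks while the engine is paused
            # Deterministic "token": depends only on weights, prompt, position.
            tokens.append((self.weights_version, f"{prompt}:{len(tokens)}"))
            if on_token is not None:
                on_token(len(tokens))
            await asyncio.sleep(0)  # let other requests make progress
        return tokens


async def run_with_weight_swap(prompts):
    engine = ToyEngine("base")
    threshold_hit = asyncio.Event()

    def on_token(count):
        # Trigger the pause only once, on the first request to hit the threshold.
        if count >= TOKEN_THRESHOLD and not threshold_hit.is_set():
            engine.pause()
            threshold_hit.set()

    tasks = [
        asyncio.create_task(engine.generate(p, MAX_TOKENS, on_token))
        for p in prompts
    ]
    await threshold_hit.wait()
    engine.load_weights("trained")  # stands in for the weight broadcast
    engine.resume()
    return await asyncio.gather(*tasks)


async def validate(prompts, results):
    # A fresh engine loaded directly with the trained weights.
    fresh = ToyEngine("trained")
    for prompt, tokens in zip(prompts, results):
        reference = await fresh.generate(prompt, MAX_TOKENS)
        n_pre = sum(1 for version, _ in tokens if version == "base")
        # Only tokens generated after the swap are expected to match.
        if tokens[n_pre:] != reference[n_pre:]:
            return False
    return True


if __name__ == "__main__":
    prompts = ["a", "b"]
    results = asyncio.run(run_with_weight_swap(prompts))
    print(asyncio.run(validate(prompts, results)))
```

In this toy model the comparison is restricted to post-swap positions because the pre-swap tokens were produced by the base weights. The real example relies on batch invariance for the same comparison to be meaningful when requests are batched differently.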