Add Unsloth to RLHF.md (#21636)

41d3082c · Daniel Han · GitHub · 7cfea0df · 41d3082c
Unverified Commit 41d3082c authored Jul 25, 2025 by Daniel Han Committed by GitHub Jul 25, 2025
Show whitespace changes
Inline Side-by-side

Showing with 5 additions and 1 deletion

docs/training/rlhf.md docs/training/rlhf.md +5 -1

No files found.
--- a/docs/training/rlhf.md
+++ b/docs/training/rlhf.md
@@ -2,10 +2,14 @@

 Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors.

-vLLM can be used to generate the completions for RLHF. The best way to do this is with libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [verl](https://github.com/volcengine/verl).
+vLLM can be used to generate the completions for RLHF. Some ways to do this include using libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl) and [unsloth](https://github.com/unslothai/unsloth).

 See the following basic examples to get started if you don't want to use an existing library:

 - [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
 - [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
 - [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
+
+See the following notebooks showing how to use vLLM for GRPO:
+
+- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)