[doc] explain nccl requirements for rlhf (#11381)

Signed-off-by: youkaichao <youkaichao@gmail.com>

Commit 5d2248d8 (unverified), authored Dec 20, 2024 by youkaichao, committed by GitHub Dec 20, 2024

Showing with 1 addition and 0 deletions

docs/source/getting_started/debugging.rst +1 -0

--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -200,3 +200,4 @@ try this instead:
 Known Issues
 ----------------------------------------
 - In ``v0.5.2``, ``v0.5.3``, and ``v0.5.3.post1``, there is a bug caused by `zmq <https://github.com/zeromq/pyzmq/issues/2000>`_ , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of ``vllm`` to include the `fix <https://github.com/vllm-project/vllm/pull/6759>`_.
+- To work around an NCCL `bug <https://github.com/NVIDIA/nccl/issues/1234>`__, all vLLM processes set the environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. The ``cuMem`` allocator does not affect performance; it only gives memory benefits. External processes that set up an NCCL connection with vLLM's processes should also set this environment variable; otherwise, the inconsistent environment setup will cause NCCL to hang or crash, as observed in `the RLHF integration <https://github.com/OpenRLHF/OpenRLHF/pull/604>`__ and the `discussion <https://github.com/vllm-project/vllm/issues/5723#issuecomment-2554389656>`__.
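
For an external process (for example, an RLHF trainer) that will join an NCCL group with vLLM's processes, the requirement described in the added line can be sketched as follows. This is a minimal illustration, not code from vLLM or OpenRLHF: the helper name `ensure_nccl_cumem_disabled` is hypothetical, and the sketch assumes it runs before the process creates any NCCL communicator, since NCCL reads its environment variables at initialization.

```python
import os


def ensure_nccl_cumem_disabled(env=None):
    """Hypothetical helper: force NCCL_CUMEM_ENABLE=0 in the given mapping.

    Returns the previous value (or None if it was unset), so the caller
    can log a mismatch with the setting vLLM's processes use.
    """
    env = os.environ if env is None else env
    previous = env.get("NCCL_CUMEM_ENABLE")
    env["NCCL_CUMEM_ENABLE"] = "0"
    return previous


# Call this at the very top of the external process, before any library
# initializes NCCL (assumption: nothing has created a communicator yet).
previous_value = ensure_nccl_cumem_disabled()
if previous_value not in (None, "0"):
    print(f"Overrode NCCL_CUMEM_ENABLE={previous_value!r} to match vLLM")
```

Setting the variable in the launching shell or job script instead is equivalent, as long as every process that joins the NCCL group sees the same value.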