1. [GDRCopy](https://github.com/NVIDIA/gdrcopy) (v2.4 or later recommended) is a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology. *It requires installing a kernel module with root privileges.*
2. Hardware requirements
- GPUDirect RDMA capable devices, see [GPUDirect RDMA Documentation](https://docs.nvidia.com/cuda/gpudirect-rdma/)
Hardware requirements:
- GPUs within a node need to be connected by NVLink
- GPUs on different nodes need to be connected by RDMA devices, see [GPUDirect RDMA Documentation](https://docs.nvidia.com/cuda/gpudirect-rdma/)
- InfiniBand GPUDirect Async (IBGDA) support, see [IBGDA Overview](https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/)
- For more detailed requirements, see [NVSHMEM Hardware Specifications](https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/abstract.html#hardware-requirements)
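
As a rough way to check the requirements above on a given machine, the following sketch prints the GPU topology and the visible RDMA devices. It assumes `nvidia-smi` (shipped with the NVIDIA driver) and `ibv_devinfo` (from the RDMA userspace tools) are the tools available on your system; the helper function names are illustrative and not part of this project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the named tool is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print the GPU topology matrix. GPU pairs linked by NVLink show up
# as NV# entries (for example NV4) in the matrix.
check_nvlink() {
    if have_tool nvidia-smi; then
        nvidia-smi topo -m || echo "nvidia-smi failed: is the driver loaded?"
    else
        echo "nvidia-smi not found: cannot inspect GPU topology"
    fi
}

# List the RDMA devices visible to the verbs stack.
check_rdma() {
    if have_tool ibv_devinfo; then
        ibv_devinfo || echo "ibv_devinfo failed: no usable RDMA device?"
    else
        echo "ibv_devinfo not found: install the RDMA userspace tools"
    fi
}

check_nvlink
check_rdma
```

This only confirms that the links and devices are visible. It does not verify GPUDirect RDMA or IBGDA support, which are covered by the linked documentation.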
## Installation procedure
### 1. Install GDRCopy
GDRCopy requires kernel module installation on the host system. Complete these steps on the bare-metal host before container deployment:
```bash
gdrcopy_copybw  # Should show bandwidth test results
```
### 2. Acquiring NVSHMEM source code
### 1. Acquiring NVSHMEM source code
Download NVSHMEM v3.2.5 from the [NVIDIA NVSHMEM OPEN SOURCE PACKAGES](https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz).
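
Once the archive is downloaded, it can be unpacked along these lines. This is a minimal sketch: it assumes the file was saved to the current directory under the name that appears in the URL above, and `extract_nvshmem` is an illustrative helper, not part of NVSHMEM.

```shell
#!/usr/bin/env bash
# Archive name taken from the download URL; adjust if you saved it elsewhere.
ARCHIVE="nvshmem_src_3.2.5-1.txz"

# Hypothetical helper: unpack an xz-compressed tarball if it exists.
extract_nvshmem() {
    if [ -f "$1" ]; then
        tar -xJf "$1"
    else
        echo "archive not found: $1"
        return 1
    fi
}

extract_nvshmem "$ARCHIVE" || echo "Download the archive first, then re-run."
```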
### 3. Apply our custom patch
### 2. Apply our custom patch
Navigate to your NVSHMEM source directory and apply our provided patch:
...
...
@@ -75,7 +28,7 @@ Navigate to your NVSHMEM source directory and apply our provided patch:
### 3. Configure NVIDIA driver (required for inter-node communication)
Enable IBGDA by modifying `/etc/modprobe.d/nvidia.conf`:
...
...
@@ -92,26 +45,31 @@ sudo reboot
For more detailed configurations, please refer to the [NVSHMEM Installation Guide](https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/abstract.html).
### 5. Build and installation
### 4. Build and installation
The following example demonstrates building NVSHMEM with IBGDA support:
DeepEP uses NVLink for intra-node communication and IBGDA for inter-node communication. All other features are disabled to reduce dependencies.