[Feature][Performance] Implement NCCL wrapper for communicating NodeEmbeddings and sparse gradients. (#2825)

* Split NCCL wrapper from sparse optimizer and sparse embedding
* Add more unit tests for single node nccl
* Fix unit test for tf
* Switch to device histogram
* Fix histogram issues
* Finish migration to histogram
* Handle cases with zero send/receive data
* Start on partition object
* Get compiling
* Updates
* Add unit tests
* Switch to partition object
* Fix linting issues
* Rename partition file
* Add python doc
* Fix python assert and finish doxygen comments
* Remove stubs for range based partition to satisfy pylint
* Wrap unit test in GPU only
* Wrap explicit cuda call in ifdef
* Merge with partition.py
* update docstrings
* Cleanup partition_op
* Add Workspace object
* Switch to using workspace object
* Move last remainder based function out of nccl_api
* Add error messages
* Update docs with examples
* Fix lintin...
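The commit message refers to remainder-based partitioning, a device histogram of per-destination counts, and handling zero-sized send/receive buffers. The snippet below is a minimal, single-process Python sketch of that general pattern, not the code added in this commit. It assumes a modulo ownership rule (global index `i` is owned by rank `i % num_parts` and stored at local row `i // num_parts`). All function names are hypothetical and are not DGL's `dgl.cuda.nccl` API; plain Python loops stand in for the CUDA histogram kernel and the NCCL all-to-all exchange.

```python
from typing import List, Tuple

# Hypothetical helpers illustrating remainder partitioning + histogram-based
# bucketing; they are NOT part of DGL's dgl.cuda.nccl module.


def remainder_partition(indices: List[int], num_parts: int) -> List[int]:
    """Owner rank of each global index under a remainder (modulo) scheme."""
    return [idx % num_parts for idx in indices]


def send_counts(owners: List[int], num_parts: int) -> List[int]:
    """Histogram of items per destination rank; a zero count is valid."""
    counts = [0] * num_parts
    for owner in owners:
        counts[owner] += 1
    return counts


def exclusive_prefix_sum(counts: List[int]) -> List[int]:
    """Start offset of each destination's slice in the send buffer."""
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets


def bucket_by_owner(indices: List[int], values: List[float], num_parts: int):
    """Scatter (index, value) pairs into contiguous per-rank send slices."""
    owners = remainder_partition(indices, num_parts)
    counts = send_counts(owners, num_parts)
    offsets = exclusive_prefix_sum(counts)
    cursor = list(offsets)  # next free slot for each destination
    send_idx = [0] * len(indices)
    send_val = [0.0] * len(values)
    for idx, val, owner in zip(indices, values, owners):
        pos = cursor[owner]
        send_idx[pos], send_val[pos] = idx, val
        cursor[owner] += 1
    return send_idx, send_val, counts, offsets


def simulate_all_to_all_push(
    per_rank_indices: List[List[int]],
    per_rank_values: List[List[float]],
    num_parts: int,
) -> List[Tuple[List[int], List[float]]]:
    """Single-process stand-in for an all-to-all: each rank receives the pairs
    it owns, with global indices converted to local rows (idx // num_parts)."""
    received = [([], []) for _ in range(num_parts)]
    for indices, values in zip(per_rank_indices, per_rank_values):
        send_idx, send_val, counts, offsets = bucket_by_owner(indices, values, num_parts)
        for dst in range(num_parts):
            start, end = offsets[dst], offsets[dst] + counts[dst]
            # A zero count yields an empty slice, so nothing is sent to dst.
            received[dst][0].extend(i // num_parts for i in send_idx[start:end])
            received[dst][1].extend(send_val[start:end])
    return received


def accumulate_rows(local_idx: List[int], grads: List[float], num_rows: int) -> List[float]:
    """Sum received sparse gradients into the owner's local rows."""
    rows = [0.0] * num_rows
    for i, g in zip(local_idx, grads):
        rows[i] += g
    return rows


if __name__ == "__main__":
    # Rank 0 holds gradients for global rows 0, 3, 4; rank 1 for rows 1, 5.
    out = simulate_all_to_all_push([[0, 3, 4], [1, 5]], [[0.1, 0.2, 0.3], [1.0, 2.0]], 2)
    print(out)  # rank 1 sends nothing to rank 0 (zero-sized slice)
    print(accumulate_rows(out[1][0], out[1][1], 3))
```

In this example rank 1 owns none of rank 0's... rather, rank 0 owns none of rank 1's indices, so rank 1's send slice for rank 0 is empty; counting first and then using prefix-sum offsets keeps that case uniform with the non-empty ones.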
New files (all added with mode 100644):

* cmake/util/FindNccl.cmake
* python/dgl/cuda/__init__.py
* python/dgl/cuda/nccl.py
* src/partition/partition_op.h
* src/runtime/cuda/nccl_api.cu
* src/runtime/cuda/nccl_api.h
* src/runtime/workspace.h
* tests/compute/test_nccl.py