update doc for gpu&uva sampling (#3787)

Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>

Commit 8425c936, authored Feb 28, 2022 by Xin Yao; committed by GitHub Feb 28, 2022

Showing 1 changed file with 15 additions and 32 deletions

docs/source/guide/minibatch-gpu-sampling.rst +15 -32

--- a/docs/source/guide/minibatch-gpu-sampling.rst
+++ b/docs/source/guide/minibatch-gpu-sampling.rst
@@ -13,35 +13,15 @@ For example, `OGB Products <https://ogb.stanford.edu/docs/nodeprop/#ogbn-product
 a graph depends on the number of edges.  Therefore it is entirely possible to fit the
 whole graph onto GPU.

-Put the node features onto GPU memory
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If the node features can also fit onto GPU memory, it is recommended to put them onto GPU
-to reduce the time for data transfer from CPU to GPU, which usually becomes a bottleneck
-when using GPU for sampling. For exampling, in the above OGB Products, each node has
-100-dimensional features and they take less than 1GB memory in total. It is easy to
-transfer these features to GPU before training via the following code.
-
-.. code:: python
-
-   # pop the features and labels
-   features = g.ndata.pop('features')
-   labels = g.ndata.pop('labels')
-   # put them onto GPU
-   features = features.to('cuda:0')
-   labels = labels.to('cuda:0')
-
-If the node features are too large to fit onto GPU memory, :class:`~dgl.contrib.UnifiedTensor`
-enables GPU zero-copy access to the features stored on CPU memory and greatly reduces
-the time for data transfer from CPU to GPU.
-

 Using GPU-based neighborhood sampling in DGL data loaders
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 One can use GPU-based neighborhood sampling with DGL data loaders via:

-* Putting the graph onto GPU.
+* Put the graph onto GPU.
+
+* Put the ``train_nid`` onto GPU.

 * Set ``device`` argument to a GPU device.

@@ -54,9 +34,10 @@ the same as the other user guides and tutorials.
 .. code:: python

   g = g.to('cuda:0')
+   train_nid = train_nid.to('cuda:0')
   dataloader = dgl.dataloading.DataLoader(
       g,                                # The graph must be on GPU.
-       train_nid,
+       train_nid,                        # train_nid must be on GPU.
       sampler,
       device=torch.device('cuda:0'),    # The device argument must be GPU.
       num_workers=0,                    # Number of workers must be 0.
@@ -82,28 +63,31 @@ CUDA UVA (Unified Virtual Addressing)-based sampling, in which GPUs perform the
 on the graph pinned on CPU memory via zero-copy access.
 You can enable UVA-based neighborhood sampling in DGL data loaders via:

-* Pin the graph to page-locked memory via :func:`dgl.DGLGraph.pin_memory_`.
+* Put the ``train_nid`` onto GPU.

 * Set ``device`` argument to a GPU device.

 * Set ``num_workers`` argument to 0, because CUDA does not allow multiple processes
  accessing the same context.

+* Set ``use_uva=True``.
+
 All the other arguments for the :class:`~dgl.dataloading.DataLoader` can be
 the same as the other user guides and tutorials.

 .. code:: python

-   g = g.pin_memory_()
+   train_nid = train_nid.to('cuda:0')
   dataloader = dgl.dataloading.DataLoader(
-       g,                                # The graph must be pinned.
-       train_nid,
+       g,
+       train_nid,                        # train_nid must be on GPU.
       sampler,
       device=torch.device('cuda:0'),    # The device argument must be GPU.
       num_workers=0,                    # Number of workers must be 0.
       batch_size=1000,
       drop_last=False,
-       shuffle=True)
+       shuffle=True,
+       use_uva=True)                     # Set use_uva=True

 UVA-based sampling is the recommended solution for mini-batch training on large graphs,
 especially for multi-GPU training.
@@ -111,9 +95,8 @@ especially for multi-GPU training.
 .. note::

  To use UVA-based sampling in multi-GPU training, you should first materialize all the
-  necessary sparse formats of the graph and copy them to the shared memory explicitly
-  before spawning training processes. Then you should pin the shared graph in each training
-  process respectively. Refer to our `GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/multi_gpu_node_classification.py>`_ for more details.
+  necessary sparse formats of the graph before spawning training processes.
+  Refer to our `GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/multi_gpu_node_classification.py>`_ for more details.
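
The two checklists in this change (GPU-based sampling and UVA-based sampling) differ only in where the graph lives and whether ``use_uva`` is set. The following is a minimal sketch that restates those rules as plain Python so they can be compared side by side. ``SamplingConfig`` and ``check_sampling_config`` are hypothetical names introduced here for illustration and are not part of DGL; devices are represented as strings such as ``'cpu'`` or ``'cuda:0'``. The rule that the graph stays on CPU when ``use_uva=True`` is an assumption inferred from the guide's description of sampling on a graph held in CPU memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SamplingConfig:
    """Hypothetical record of the settings the guide lists (not a DGL type)."""
    graph_device: str      # where the graph lives, e.g. 'cpu' or 'cuda:0'
    indices_device: str    # where train_nid lives
    device: str            # the DataLoader ``device`` argument
    num_workers: int = 0
    use_uva: bool = False


def _on_gpu(device: str) -> bool:
    # Treat any 'cuda' or 'cuda:N' string as a GPU device.
    return device.startswith('cuda')


def check_sampling_config(cfg: SamplingConfig) -> List[str]:
    """Return the list of violated rules; an empty list means all rules hold."""
    problems = []
    if not _on_gpu(cfg.device):
        problems.append('device must be a GPU device')
    if not _on_gpu(cfg.indices_device):
        problems.append('train_nid must be on GPU')
    if cfg.num_workers != 0:
        problems.append('num_workers must be 0')
    if cfg.use_uva:
        # Assumption: UVA mode samples from a graph kept in CPU memory.
        if _on_gpu(cfg.graph_device):
            problems.append('with use_uva=True the graph stays on CPU')
    elif not _on_gpu(cfg.graph_device):
        problems.append('graph must be on GPU unless use_uva=True')
    return problems
```

For example, ``check_sampling_config(SamplingConfig('cpu', 'cuda:0', 'cuda:0', use_uva=True))`` reports no problems, while the same settings without ``use_uva=True`` report that the graph must be on GPU.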


 Using GPU-based neighbor sampling with DGL functions