Unverified commit 018df054, authored by Rhett Ying, committed by GitHub

[doc] update user guide 6.8 (#6658)

parent bdaa1309
.. _guide-minibatch-parallelism:

6.8 Data Loading Parallelism
----------------------------

In minibatch training of GNNs, generating a minibatch usually involves several
stages:

* Iterate over the item set and generate minibatch seeds of the given batch size.
* Sample negative items for each seed from the graph.
* Sample neighbors for each seed from the graph.
* Exclude seed edges from the sampled subgraphs.
* Fetch node and edge features for the sampled subgraphs.
* Convert the sampled subgraphs to DGLMiniBatches.
* Copy the DGLMiniBatches to the target device.

A typical pipeline that chains all of these stages looks as follows:

.. code:: python

   datapipe = gb.ItemSampler(itemset, batch_size=1024, shuffle=True)
   datapipe = datapipe.sample_uniform_negative(g, 5)
   datapipe = datapipe.sample_neighbor(g, [10, 10])  # 2 layers.
   datapipe = datapipe.transform(gb.exclude_seed_edges)
   datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
   datapipe = datapipe.to_dgl()
   datapipe = datapipe.copy_to(device)
   dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)
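
The data loader built above is consumed like any other PyTorch data loader; each
element it yields is a DGLMiniBatch that has already been copied to the target
device. The loop below is a minimal sketch, with the training step left as a
placeholder:

.. code:: python

   for minibatch in dataloader:
       # `minibatch` is a DGLMiniBatch residing on `device`.
       train_one_step(minibatch)  # placeholder for the user's training step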

All these stages are implemented as separate
`IterableDataPipe <https://pytorch.org/data/main/torchdata.datapipes.iter.html>`__
objects and stacked together with the
`PyTorch DataLoader <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader>`__.
This design allows us to easily customize the data loading process by
chaining different data pipes together. For example, to sample negative items
for each seed from the graph, we can simply chain a
:class:`~dgl.graphbolt.NegativeSampler` after the :class:`~dgl.graphbolt.ItemSampler`.
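
Conversely, a node classification pipeline can simply drop the negative sampling
stage while keeping the rest of the chain unchanged. The following sketch reuses
the stages shown above and is meant only as an illustration of recomposing the
pipeline:

.. code:: python

   datapipe = gb.ItemSampler(itemset, batch_size=1024, shuffle=True)
   datapipe = datapipe.sample_neighbor(g, [10, 10])  # no negative sampling stage
   datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
   datapipe = datapipe.to_dgl()
   datapipe = datapipe.copy_to(device)
   dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)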

However, simply chaining data pipes together incurs performance overheads, because
the different stages utilize different hardware resources such as the CPU, GPU and
PCIe bus. GraphBolt therefore optimizes the data loading mechanism to minimize
these overheads and achieve the best performance.

Specifically, GraphBolt wraps the data pipes before ``fetch_feature`` with
multiprocessing, which enables them to run in multiple processes in parallel. The
``fetch_feature`` data pipe itself is kept in the main process to avoid data
movement overheads between processes.
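
For instance, under the assumption that the ``num_workers`` argument of
:class:`~dgl.graphbolt.MultiProcessDataLoader` is what controls this multiprocessing
(the pipeline above uses ``num_workers=0``), a sketch of parallelizing the sampling
stages could look like this:

.. code:: python

   # Same datapipe as above, but with 4 worker processes for the stages
   # before ``fetch_feature``. The worker count is illustrative only.
   dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=4)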

Moreover, to overlap data movement with model computation, we wrap the data pipes
before ``copy_to`` with
`torchdata.datapipes.iter.Prefetcher <https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Prefetcher.html>`__,
which prefetches elements from the preceding data pipes and puts them into a buffer.
Such prefetching is completely transparent to users and requires no extra code, and
it brings a significant performance boost to minibatch training of GNNs.
Please refer to the source code of :class:`~dgl.graphbolt.MultiProcessDataLoader`
for more details.
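
As a standalone illustration of what ``Prefetcher`` does (this sketch uses plain
``torchdata`` data pipes and is not GraphBolt-specific), it keeps a background
buffer filled from the upstream data pipe so that downstream consumption overlaps
with upstream production:

.. code:: python

   from torchdata.datapipes.iter import IterableWrapper, Prefetcher

   # An upstream "sampling" stage, stood in for by a simple map.
   source = IterableWrapper(range(8)).map(lambda x: x * x)
   # Keep up to 4 elements ready in a background buffer.
   prefetched = Prefetcher(source, buffer_size=4)

   for item in prefetched:
       pass  # model computation would run here, overlapping with prefetching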

.. _guide-minibatch-prefetching:

6.8 Feature Prefetching
-----------------------

In minibatch training of GNNs, especially with neighbor sampling approaches, a large
number of node features often need to be copied to the device for GNN computation.
To mitigate this data movement bottleneck, DGL supports *feature prefetching* so
that model computation and data movement can happen in parallel.

Enabling Prefetching with DGL's Builtin Samplers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All the DGL samplers in :ref:`api-dataloading` allow users to specify which
node and edge data to prefetch via arguments like :attr:`prefetch_node_feats`.
For example, the following code asks :class:`dgl.dataloading.NeighborSampler` to prefetch
the node data named ``feat`` and save it to the ``srcdata`` of the first message flow
graph. It also asks the sampler to prefetch and save the node data named ``label``
to the ``dstdata`` of the last message flow graph:

.. code:: python

   graph = ...                 # the graph to sample from
   graph.ndata['feat'] = ...   # node features
   graph.ndata['label'] = ...  # node labels
   train_nids = ...            # a 1-D integer tensor of training node IDs

   # Create a sampler and specify what data to prefetch.
   sampler = dgl.dataloading.NeighborSampler(
       [15, 10, 5], prefetch_node_feats=['feat'], prefetch_labels=['label'])

   # Create a dataloader.
   dataloader = dgl.dataloading.DataLoader(
       graph, train_nids, sampler,
       batch_size=32,
       ...  # other arguments
   )

   for mini_batch in dataloader:
       # Unpack the mini batch.
       input_nodes, output_nodes, subgs = mini_batch
       # The following data has been prefetched.
       feat = subgs[0].srcdata['feat']
       label = subgs[-1].dstdata['label']
       train(subgs, feat, label)

.. note::

   Even without specifying the prefetch arguments, users can still access
   ``subgs[0].srcdata['feat']`` and ``subgs[-1].dstdata['label']`` because DGL
   internally keeps a reference to the node/edge data of the original graph when
   a subgraph is created. However, accessing subgraph features this way fetches
   the data from the original graph immediately, whereas prefetching ensures the
   data is already available when the minibatch comes out of the data loader.
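
As a small illustration of the difference, the same training loop still works when
the prefetch arguments are omitted; the features are then fetched from the parent
graph at the moment they are accessed rather than ahead of time:

.. code:: python

   # No prefetch arguments: feature access below triggers on-demand fetching.
   sampler = dgl.dataloading.NeighborSampler([15, 10, 5])
   dataloader = dgl.dataloading.DataLoader(
       graph, train_nids, sampler, batch_size=32)

   for input_nodes, output_nodes, subgs in dataloader:
       feat = subgs[0].srcdata['feat']      # fetched on demand here
       label = subgs[-1].dstdata['label']   # fetched on demand here
       train(subgs, feat, label)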

Enabling Prefetching in Custom Samplers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Users can implement their own rules of prefetching when writing custom samplers.
Here is the code of ``NeighborSampler`` with prefetching:

.. code:: python

   class NeighborSampler(dgl.dataloading.Sampler):
       def __init__(self,
                    fanouts: list[int],
                    prefetch_node_feats: list[str] = None,
                    prefetch_edge_feats: list[str] = None,
                    prefetch_labels: list[str] = None):
           super().__init__()
           self.fanouts = fanouts
           self.prefetch_node_feats = prefetch_node_feats
           self.prefetch_edge_feats = prefetch_edge_feats
           self.prefetch_labels = prefetch_labels

       def sample(self, g, seed_nodes):
           output_nodes = seed_nodes
           subgs = []
           for fanout in reversed(self.fanouts):
               # Sample a fixed number of neighbors of the current seed nodes.
               sg = g.sample_neighbors(seed_nodes, fanout)
               # Convert this subgraph to a message flow graph.
               sg = dgl.to_block(sg, seed_nodes)
               seed_nodes = sg.srcdata[dgl.NID]
               subgs.insert(0, sg)
           input_nodes = seed_nodes

           # Handle prefetching.
           dgl.set_src_lazy_features(subgs[0], self.prefetch_node_feats)
           dgl.set_dst_lazy_features(subgs[-1], self.prefetch_labels)
           for subg in subgs:
               dgl.set_edge_lazy_features(subg, self.prefetch_edge_feats)
           return input_nodes, output_nodes, subgs

Using :func:`~dgl.set_src_lazy_features`, :func:`~dgl.set_dst_lazy_features`
and :func:`~dgl.set_edge_lazy_features`, users can tell the ``DataLoader`` which
features to prefetch and where to save them (``srcdata``, ``dstdata`` or ``edata``).
See :ref:`guide-minibatch-customizing-neighborhood-sampler` for more details
on how to write a custom graph sampler.
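
The custom sampler above can then be used with :class:`~dgl.dataloading.DataLoader`
in the same way as the built-in one. The following sketch reuses ``graph`` and
``train_nids`` from the earlier example:

.. code:: python

   sampler = NeighborSampler(
       [15, 10, 5], prefetch_node_feats=['feat'], prefetch_labels=['label'])
   dataloader = dgl.dataloading.DataLoader(
       graph, train_nids, sampler, batch_size=32)

   for input_nodes, output_nodes, subgs in dataloader:
       feat = subgs[0].srcdata['feat']      # prefetched via set_src_lazy_features
       label = subgs[-1].dstdata['label']   # prefetched via set_dst_lazy_features
       train(subgs, feat, label)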
@@ -75,4 +75,4 @@ sampling:
     minibatch-nn
     minibatch-inference
     minibatch-gpu-sampling
-    minibatch-prefetching
+    minibatch-parallelism