.. _guide-minibatch-gpu-sampling:

6.8 Using GPU for Neighborhood Sampling
---------------------------------------

.. note::
  GraphBolt does not support GPU-based neighborhood sampling yet, so this guide
  uses :class:`~dgl.dataloading.DataLoader` for illustration.

Since 0.7, DGL has supported GPU-based neighborhood sampling, which has a significant
speed advantage over CPU-based neighborhood sampling.  If you estimate that your graph
can fit onto GPU memory and your model does not take a lot of GPU memory, then it is
best to put the graph onto GPU memory and use GPU-based neighborhood sampling.

For example, `OGB Products <https://ogb.stanford.edu/docs/nodeprop/#ogbn-products>`_ has
2.4M nodes and 61M edges.  The graph takes less than 1GB of memory, since the memory
consumption of a graph depends mostly on the number of edges.  It is therefore entirely
possible to fit the whole graph onto GPU.
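The under-1GB figure is easy to sanity check with a back-of-envelope calculation
(assuming a CSR layout with 64-bit indices, which is an illustrative assumption
about the storage format; DGL's actual layout may differ):

.. code:: python

   # Rough memory estimate for ogbn-products, assuming CSR with int64
   # indices (an illustrative assumption, not DGL's exact layout).
   num_nodes = 2_400_000
   num_edges = 61_000_000
   index_bytes = 8  # int64

   indptr = (num_nodes + 1) * index_bytes   # one offset per node
   indices = num_edges * index_bytes        # one entry per edge
   total_gb = (indptr + indices) / 1024**3

   print(f"~{total_gb:.2f} GB")  # → ~0.47 GB, comfortably under 1GB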


Using GPU-based neighborhood sampling in DGL data loaders
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One can use GPU-based neighborhood sampling with DGL data loaders via:

* Put the graph onto GPU.

* Put the ``train_nid`` onto GPU.

* Set ``device`` argument to a GPU device.

* Set ``num_workers`` argument to 0, because CUDA does not allow multiple processes
  accessing the same context.

All the other arguments for the :class:`~dgl.dataloading.DataLoader` can be
the same as in the other user guides and tutorials.

.. code:: python

   g = g.to('cuda:0')
   train_nid = train_nid.to('cuda:0')
   dataloader = dgl.dataloading.DataLoader(
       g,                                # The graph must be on GPU.
       train_nid,                        # train_nid must be on GPU.
       sampler,
       device=torch.device('cuda:0'),    # The device argument must be GPU.
       num_workers=0,                    # Number of workers must be 0.
       batch_size=1000,
       drop_last=False,
       shuffle=True)
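Once constructed, iterating the data loader yields blocks that already live on the
sampling device, so the graph structure needs no per-batch copy. The following
end-to-end sketch uses a made-up random graph, random features, and a two-layer
GraphSAGE model purely for illustration, and falls back to CPU sampling when no
GPU is available:

.. code:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F
   import dgl
   import dgl.nn as dglnn

   # Fall back to CPU when CUDA is unavailable (for illustration only).
   device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

   # A made-up random graph with random features and labels.
   g = dgl.rand_graph(100, 500).to(device)
   g.ndata['feat'] = torch.randn(100, 16, device=device)
   g.ndata['label'] = torch.randint(0, 4, (100,), device=device)
   train_nid = torch.arange(100, device=device)

   sampler = dgl.dataloading.NeighborSampler([5, 5])
   dataloader = dgl.dataloading.DataLoader(
       g, train_nid, sampler,
       device=device, num_workers=0,
       batch_size=32, shuffle=True, drop_last=False)

   class SAGE(nn.Module):
       def __init__(self):
           super().__init__()
           self.conv1 = dglnn.SAGEConv(16, 32, 'mean')
           self.conv2 = dglnn.SAGEConv(32, 4, 'mean')

       def forward(self, blocks, x):
           x = F.relu(self.conv1(blocks[0], x))
           return self.conv2(blocks[1], x)

   model = SAGE().to(device)
   opt = torch.optim.Adam(model.parameters())

   for input_nodes, output_nodes, blocks in dataloader:
       # blocks are already on `device`; just read features and labels.
       x = blocks[0].srcdata['feat']
       y = blocks[-1].dstdata['label']
       loss = F.cross_entropy(model(blocks, x), y)
       opt.zero_grad()
       loss.backward()
       opt.step()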

.. note::

  GPU-based neighbor sampling also works for custom neighborhood samplers as long as
  (1) your sampler is subclassed from :class:`~dgl.dataloading.BlockSampler`, and (2)
  your sampler works entirely on GPU.
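As a sketch of what such a sampler can look like, here is a made-up single-hop
sampler (the built-in :class:`~dgl.dataloading.NeighborSampler` follows the same
pattern over multiple hops). It only uses GPU-capable ops, so it satisfies both
requirements when the graph and seed nodes live on GPU:

.. code:: python

   import dgl
   import torch
   from dgl.dataloading import BlockSampler

   class OneHopSampler(BlockSampler):
       """A made-up single-hop sampler for illustration."""

       def __init__(self, fanout):
           super().__init__()
           self.fanout = fanout

       def sample_blocks(self, g, seed_nodes, exclude_eids=None):
           # Both ops below run on GPU when g and seed_nodes are on GPU.
           frontier = dgl.sampling.sample_neighbors(g, seed_nodes, self.fanout)
           block = dgl.to_block(frontier, seed_nodes)
           return block.srcdata[dgl.NID], seed_nodes, [block]

   # Usage on a made-up CPU graph; move g and the seeds to 'cuda:0'
   # for GPU-based sampling.
   g = dgl.rand_graph(50, 200)
   sampler = OneHopSampler(fanout=5)
   input_nodes, output_nodes, blocks = sampler.sample_blocks(g, torch.arange(10))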


Using CUDA UVA-based neighborhood sampling in DGL data loaders
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   New feature introduced in DGL 0.8.

For the case where the graph is too large to fit onto GPU memory, we introduce
CUDA UVA (Unified Virtual Addressing)-based sampling, in which GPUs perform the sampling
on the graph pinned in CPU memory via zero-copy access.

You can enable UVA-based neighborhood sampling in DGL data loaders via:

* Put the ``train_nid`` onto GPU.

* Set ``device`` argument to a GPU device.

* Set ``num_workers`` argument to 0, because CUDA does not allow multiple processes
  accessing the same context.

* Set ``use_uva=True``.

All the other arguments for the :class:`~dgl.dataloading.DataLoader` can be
the same as in the other user guides and tutorials.

.. code:: python

   train_nid = train_nid.to('cuda:0')
   dataloader = dgl.dataloading.DataLoader(
       g,
       train_nid,                        # train_nid must be on GPU.
       sampler,
       device=torch.device('cuda:0'),    # The device argument must be GPU.
       num_workers=0,                    # Number of workers must be 0.
       batch_size=1000,
       drop_last=False,
       shuffle=True,
       use_uva=True)                     # Set use_uva=True

UVA-based sampling is the recommended solution for mini-batch training on large graphs,
especially for multi-GPU training.

.. note::

  To use UVA-based sampling in multi-GPU training, you should first materialize all the
  necessary sparse formats of the graph before spawning training processes.
  Refer to our `GraphSAGE example <https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/multi_gpu_node_classification.py>`_ for more details.
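Materializing the sparse formats boils down to one :meth:`create_formats_` call
in the parent process. The sketch below uses a made-up random graph and merely
illustrates the ordering; the ``train`` function referenced in the comment is
hypothetical:

.. code:: python

   import dgl
   import torch

   g = dgl.rand_graph(100, 500)   # made-up graph for illustration
   train_nid = torch.arange(100)

   # Materialize COO/CSR/CSC once, *before* any training process is
   # spawned, so child processes do not each rebuild the formats.
   g.create_formats_()
   print(g.formats()['created'])  # all three formats now exist

   # ... then spawn one training process per GPU, e.g. with
   # torch.multiprocessing.spawn(train, args=(g, train_nid), nprocs=num_gpus),
   # where each (hypothetical) `train` builds its own DataLoader with
   # use_uva=True on its own device.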


UVA and GPU support for PinSAGESampler/RandomWalkNeighborSampler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PinSAGESampler and RandomWalkNeighborSampler support UVA and GPU sampling.
You can enable them via:

* Pin the graph (for UVA sampling) or put the graph onto GPU (for GPU sampling).

* Put the ``train_nid`` onto GPU.

.. code:: python

  g = dgl.heterograph({
      ('item', 'bought-by', 'user'): ([0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 2, 3, 2, 3]),
      ('user', 'bought', 'item'): ([0, 1, 0, 1, 2, 3, 2, 3], [0, 0, 1, 1, 2, 2, 3, 3])})

  # UVA setup (alternative 1): pin the graph in CPU memory.
  # g.create_formats_()
  # g.pin_memory_()

  # GPU setup (alternative 2): move the graph onto GPU.
  device = torch.device('cuda:0')
  g = g.to(device)

  sampler1 = dgl.sampling.PinSAGESampler(g, 'item', 'user', 4, 0.5, 3, 2)
  sampler2 = dgl.sampling.RandomWalkNeighborSampler(g, 4, 0.5, 3, 2, ['bought-by', 'bought'])

  train_nid = torch.tensor([0, 2], dtype=g.idtype, device=device)
  sampler1(train_nid)
  sampler2(train_nid)


Using GPU-based neighbor sampling with DGL functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can build your own GPU sampling pipelines with the following functions that support
operating on GPU:

* :func:`dgl.sampling.sample_neighbors`
* :func:`dgl.sampling.random_walk`

Subgraph extraction ops:

* :func:`dgl.node_subgraph`
* :func:`dgl.edge_subgraph`
* :func:`dgl.in_subgraph`
* :func:`dgl.out_subgraph`

Graph transform ops for subgraph construction:

* :func:`dgl.to_block`
* :func:`dgl.compact_graphs`
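
For instance, :func:`dgl.sampling.sample_neighbors` and :func:`dgl.to_block`
compose into a bare-bones one-hop pipeline. The sketch below uses a made-up
random graph; it runs on GPU when the graph and seeds are moved there, and on
CPU otherwise:

.. code:: python

   import dgl
   import torch

   device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

   g = dgl.rand_graph(100, 500).to(device)   # made-up graph for illustration
   seeds = torch.arange(10, device=device)

   # One hop of neighbor sampling; runs on GPU when g and seeds are on GPU.
   frontier = dgl.sampling.sample_neighbors(g, seeds, fanout=5)

   # Convert the sampled frontier into a block (MFG) for message passing.
   block = dgl.to_block(frontier, seeds)
   print(block.num_dst_nodes())   # 10, one per seed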