.. _guide-minibatch-gpu-sampling:

6.7 Using GPU for Neighborhood Sampling
---------------------------------------

Since version 0.7, DGL has supported GPU-based neighborhood sampling, which has
a significant speed advantage over CPU-based neighborhood sampling.  If you
estimate that your graph can fit onto the GPU and your model does not take a
lot of GPU memory, then it is best to put the graph onto GPU memory and use
GPU-based neighbor sampling.

For example, `OGB Products `_ has 2.4M nodes and 61M edges.  The graph takes
less than 1GB of memory, since the memory consumption of a graph depends mostly
on the number of edges.  Therefore it is entirely possible to fit the whole
graph onto the GPU.

Using GPU-based neighborhood sampling in DGL data loaders
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One can use GPU-based neighborhood sampling with DGL data loaders as follows:

* Put the graph onto GPU.

* Put the ``train_nid`` onto GPU.

* Set the ``device`` argument to a GPU device.

* Set the ``num_workers`` argument to 0, because CUDA does not allow multiple
  processes to access the same context.

All the other arguments for the :class:`~dgl.dataloading.DataLoader` can be the
same as in the other user guides and tutorials.

.. code:: python

   g = g.to('cuda:0')
   train_nid = train_nid.to('cuda:0')
   dataloader = dgl.dataloading.DataLoader(
       g,                                # The graph must be on GPU.
       train_nid,                        # train_nid must be on GPU.
       sampler,
       device=torch.device('cuda:0'),    # The device argument must be GPU.
       num_workers=0,                    # Number of workers must be 0.
       batch_size=1000,
       drop_last=False,
       shuffle=True)

.. note::

   GPU-based neighbor sampling also works for custom neighborhood samplers as
   long as (1) your sampler is subclassed from
   :class:`~dgl.dataloading.BlockSampler`, and (2) your sampler works entirely
   on GPU.

Using CUDA UVA-based neighborhood sampling in DGL data loaders
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   New feature introduced in DGL 0.8.

For the case where the graph is too large to fit onto GPU memory, we introduce
CUDA UVA (Unified Virtual Addressing)-based sampling, in which GPUs perform the
sampling on a graph pinned in CPU memory via zero-copy access.

You can enable UVA-based neighborhood sampling in DGL data loaders as follows:

* Put the ``train_nid`` onto GPU.

* Set the ``device`` argument to a GPU device.

* Set the ``num_workers`` argument to 0, because CUDA does not allow multiple
  processes to access the same context.

* Set ``use_uva=True``.

All the other arguments for the :class:`~dgl.dataloading.DataLoader` can be the
same as in the other user guides and tutorials.

.. code:: python

   train_nid = train_nid.to('cuda:0')
   dataloader = dgl.dataloading.DataLoader(
       g,
       train_nid,                        # train_nid must be on GPU.
       sampler,
       device=torch.device('cuda:0'),    # The device argument must be GPU.
       num_workers=0,                    # Number of workers must be 0.
       batch_size=1000,
       drop_last=False,
       shuffle=True,
       use_uva=True)                     # Set use_uva=True.

UVA-based sampling is the recommended solution for mini-batch training on large
graphs, especially for multi-GPU training.

.. note::

   To use UVA-based sampling in multi-GPU training, you should first
   materialize all the necessary sparse formats of the graph before spawning
   the training processes.  Refer to our `GraphSAGE example `_ for more
   details.
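One way to materialize the sparse formats up front is to call
``g.create_formats_()`` in the main process before spawning the per-GPU
workers.  The following is a minimal sketch under that assumption, using a
``torch.multiprocessing.spawn``-based setup; the ``run`` entry point and
``n_gpus`` are hypothetical names, not part of the DGL API.

.. code:: python

   import torch.multiprocessing as mp

   # Materialize all sparse formats (COO/CSR/CSC) once in the main process so
   # that the spawned training processes share them instead of rebuilding them.
   g.create_formats_()

   # Hypothetical per-GPU training entry point; `run` and `n_gpus` are
   # placeholders for your own training function and GPU count.
   mp.spawn(run, args=(n_gpus, g, train_nid), nprocs=n_gpus)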
Using GPU-based neighbor sampling with DGL functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can build your own GPU sampling pipelines with the following functions that
support operating on GPU:

* :func:`dgl.sampling.sample_neighbors`

  * Only uniform sampling is supported; non-uniform sampling can only run on
    CPU.

Subgraph extraction ops:

* :func:`dgl.node_subgraph`
* :func:`dgl.edge_subgraph`
* :func:`dgl.in_subgraph`
* :func:`dgl.out_subgraph`

Graph transform ops for subgraph construction:

* :func:`dgl.to_block`
* :func:`dgl.compact_graphs`
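As an illustration, below is a minimal sketch of a hand-rolled two-layer GPU
sampling pipeline built from :func:`dgl.sampling.sample_neighbors` and
:func:`dgl.to_block`.  The fanouts and the variable names ``g``, ``train_nid``,
and ``blocks`` are assumptions made for this example, not requirements of the
API.

.. code:: python

   import dgl

   g = g.to('cuda:0')                    # The graph must be on GPU.
   seed_nodes = train_nid.to('cuda:0')   # Output nodes of one mini-batch, on GPU.

   fanouts = [15, 10]                    # Hypothetical fanouts for a 2-layer model.
   blocks = []
   # Sample from the output layer backward to the input layer.
   for fanout in reversed(fanouts):
       # Uniformly sample `fanout` in-neighbors of the current seeds on the GPU.
       frontier = dgl.sampling.sample_neighbors(g, seed_nodes, fanout)
       # Convert the sampled frontier into a bipartite block for message passing.
       block = dgl.to_block(frontier, dst_nodes=seed_nodes)
       # The block's source nodes become the seeds of the next (earlier) layer.
       seed_nodes = block.srcdata[dgl.NID]
       blocks.insert(0, block)

The resulting ``blocks`` list can then be fed to a model in the same way as the
blocks produced by a built-in DGL sampler.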