Unverified Commit b78acd67 authored by David Min, committed by GitHub

[Doc] Add an official documentation of UnifiedTensor (#3194)



* Add pytorch-direct version

* remove

* add documentation for UnifiedTensor

* Revert "add documentation for UnifiedTensor"

This reverts commit 63ba42644d4aba197c1cb4ea4b85fa1bc43b8849.

* add UnifiedTensor documentation

* Update python/dgl/contrib/unified_tensor.py
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
Co-authored-by: shhssdm <shhssdm@gmail.com>
parent f7ce2671
.. _apiunifiedtensor:
dgl.contrib.UnifiedTensor
=========================
.. automodule:: dgl.contrib
UnifiedTensor enables direct CPU memory access from the GPU.
This feature is especially useful when GPUs need to access sparse data structures stored in CPU memory, for example when node features do not fit in GPU memory.
Without this feature, sparsely structured data located in CPU memory must be gathered (or packed) before it is transferred to GPU memory, because GPU DMA engines can only transfer data at a block granularity.
However, the gathering step wastes CPU cycles and increases the CPU-to-GPU data copy time.
The goal of UnifiedTensor is to skip this CPU gathering step by letting GPUs access even non-regular data in CPU memory directly.
At the hardware level, this capability is enabled by NVIDIA GPUs' unified virtual address (UVM) and zero-copy access capabilities.
Those who wish to further extend the capability of UnifiedTensor may read the following paper (`link <https://arxiv.org/abs/2103.03330>`_), which explains the underlying mechanism of UnifiedTensor in detail.
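A minimal usage sketch is shown below. It assumes the PyTorch backend and a single default CUDA device; the tensor shapes and variable names are illustrative only.

.. code-block:: python

    import torch
    import dgl

    # Node features kept in CPU memory. Declaring a UnifiedTensor pins the
    # tensor and maps it into the GPU's address space for zero-copy access.
    feats = torch.rand((1000, 128))
    unified_feats = dgl.contrib.UnifiedTensor(feats, device=torch.device('cuda'))

    # A GPU index tensor triggers zero-copy (direct) access from the GPU, so
    # no CPU-side gather or separate packed host-to-device copy is issued.
    idx = torch.randint(0, 1000, (64,), device='cuda')
    sub_feats = unified_feats[idx]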
UnifiedTensor Class
---------------------------
.. autoclass:: UnifiedTensor
:members: __getitem__
@@ -13,4 +13,5 @@ API Reference
nn
dgl.ops
dgl.sampling
dgl.contrib.UnifiedTensor
udf
@@ -48,6 +48,7 @@ Welcome to Deep Graph Library Tutorials and Documentation
api/python/dgl.optim
api/python/dgl.sampling
api/python/dgl.multiprocessing
api/python/dgl.contrib.UnifiedTensor
api/python/udf
.. toctree::
@@ -7,6 +7,10 @@ from .. import utils
class UnifiedTensor: #UnifiedTensor
'''Class for storing unified tensor. Declaration of
UnifiedTensor automatically pins the input tensor.
Upon a successful declaration of UnifiedTensor, the
target GPU device will have the address mapping of the
input CPU tensor for zero-copy (direct) access over
external interconnects (e.g., PCIe).
Parameters
----------
@@ -14,7 +18,51 @@ class UnifiedTensor: #UnifiedTensor
Tensor which we want to convert into the
unified tensor.
device : device
GPU to create the address mapping of the input CPU tensor.
Examples
--------
With a given CPU tensor ``feats``, a new UnifiedTensor targeting the default
GPU can be created as follows:
>>> feats = torch.rand((128,128))
>>> feats = dgl.contrib.UnifiedTensor(feats, device=torch.device('cuda'))
Now, the elements of the new tensor ``feats`` can be accessed with ``[]``
indexing. The device of the index tensor acts as a switch that triggers
zero-copy access from the GPU. For example, to use ordinary CPU-based
data access, one can use a CPU index tensor as follows:
>>> idx = torch.LongTensor([0,1,2])
>>> output = feats[idx]
Now, to let the GPU perform zero-copy access, move the index tensor to the GPU:
>>> idx = torch.LongTensor([0,1,2]).to('cuda')
>>> output = feats[idx]
For multi-GPU operation, i.e., to allow multiple GPUs to access the original CPU tensor
``feats`` through UnifiedTensor, one can do the following:
>>> feats = torch.rand((128,128))
>>> feats_gpu0 = dgl.contrib.UnifiedTensor(feats, device=torch.device('cuda:0'))
>>> feats_gpu1 = dgl.contrib.UnifiedTensor(feats, device=torch.device('cuda:1'))
>>> feats_gpu2 = dgl.contrib.UnifiedTensor(feats, device=torch.device('cuda:2'))
Now, the ``cuda:0``, ``cuda:1``, and ``cuda:2`` devices will be able to access the
same tensor located in CPU memory through the ``feats_gpu0``, ``feats_gpu1``, and ``feats_gpu2`` tensors, respectively.
One can simply use the following operations to slice sub-tensors directly into the different GPU devices.
>>> feats_idx_gpu0 = torch.randint(0, 128, (16,), device='cuda:0')
>>> feats_idx_gpu1 = torch.randint(0, 128, (16,), device='cuda:1')
>>> feats_idx_gpu2 = torch.randint(0, 128, (16,), device='cuda:2')
>>> sub_feat_gpu0 = feats_gpu0[feats_idx_gpu0]
>>> sub_feat_gpu1 = feats_gpu1[feats_idx_gpu1]
>>> sub_feat_gpu2 = feats_gpu2[feats_idx_gpu2]
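For reference only (this comparison is not part of the UnifiedTensor API and assumes
the PyTorch backend), the conventional path without UnifiedTensor would gather the
selected rows on the CPU first and then copy the packed block to the GPU, which is
exactly the step the zero-copy access above avoids:
>>> idx_cpu = feats_idx_gpu0.to('cpu')
>>> sub_feat_copied = feats[idx_cpu].to('cuda:0')  # CPU gather, then block copy to the GPU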
'''
def __init__(self, input, device):