[GraphBolt][Tutorial] MultiGPU tutorial as a notebook (#7134)

b7b19780 · Muhammed Fatih BALIN · GitHub · e5b92d2b · b7b19780 · b7b19780
Unverified Commit b7b19780 authored Apr 06, 2024 by Muhammed Fatih BALIN Committed by GitHub Apr 06, 2024
3 changed files
--- a/docs/source/stochastic_training/index.rst
+++ b/docs/source/stochastic_training/index.rst
@@ -44,4 +44,5 @@ Scenarios
  neighbor_sampling_overview.nblink
  node_classification.nblink
  link_prediction.nblink
+  multigpu_node_classification.nblink
  ondisk-dataset.rst
--- a/docs/source/stochastic_training/multigpu_node_classification.nblink
+++ b/docs/source/stochastic_training/multigpu_node_classification.nblink
+{
+    "path": "../../../notebooks/stochastic_training/multigpu_node_classification.ipynb"
+}
--- a/notebooks/stochastic_training/multigpu_node_classification.ipynb
+++ b/notebooks/stochastic_training/multigpu_node_classification.ipynb
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "2ppSJal9At7-"
+      },
+      "source": [
+        "# Multi-GPU Node Classification\n",
+        "\n",
+        "This tutorial shows how to train a multi-layer GraphSAGE for node classification on the `ogbn-products` dataset provided by [Open Graph\n",
+        "Benchmark (OGB)](https://ogb.stanford.edu/). The dataset contains around 2.4 million nodes and 62 million edges.\n",
+        "\n",
+        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/stochastic_training/multigpu_node_classification.ipynb)\n",
+        "[![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/stochastic_training/multigpu_node_classification.ipynb)\n",
+        "\n",
+        "By the end of this tutorial, you will be able to\n",
+        "\n",
+        "- Train a GNN model for node classification on multiple GPUs with DGL's neighbor sampling components. After learning how to use multiple GPUs, you will\n",
+        "be able to extend it to other scenarios such as link prediction."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "mzZKrVVk6Y_8"
+      },
+      "source": [
+        "## Install DGL package and other dependencies"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {
+        "id": "QTCc1RrD_5Id"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
+            "Looking in links: https://data.dgl.ai/wheels-test/cu121/repo.html\n",
+            "Requirement already satisfied: dgl in /localscratch/dgl-3/python (2.1)\n",
+            "Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from dgl) (1.24.4)\n",
+            "Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from dgl) (1.11.4)\n",
+            "Requirement already satisfied: networkx>=2.1 in /usr/local/lib/python3.10/dist-packages (from dgl) (2.6.3)\n",
+            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from dgl) (2.31.0)\n",
+            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from dgl) (4.66.1)\n",
+            "Requirement already satisfied: psutil>=5.8.0 in /usr/local/lib/python3.10/dist-packages (from dgl) (5.9.4)\n",
+            "Requirement already satisfied: torchdata>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from dgl) (0.7.0a0)\n",
+            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->dgl) (3.3.2)\n",
+            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->dgl) (3.6)\n",
+            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->dgl) (1.26.18)\n",
+            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->dgl) (2023.11.17)\n",
+            "Requirement already satisfied: torch>=2 in /usr/local/lib/python3.10/dist-packages (from torchdata>=0.5.0->dgl) (2.2.0a0+81ea7a4)\n",
+            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=2->torchdata>=0.5.0->dgl) (3.13.1)\n",
+            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=2->torchdata>=0.5.0->dgl) (4.8.0)\n",
+            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=2->torchdata>=0.5.0->dgl) (1.12)\n",
+            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=2->torchdata>=0.5.0->dgl) (3.1.2)\n",
+            "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=2->torchdata>=0.5.0->dgl) (2023.12.0)\n",
+            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=2->torchdata>=0.5.0->dgl) (2.1.3)\n",
+            "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=2->torchdata>=0.5.0->dgl) (1.3.0)\n",
+            "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
+            "\u001b[0m\n",
+            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
+            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n",
+            "Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com\n",
+            "Requirement already satisfied: torchmetrics in /usr/local/lib/python3.10/dist-packages (1.3.0.post0)\n",
+            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (0.70.16)\n",
+            "Requirement already satisfied: numpy>1.20.0 in /usr/local/lib/python3.10/dist-packages (from torchmetrics) (1.24.4)\n",
+            "Requirement already satisfied: packaging>17.1 in /usr/local/lib/python3.10/dist-packages (from torchmetrics) (23.2)\n",
+            "Requirement already satisfied: torch>=1.10.0 in /usr/local/lib/python3.10/dist-packages (from torchmetrics) (2.2.0a0+81ea7a4)\n",
+            "Requirement already satisfied: lightning-utilities>=0.8.0 in /usr/local/lib/python3.10/dist-packages (from torchmetrics) (0.10.1)\n",
+            "Requirement already satisfied: dill>=0.3.8 in /usr/local/lib/python3.10/dist-packages (from multiprocess) (0.3.8)\n",
+            "Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from lightning-utilities>=0.8.0->torchmetrics) (68.2.2)\n",
+            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from lightning-utilities>=0.8.0->torchmetrics) (4.8.0)\n",
+            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->torchmetrics) (3.13.1)\n",
+            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->torchmetrics) (1.12)\n",
+            "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->torchmetrics) (2.6.3)\n",
+            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->torchmetrics) (3.1.2)\n",
+            "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->torchmetrics) (2023.12.0)\n",
+            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.10.0->torchmetrics) (2.1.3)\n",
+            "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.10.0->torchmetrics) (1.3.0)\n",
+            "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
+            "\u001b[0m\n",
+            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n",
+            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n",
+            "DGL installed!\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Install required packages.\n",
+        "import os\n",
+        "import torch\n",
+        "os.environ['TORCH'] = torch.__version__\n",
+        "os.environ['DGLBACKEND'] = \"pytorch\"\n",
+        "\n",
+        "# Install the CUDA version. If you want to install CPU version, please\n",
+        "# refer to https://www.dgl.ai/pages/start.html.\n",
+        "!pip install --pre dgl -f https://data.dgl.ai/wheels-test/cu121/repo.html\n",
+        "!pip install torchmetrics multiprocess\n",
+        "\n",
+        "try:\n",
+        "    import dgl\n",
+        "    import dgl.graphbolt as gb\n",
+        "    installed = True\n",
+        "except ImportError as error:\n",
+        "    installed = False\n",
+        "    print(error)\n",
+        "print(\"DGL installed!\" if installed else \"DGL not found!\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "q7GrcJTnZQjt"
+      },
+      "source": [
+        "## Defining Neighbor Sampler and Data Loader in DGL\n",
+        "\n",
+        "The major difference from the previous tutorial is that we will use `DistributedItemSampler` instead of `ItemSampler` to sample mini-batches of nodes. `DistributedItemSampler` is a distributed version of `ItemSampler` that works with `DistributedDataParallel`. It is implemented as a wrapper around `ItemSampler` and will sample the same minibatch on all replicas. It also supports dropping the last non-full minibatch to avoid the need for padding.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "id": "eel0Wn_aEYAd"
+      },
+      "outputs": [],
+      "source": [
+        "def create_dataloader(graph, features, itemset, device, is_train):\n",
+        "    datapipe = gb.DistributedItemSampler(\n",
+        "        item_set=itemset,\n",
+        "        batch_size=1024,\n",
+        "        drop_last=is_train,\n",
+        "        shuffle=is_train,\n",
+        "        drop_uneven_inputs=is_train,\n",
+        "    )\n",
+        "    datapipe = datapipe.copy_to(device, extra_attrs=[\"seed_nodes\"])\n",
+        "    # Now that we have moved to device, sample_neighbor and fetch_feature steps\n",
+        "    # will be executed on GPUs.\n",
+        "    datapipe = datapipe.sample_neighbor(graph, [10, 10, 10])\n",
+        "    datapipe = datapipe.fetch_feature(features, node_feature_keys=[\"feat\"])\n",
+        "    return gb.DataLoader(datapipe)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "uswPlvOLF1IX"
+      },
+      "source": [
+        "## Weighted reduction across GPUs\n",
+        "\n",
+        "As the different GPUs might process differing numbers of data points, we define a function to compute the exact average of values such as loss or accuracy in a\n",
+        "weighted manner.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "id": "VXP0hmzVGKnp"
+      },
+      "outputs": [],
+      "source": [
+        "import torch.distributed as dist\n",
+        "\n",
+        "def weighted_reduce(tensor, weight, dst=0):\n",
+        "    ########################################################################\n",
+        "    # (HIGHLIGHT) Collect accuracy and loss values from sub-processes and\n",
+        "    # obtain overall average values.\n",
+        "    #\n",
+        "    # `torch.distributed.reduce` is used to reduce tensors from all the\n",
+        "    # sub-processes to a specified process, ReduceOp.SUM is used by default.\n",
+        "    #\n",
+        "    # Because the GPUs may have differing numbers of processed items, we\n",
+        "    # perform a weighted mean to calculate the exact loss and accuracy.\n",
+        "    ########################################################################\n",
+        "    dist.reduce(tensor=tensor, dst=dst)\n",
+        "    weight = torch.tensor(weight, device=tensor.device)\n",
+        "    dist.reduce(tensor=weight, dst=dst)\n",
+        "    return tensor / weight"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fV6epnRxbZl4"
+      },
+      "source": [
+        "## Defining Model\n",
+        "Let’s consider training a 3-layer GraphSAGE with neighbor sampling. The model can be written as follows:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "id": "ft9Ldg-yEsa5"
+      },
+      "outputs": [],
+      "source": [
+        "from torch import nn\n",
+        "import torch.nn.functional as F\n",
+        "from dgl.nn import SAGEConv\n",
+        "\n",
+        "class SAGE(nn.Module):\n",
+        "    def __init__(self, in_size, hidden_size, out_size):\n",
+        "        super().__init__()\n",
+        "        self.layers = nn.ModuleList()\n",
+        "        # Three-layer GraphSAGE-mean.\n",
+        "        self.layers.append(SAGEConv(in_size, hidden_size, \"mean\"))\n",
+        "        self.layers.append(SAGEConv(hidden_size, hidden_size, \"mean\"))\n",
+        "        self.layers.append(SAGEConv(hidden_size, out_size, \"mean\"))\n",
+        "        self.dropout = nn.Dropout(0.5)\n",
+        "        self.hidden_size = hidden_size\n",
+        "        self.out_size = out_size\n",
+        "        # Set the dtype for the layers manually.\n",
+        "        self.float()\n",
+        "\n",
+        "    def forward(self, blocks, x):\n",
+        "        hidden_x = x\n",
+        "        for layer_idx, (layer, block) in enumerate(zip(self.layers, blocks)):\n",
+        "            hidden_x = layer(block, hidden_x)\n",
+        "            is_last_layer = layer_idx == len(self.layers) - 1\n",
+        "            if not is_last_layer:\n",
+        "                hidden_x = F.relu(hidden_x)\n",
+        "                hidden_x = self.dropout(hidden_x)\n",
+        "        return hidden_x"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "CjuvDKDVGbPW"
+      },
+      "source": [
+        "## Evaluation function\n",
+        "\n",
+        "The evaluation function can be used to calculate the validation accuracy during training or the testing accuracy at the end of the training. The difference from\n",
+        "the previous tutorial is that we need to return the number of items processed\n",
+        "by each GPU to take a weighted average."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "id": "j4djoX9tG7Ib"
+      },
+      "outputs": [],
+      "source": [
+        "import torchmetrics.functional as MF\n",
+        "import tqdm\n",
+        "\n",
+        "@torch.no_grad()\n",
+        "def evaluate(rank, model, graph, features, itemset, num_classes, device):\n",
+        "    model.eval()\n",
+        "    y = []\n",
+        "    y_hats = []\n",
+        "    dataloader = create_dataloader(\n",
+        "        graph,\n",
+        "        features,\n",
+        "        itemset,\n",
+        "        device,\n",
+        "        is_train=False,\n",
+        "    )\n",
+        "\n",
+        "    for data in tqdm.tqdm(dataloader) if rank == 0 else dataloader:\n",
+        "        blocks = data.blocks\n",
+        "        x = data.node_features[\"feat\"]\n",
+        "        y.append(data.labels)\n",
+        "        y_hats.append(model.module(blocks, x))\n",
+        "\n",
+        "    res = MF.accuracy(\n",
+        "        torch.cat(y_hats),\n",
+        "        torch.cat(y),\n",
+        "        task=\"multiclass\",\n",
+        "        num_classes=num_classes,\n",
+        "    )\n",
+        "\n",
+        "    return res.to(device), sum(y_i.size(0) for y_i in y)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "kN5BbnR4HSU2"
+      },
+      "source": [
+        "## Training Loop\n",
+        "\n",
+        "The training loop is almost identical to the previous tutorial. In this tutorial, we explicitly disable uneven inputs coming from the dataloader, however, the Join Context Manager could be used to train possibly with incomplete batches at the end of epochs. Please refer to [this tutorial](https://pytorch.org/tutorials/advanced/generic_join.html) for more information."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {
+        "id": "bdOceP3yH-eI"
+      },
+      "outputs": [],
+      "source": [
+        "import time\n",
+        "\n",
+        "def train(\n",
+        "    rank,\n",
+        "    graph,\n",
+        "    features,\n",
+        "    train_set,\n",
+        "    valid_set,\n",
+        "    num_classes,\n",
+        "    model,\n",
+        "    device,\n",
+        "):\n",
+        "    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n",
+        "    # Create training data loader.\n",
+        "    dataloader = create_dataloader(\n",
+        "        graph,\n",
+        "        features,\n",
+        "        train_set,\n",
+        "        device,\n",
+        "        is_train=True,\n",
+        "    )\n",
+        "\n",
+        "    for epoch in range(5):\n",
+        "        epoch_start = time.time()\n",
+        "\n",
+        "        model.train()\n",
+        "        total_loss = torch.tensor(0, dtype=torch.float, device=device)\n",
+        "        num_train_items = 0\n",
+        "        for data in tqdm.tqdm(dataloader) if rank == 0 else dataloader:\n",
+        "            # The input features are from the source nodes in the first\n",
+        "            # layer's computation graph.\n",
+        "            x = data.node_features[\"feat\"]\n",
+        "\n",
+        "            # The ground truth labels are from the destination nodes\n",
+        "            # in the last layer's computation graph.\n",
+        "            y = data.labels\n",
+        "\n",
+        "            blocks = data.blocks\n",
+        "\n",
+        "            y_hat = model(blocks, x)\n",
+        "\n",
+        "            # Compute loss.\n",
+        "            loss = F.cross_entropy(y_hat, y)\n",
+        "\n",
+        "            optimizer.zero_grad()\n",
+        "            loss.backward()\n",
+        "            optimizer.step()\n",
+        "\n",
+        "            total_loss += loss.detach() * y.size(0)\n",
+        "            num_train_items += y.size(0)\n",
+        "\n",
+        "        # Evaluate the model.\n",
+        "        if rank == 0:\n",
+        "            print(\"Validating...\")\n",
+        "        acc, num_val_items = evaluate(\n",
+        "            rank,\n",
+        "            model,\n",
+        "            graph,\n",
+        "            features,\n",
+        "            valid_set,\n",
+        "            num_classes,\n",
+        "            device,\n",
+        "        )\n",
+        "        total_loss = weighted_reduce(total_loss, num_train_items)\n",
+        "        acc = weighted_reduce(acc * num_val_items, num_val_items)\n",
+        "\n",
+        "        # We synchronize before measuring the epoch time.\n",
+        "        torch.cuda.synchronize()\n",
+        "        epoch_end = time.time()\n",
+        "        if rank == 0:\n",
+        "            print(\n",
+        "                f\"Epoch {epoch:05d} | \"\n",
+        "                f\"Average Loss {total_loss.item():.4f} | \"\n",
+        "                f\"Accuracy {acc.item():.4f} | \"\n",
+        "                f\"Time {epoch_end - epoch_start:.4f}\"\n",
+        "            )"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "mA-Xu37uIHc4"
+      },
+      "source": [
+        "## Defining Training and Evaluation Procedures\n",
+        "\n",
+        "The following code defines the main function for each process. It is similar to the previous tutorial except that we need to initialize a distributed training context with `torch.distributed` and wrap the model with `torch.nn.parallel.DistributedDataParallel`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "metadata": {
+        "id": "sW__HeslIMTT"
+      },
+      "outputs": [],
+      "source": [
+        "def run(rank, world_size, devices, dataset):\n",
+        "    # Set up multiprocessing environment.\n",
+        "    device = devices[rank]\n",
+        "    torch.cuda.set_device(device)\n",
+        "    dist.init_process_group(\n",
+        "        backend=\"nccl\",  # Use NCCL backend for distributed GPU training\n",
+        "        init_method=\"tcp://127.0.0.1:12345\",\n",
+        "        world_size=world_size,\n",
+        "        rank=rank,\n",
+        "    )\n",
+        "\n",
+        "    # Pin the graph and features in-place to enable GPU access.\n",
+        "    graph = dataset.graph.pin_memory_()\n",
+        "    features = dataset.feature.pin_memory_()\n",
+        "    train_set = dataset.tasks[0].train_set\n",
+        "    valid_set = dataset.tasks[0].validation_set\n",
+        "    num_classes = dataset.tasks[0].metadata[\"num_classes\"]\n",
+        "\n",
+        "    in_size = features.size(\"node\", None, \"feat\")[0]\n",
+        "    hidden_size = 256\n",
+        "    out_size = num_classes\n",
+        "\n",
+        "    # Create GraphSAGE model. It should be copied onto a GPU as a replica.\n",
+        "    model = SAGE(in_size, hidden_size, out_size).to(device)\n",
+        "    model = nn.parallel.DistributedDataParallel(model)\n",
+        "\n",
+        "    # Model training.\n",
+        "    if rank == 0:\n",
+        "        print(\"Training...\")\n",
+        "    train(\n",
+        "        rank,\n",
+        "        graph,\n",
+        "        features,\n",
+        "        train_set,\n",
+        "        valid_set,\n",
+        "        num_classes,\n",
+        "        model,\n",
+        "        device,\n",
+        "    )\n",
+        "\n",
+        "    # Test the model.\n",
+        "    if rank == 0:\n",
+        "        print(\"Testing...\")\n",
+        "    test_set = dataset.tasks[0].test_set\n",
+        "    test_acc, num_test_items = evaluate(\n",
+        "        rank,\n",
+        "        model,\n",
+        "        graph,\n",
+        "        features,\n",
+        "        itemset=test_set,\n",
+        "        num_classes=num_classes,\n",
+        "        device=device,\n",
+        "    )\n",
+        "    test_acc = weighted_reduce(test_acc * num_test_items, num_test_items)\n",
+        "\n",
+        "    if rank == 0:\n",
+        "        print(f\"Test Accuracy {test_acc.item():.4f}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qMzt0aBFIfbS"
+      },
+      "source": [
+        "## Spawning Trainer Processes\n",
+        "\n",
+        "The following code spawns a process for each GPU and calls the run function defined above."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "id": "5Dt95eSVIiyM"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Training with 1 gpus.\n",
+            "The dataset is already preprocessed.\n",
+            "Training...\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "192it [00:09, 21.32it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Validating...\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "39it [00:00, 78.32it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Epoch 00000 | Average Loss 1.2953 | Accuracy 0.8556 | Time 9.5520\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "192it [00:03, 61.08it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Validating...\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "39it [00:00, 79.10it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Epoch 00001 | Average Loss 0.5859 | Accuracy 0.8788 | Time 3.6609\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "192it [00:03, 62.82it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Validating...\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "39it [00:00, 80.55it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Epoch 00002 | Average Loss 0.4858 | Accuracy 0.8852 | Time 3.5646\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "192it [00:03, 60.34it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Validating...\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "39it [00:00, 44.41it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Epoch 00003 | Average Loss 0.4407 | Accuracy 0.8920 | Time 4.0852\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "192it [00:03, 58.87it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Validating...\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "39it [00:00, 78.52it/s]\n"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Epoch 00004 | Average Loss 0.4122 | Accuracy 0.8943 | Time 3.7938\n",
+            "Testing...\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "2162it [00:24, 89.75it/s]"
+          ]
+        },
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Test Accuracy 0.7514\n"
+          ]
+        },
+        {
+          "name": "stderr",
+          "output_type": "stream",
+          "text": [
+            "\n"
+          ]
+        }
+      ],
+      "source": [
+        "import torch.multiprocessing as mp\n",
+        "\n",
+        "def main():\n",
+        "    if not torch.cuda.is_available():\n",
+        "        print(\"No GPU found!\")\n",
+        "        return\n",
+        "\n",
+        "    devices = [\n",
+        "        torch.device(f\"cuda:{i}\") for i in range(torch.cuda.device_count())\n",
+        "    ][:1]\n",
+        "    world_size = len(devices)\n",
+        "\n",
+        "    print(f\"Training with {world_size} gpus.\")\n",
+        "\n",
+        "    # Load and preprocess dataset.\n",
+        "    dataset = gb.BuiltinDataset(\"ogbn-products\").load()\n",
+        "\n",
+        "    # Thread limiting to avoid resource competition.\n",
+        "    os.environ[\"OMP_NUM_THREADS\"] = str(mp.cpu_count() // 2 // world_size)\n",
+        "\n",
+        "    if world_size > 1:\n",
+        "        # The following launch method is not supported in a notebook.\n",
+        "        mp.set_sharing_strategy(\"file_system\")\n",
+        "        mp.spawn(\n",
+        "            run,\n",
+        "            args=(world_size, devices, dataset),\n",
+        "            nprocs=world_size,\n",
+        "            join=True,\n",
+        "        )\n",
+        "    else:\n",
+        "        run(0, 1, devices, dataset)\n",
+        "\n",
+        "\n",
+        "if __name__ == \"__main__\":\n",
+        "    main()"
+      ]
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "gpuType": "T4",
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}