Unverified commit 5472cd41 authored by Muhammed Fatih BALIN, committed by GitHub

[GraphBolt][CUDA] Node classification and Link prediction tutorials GPU sampling support. (#7142)

parent e5263013
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Ow8CQmZIV8Yn"
},
"source": [
"# Link Prediction\n",
"\n",
@@ -30,34 +18,35 @@
"\n",
"- Train a GNN model for link prediction on target device with DGL's\n",
" neighbor sampling components.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "onVijYWpWlMj"
},
"source": [
"## Install DGL package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QcpjTazg6hEo"
},
"outputs": [],
"source": [
"# Install required packages.\n",
"import os\n",
"import torch\n",
"import numpy as np\n",
"os.environ['TORCH'] = torch.__version__\n",
"os.environ['DGLBACKEND'] = \"pytorch\"\n",
"\n",
"# Install the CUDA version. If you want to install the CPU version, please\n",
"# refer to https://www.dgl.ai/pages/start.html.\n",
"device = torch.device(\"cuda\")\n",
"!pip install --pre dgl -f https://data.dgl.ai/wheels-test/cu121/repo.html\n",
"\n",
"try:\n",
" import dgl\n",
@@ -67,61 +56,59 @@
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OOKZxxT7W1Rz"
},
"source": [
"## Loading Dataset\n",
"`cora` is already prepared as `BuiltinDataset` in **GraphBolt**.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RnJkkSKhWiUG"
},
"outputs": [],
"source": [
"dataset = gb.BuiltinDataset(\"cora\").load()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WxnTMEQXXKsM"
},
"source": [
"The dataset consists of a graph, features, and tasks. You can get the training, validation, and test sets from the tasks. Seed nodes and corresponding labels are already stored in each set. This dataset contains two tasks: one for node classification and the other for link prediction. We will use the link prediction task."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YCm8CGkOX9lK"
},
"outputs": [],
"source": [
"graph = dataset.graph.to(device)\n",
"feature = dataset.feature.to(device)\n",
"train_set = dataset.tasks[1].train_set\n",
"test_set = dataset.tasks[1].test_set\n",
"task_name = dataset.tasks[1].metadata[\"name\"]\n",
"print(f\"Task: {task_name}.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2y-P5omQYP00"
},
"source": [
"## Defining Neighbor Sampler and Data Loader in DGL\n",
"Unlike the full-graph link prediction tutorial, a common practice for training a GNN on large graphs is to iterate over the edges in minibatches, since computing the probability of all edges is usually infeasible. For each minibatch of edges, you compute the output representations of their incident nodes using neighbor sampling and a GNN, in a fashion similar to the one introduced in the node classification tutorial.\n",
@@ -130,63 +117,66 @@
"\n",
"Except for the negative sampler, the rest of the code is identical to the node classification tutorial.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LZgXGfBvYijJ"
},
"outputs": [],
"source": [
"from functools import partial\n",
"def create_train_dataloader():\n",
" datapipe = gb.ItemSampler(train_set, batch_size=256, shuffle=True)\n",
" datapipe = datapipe.copy_to(device)\n",
" datapipe = datapipe.sample_uniform_negative(graph, 5)\n",
" datapipe = datapipe.sample_neighbor(graph, [5, 5])\n",
" datapipe = datapipe.transform(partial(gb.exclude_seed_edges, include_reverse_edges=True))\n",
" datapipe = datapipe.fetch_feature(feature, node_feature_keys=[\"feat\"])\n",
" return gb.DataLoader(datapipe)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5sU_aulqYkwK"
},
"source": [
"You can peek at one minibatch from the train dataloader to see what it will give you.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "euEdzmerYmZi"
},
"outputs": [],
"source": [
"data = next(iter(create_train_dataloader()))\n",
"print(f\"MiniBatch: {data}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WYQqfrDWYtU0"
},
"source": [
"## Defining Model for Node Representation\n",
"Let’s consider training a 2-layer GraphSAGE with neighbor sampling. The model can be written as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0qQbBwO7Y3-Q"
},
"outputs": [],
"source": [
"import dgl.nn as dglnn\n",
"import torch.nn as nn\n",
@@ -214,55 +204,55 @@
" if not is_last_layer:\n",
" hidden_x = F.relu(hidden_x)\n",
" return hidden_x"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y23JppwHY5MC"
},
"source": [
"## Defining Training Loop\n",
"The following initializes the model and defines the optimizer.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "omSIB_ePZACg"
},
"outputs": [],
"source": [
"in_size = feature.size(\"node\", None, \"feat\")[0]\n",
"model = SAGE(in_size, 128).to(device)\n",
"optimizer = torch.optim.Adam(model.parameters(), lr=0.001)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QyWtzNZcZRgp"
},
"source": [
"The following is the training loop for link prediction and evaluation.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SccLVrjSZSkd"
},
"outputs": [],
"source": [
"import tqdm\n",
"for epoch in range(3):\n",
" model.train()\n",
" total_loss = 0\n",
" for step, data in tqdm.tqdm(enumerate(create_train_dataloader())):\n",
" # Get node pairs with labels for loss calculation.\n",
" compacted_pairs, labels = data.node_pairs_with_labels\n",
" node_feature = data.node_features[\"feat\"]\n",
@@ -284,33 +274,33 @@
" total_loss += loss.item()\n",
"\n",
" print(f\"Epoch {epoch:03d} | Loss {total_loss / (step + 1):.3f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pxow2XSkZXoO"
},
"source": [
"## Evaluating Performance with Link Prediction\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IMulfsnIZZVh"
},
"outputs": [],
"source": [
"model.eval()\n",
"\n",
"datapipe = gb.ItemSampler(test_set, batch_size=256, shuffle=False)\n",
"datapipe = datapipe.copy_to(device)\n",
"# Since we need to use all neighborhoods for evaluation, we set the fanout\n",
"# to -1.\n",
"datapipe = datapipe.sample_neighbor(graph, [-1, -1])\n",
"datapipe = datapipe.fetch_feature(feature, node_feature_keys=[\"feat\"])\n",
"eval_dataloader = gb.DataLoader(datapipe, num_workers=0)\n",
"\n",
"logits = []\n",
@@ -342,22 +332,41 @@
"\n",
"auc = roc_auc_score(labels.cpu(), logits.cpu())\n",
"print(\"Link Prediction AUC:\", auc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KoCoIvqAZeCS"
},
"source": [
"## Conclusion\n",
"In this tutorial, you have learned how to train a multi-layer GraphSAGE for link prediction with neighbor sampling."
]
}
],
"metadata": {
"colab": {
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
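The link prediction pipeline in the diff above draws 5 uniformly random negative edges per positive edge via `sample_uniform_negative(graph, 5)`. The sketch below illustrates the idea in plain Python over a toy edge list; `uniform_negative_sample` is a hypothetical helper for illustration, not the GraphBolt implementation:

```python
import random

def uniform_negative_sample(pos_edges, num_nodes, ratio, rng):
    """For each positive (src, dst) edge, keep src and draw `ratio`
    corrupted destination nodes uniformly from all node IDs."""
    neg_edges = []
    for src, _dst in pos_edges:
        for _ in range(ratio):
            neg_edges.append((src, rng.randrange(num_nodes)))
    return neg_edges

rng = random.Random(0)
pos = [(0, 1), (2, 3)]  # toy positive edges
neg = uniform_negative_sample(pos, num_nodes=10, ratio=5, rng=rng)
```

With `batch_size=256` and a ratio of 5 as in the tutorial, each minibatch contributes 256 positive and 1280 negative node pairs to the binary loss.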
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "OxbY2KlG4ZfJ"
},
"source": [
"# Node Classification\n",
"This tutorial shows how to train a multi-layer GraphSAGE for node\n",
@@ -30,22 +18,24 @@
"\n",
"- Train a GNN model for node classification on a single GPU with DGL's\n",
" neighbor sampling components."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mzZKrVVk6Y_8"
},
"source": [
"## Install DGL package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QcpjTazg6hEo"
},
"outputs": [],
"source": [
"# Install required packages.\n",
"import os\n",
@@ -54,10 +44,10 @@
"os.environ['TORCH'] = torch.__version__\n",
"os.environ['DGLBACKEND'] = \"pytorch\"\n",
"\n",
"# Install the CUDA version. If you want to install the CPU version, please\n",
"# refer to https://www.dgl.ai/pages/start.html.\n",
"device = torch.device(\"cuda\")\n",
"!pip install --pre dgl -f https://data.dgl.ai/wheels-test/cu121/repo.html\n",
"\n",
"try:\n",
" import dgl\n",
@@ -67,158 +57,159 @@
" installed = False\n",
" print(error)\n",
"print(\"DGL installed!\" if installed else \"DGL not found!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XWdRZAM-51Cb"
},
"source": [
"## Loading Dataset\n",
"`ogbn-arxiv` is already prepared as `BuiltinDataset` in **GraphBolt**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RnJkkSKhWiUG"
},
"outputs": [],
"source": [
"dataset = gb.BuiltinDataset(\"ogbn-arxiv\").load()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S8avoKBiXA9j"
},
"source": [
"The dataset consists of a graph, features, and tasks. You can get the training, validation, and test sets from the tasks. Seed nodes and corresponding labels are already stored in each set. Other metadata, such as the number of classes, is also stored in the tasks. In this dataset, there is only one task: `node classification`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IXGZmgIaXJWQ"
},
"outputs": [],
"source": [
"graph = dataset.graph.to(device)\n",
"feature = dataset.feature.to(device)\n",
"train_set = dataset.tasks[0].train_set\n",
"valid_set = dataset.tasks[0].validation_set\n",
"test_set = dataset.tasks[0].test_set\n",
"task_name = dataset.tasks[0].metadata[\"name\"]\n",
"num_classes = dataset.tasks[0].metadata[\"num_classes\"]\n",
"print(f\"Task: {task_name}. Number of classes: {num_classes}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y8yn77Kg6HkW"
},
"source": [
"## How DGL Handles Computation Dependency\n",
"The computation dependency for message passing of a single node can be described as a series of message flow graphs (MFGs).\n",
"\n",
"![DGL Computation](https://data.dgl.ai/tutorial/img/bipartite.gif)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q7GrcJTnZQjt"
},
"source": [
"## Defining Neighbor Sampler and Data Loader in DGL\n",
"\n",
"DGL provides tools to iterate over the dataset in minibatches while generating the computation dependencies needed to compute their outputs with the MFGs above. For node classification, you can use `dgl.graphbolt.DataLoader` to iterate over the dataset. It accepts a data pipe that generates minibatches of nodes and their labels, samples neighbors for each node, and generates the computation dependencies in the form of MFGs. Feature fetching, block creation, and copying to the target device are also supported. All these operations are split into separate stages in the data pipe, so that you can customize the data pipeline by inserting your own operations.\n",
"\n",
"Let’s say that each node will gather messages from 4 neighbors on each layer. The code defining the data loader and neighbor sampler will look like the following.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yQVYDO0ZbBvi"
},
"outputs": [],
"source": [
"def create_dataloader(itemset, shuffle):\n",
" datapipe = gb.ItemSampler(itemset, batch_size=1024, shuffle=shuffle)\n",
" datapipe = datapipe.copy_to(device, extra_attrs=[\"seed_nodes\"])\n",
" datapipe = datapipe.sample_neighbor(graph, [4, 4])\n",
" datapipe = datapipe.fetch_feature(feature, node_feature_keys=[\"feat\"])\n",
" return gb.DataLoader(datapipe)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7Rp12SUhbEV1"
},
"source": [
"You can iterate over the data loader; each iteration yields a `MiniBatch` object.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V7vQiKj2bL_o"
},
"outputs": [],
"source": [
"data = next(iter(create_dataloader(train_set, shuffle=True)))\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-eBuPnT-bS-o"
},
"source": [
"You can get the input node IDs from the MFGs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bN4sgZqFbUvd"
},
"outputs": [],
"source": [
"mfgs = data.blocks\n",
"input_nodes = mfgs[0].srcdata[dgl.NID]\n",
"print(f\"Input nodes: {input_nodes}.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fV6epnRxbZl4"
},
"source": [
"## Defining Model\n",
"Let’s consider training a 2-layer GraphSAGE with neighbor sampling. The model can be written as follows:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iKhEIL0Ccmwx"
},
"outputs": [],
"source": [
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
@@ -241,73 +232,69 @@
"\n",
"in_size = feature.size(\"node\", None, \"feat\")[0]\n",
"model = Model(in_size, 64, num_classes).to(device)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OGLN3kCcwCA8"
},
"source": [
"## Defining Training Loop\n",
"\n",
"The following initializes the model and defines the optimizer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dET8i_hewLUi"
},
"outputs": [],
"source": [
"opt = torch.optim.Adam(model.parameters())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "leZvFP4GwMcq"
},
"source": [
"When computing the validation score for model selection, you can usually also use neighbor sampling. We can simply reuse our `create_dataloader` function to create two separate dataloaders for training and validation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Gvd7vFWZwQI5"
},
"outputs": [],
"source": [
"train_dataloader = create_dataloader(train_set, shuffle=True)\n",
"valid_dataloader = create_dataloader(valid_set, shuffle=False)\n",
"\n",
"import sklearn.metrics"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nTIIfVMDwXqX"
},
"source": [
"The following is a training loop that performs validation every epoch. It also saves the model with the best validation accuracy into a file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wsfqhKUvwZEj"
},
"outputs": [],
"source": [
"import tqdm\n",
"\n",
@@ -348,27 +335,45 @@
" predictions = np.concatenate(predictions)\n",
" labels = np.concatenate(labels)\n",
" accuracy = sklearn.metrics.accuracy_score(labels, predictions)\n",
" print(\"Epoch {} Validation Accuracy {}\".format(epoch, accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kmHnUI0QwfJ4"
},
"source": [
"## Conclusion\n",
"\n",
"In this tutorial, you have learned how to train a multi-layer GraphSAGE with neighbor sampling.\n"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
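The node classification pipeline above samples 4 neighbors per seed at each of its 2 layers (`sample_neighbor(graph, [4, 4])`), and a fanout of -1, as in the link prediction evaluation pipeline, means taking the full neighborhood. A plain-Python sketch of layered fanout sampling over a toy adjacency dict (illustrative only; `sample_layers` is a hypothetical helper, not the DGL API):

```python
import random

# Toy undirected graph as an adjacency dict.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

def sample_layers(seeds, fanouts, rng):
    """Sample one edge list per layer, starting from the seed nodes.
    A fanout of -1 means taking the full neighborhood of each node."""
    layers = []
    frontier = set(seeds)
    for fanout in fanouts:
        edges = []
        next_frontier = set()
        for dst in frontier:
            neighbors = adj[dst]
            k = len(neighbors) if fanout == -1 else min(fanout, len(neighbors))
            for src in rng.sample(neighbors, k):
                edges.append((src, dst))  # message flows src -> dst
                next_frontier.add(src)
        layers.append(edges)
        frontier = next_frontier
    return layers

layers = sample_layers([0], [2, 2], random.Random(0))
```

Each layer's sampled edges correspond to one MFG; the frontier of one layer becomes the destination set of the next, which is why the number of input nodes grows with the number of layers.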