# DGL-Enter

DGL-Enter is a command line tool that lets users quickly bootstrap models on multiple datasets, and it provides full flexibility to customize the pipeline for your own tasks.

## Installation guide

You can install DGL-Enter easily with `pip install dglenter`. You should then be able to use DGL-Enter from your command line by typing `dgl-enter`:

```
Usage: dgl-enter [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  config  Generate the config files
  export  Export the python file from config
  train   Train the model
```

## Train GraphSAGE on Cora from scratch

Here we'll use GraphSAGE, one of the most classic models, together with the Cora citation graph dataset as an example, to show how easy it is to train a model with DGL-Enter.

### Step 1: Use `dgl-enter config` to generate a YAML configuration file

Run `dgl-enter config nodepred --data cora --model sage --cfg cora_sage.yml`. You'll then get a configuration file `cora_sage.yml` that includes all the options to be tuned, annotated with comments. Optionally, you can change the config as you like to achieve better performance. Below is a modified sample based on the template generated by the command above; the early-stopping part is removed for simplicity.

```yaml
version: 0.0.1
pipeline_name: nodepred
device: cpu
data:
  name: cora
  split_ratio: # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use the builtin split in the original dataset
model:
  name: sage
  embed_size: -1 # The dimension of the created embedding table. -1 means using the original node embedding
  hidden_size: 16 # Hidden size
  num_layers: 1 # Number of hidden layers
  activation: relu # Activation function name under torch.nn.functional
  dropout: 0.5 # Dropout rate
  aggregator_type: gcn # Aggregator type to use (``mean``, ``gcn``, ``pool``, ``lstm``)
general_pipeline:
  num_epochs: 200 # Number of training epochs
  eval_period: 5 # Interval epochs between evaluations
  optimizer:
    name: Adam
    lr: 0.01
    weight_decay: 0.0005
  loss: CrossEntropyLoss
  num_runs: 1 # Number of experiments to run
```

### Step 2: Use `dgl-enter train` to initiate the training process

Simply running `dgl-enter train --cfg cora_sage.yml` will start the training process:

```log
...
Epoch 00190 | Loss 1.5225 | TrainAcc 0.9500 | ValAcc 0.6840
Epoch 00191 | Loss 1.5416 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00192 | Loss 1.5391 | TrainAcc 0.9357 | ValAcc 0.6840
Epoch 00193 | Loss 1.5257 | TrainAcc 0.9643 | ValAcc 0.6840
Epoch 00194 | Loss 1.5196 | TrainAcc 0.9286 | ValAcc 0.6840
EarlyStopping counter: 12 out of 20
Epoch 00195 | Loss 1.4862 | TrainAcc 0.9643 | ValAcc 0.6760
Epoch 00196 | Loss 1.5142 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00197 | Loss 1.5145 | TrainAcc 0.9714 | ValAcc 0.6760
Epoch 00198 | Loss 1.5174 | TrainAcc 0.9571 | ValAcc 0.6760
Epoch 00199 | Loss 1.5235 | TrainAcc 0.9714 | ValAcc 0.6760
Test Accuracy 0.7740
Accuracy across 1 runs: 0.774 ± 0.0
```

That's all! Basically, you only need two commands to train a graph neural network.

## Debug your model and advanced customization

That's not everything yet. We believe you may want to change more than the configuration file: modify the training pipeline, calculate new metrics, or look into the code for details. DGL-Enter can export a self-contained, runnable Python script for you to do anything you like. Try `dgl-enter export --cfg cora_sage.yml --output script.py`, and you'll get the script used to train the model, like magic!
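The elided parts of the export (marked `...` in the excerpt below) contain the imports and helpers the script relies on, including an `accuracy` function. If you are adapting the script, that helper behaves roughly like the following sketch (our approximation for illustration, not the verbatim generated code):

```python
import torch

def accuracy(logits, labels):
    # Fraction of examples whose argmax prediction matches the label.
    preds = logits.argmax(dim=1)
    return (preds == labels).float().mean().item()
```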
Below is an excerpt of the exported script:

```python
...

def train(cfg, pipeline_cfg, device, data, model, optimizer, loss_fcn):
    g = data[0]  # Only train on the first graph
    g = dgl.remove_self_loop(g)
    g = dgl.add_self_loop(g)
    g = g.to(device)
    node_feat = g.ndata.get('feat', None)
    edge_feat = g.edata.get('feat', None)
    label = g.ndata['label']
    # Boolean masks marking the train/val/test node splits
    train_mask = g.ndata['train_mask'].bool()
    val_mask = g.ndata['val_mask'].bool()
    test_mask = g.ndata['test_mask'].bool()

    val_acc = 0.
    for epoch in range(pipeline_cfg['num_epochs']):
        model.train()
        logits = model(g, node_feat, edge_feat)
        loss = loss_fcn(logits[train_mask], label[train_mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_acc = accuracy(logits[train_mask], label[train_mask])
        # Evaluate on the validation set every `eval_period` epochs
        if epoch != 0 and epoch % pipeline_cfg['eval_period'] == 0:
            val_acc = accuracy(logits[val_mask], label[val_mask])
        print("Epoch {:05d} | Loss {:.4f} | TrainAcc {:.4f} | ValAcc {:.4f}".format(
            epoch, loss.item(), train_acc, val_acc))

    model.eval()
    with torch.no_grad():
        logits = model(g, node_feat, edge_feat)
        test_acc = accuracy(logits[test_mask], label[test_mask])
    return test_acc


def main():
    cfg = {
        'version': '0.0.1',
        'device': 'cpu',
        'data': {'split_ratio': None},
        'model': {
            'embed_size': -1,
            'hidden_size': 16,
            'num_layers': 1,
            'activation': 'relu',
            'dropout': 0.5,
            'aggregator_type': 'gcn'},
        'general_pipeline': {
            'num_epochs': 200,
            'eval_period': 5,
            'optimizer': {
                'lr': 0.01,
                'weight_decay': 0.0005},
            'loss': 'CrossEntropyLoss',
            'num_runs': 1}}

    device = cfg['device']
    pipeline_cfg = cfg['general_pipeline']
    # load data
    data = AsNodePredDataset(CoraGraphDataset())
    # create model
    model_cfg = cfg["model"]
    cfg["model"]["data_info"] = {
        # Use the raw feature size unless an embedding table is requested
        "in_size": model_cfg['embed_size'] if model_cfg['embed_size'] > 0 else data[0].ndata['feat'].shape[1],
        "out_size": data.num_classes,
        "num_nodes": data[0].num_nodes()
    }
    model = GraphSAGE(**cfg["model"])
    model = model.to(device)
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        model.parameters(), **pipeline_cfg["optimizer"])
    # train
    test_acc = train(cfg, pipeline_cfg, device, data, model, optimizer, loss)
    return test_acc

...
```

## Recipes

We've prepared a set of fine-tuned configs under `enter/recipes` that you can easily try to get reproducible results. For example, to use GCN with the PubMed dataset, you can use `enter/recipes/nodepred_pubmed_gcn.yml`. To try it, type `dgl-enter train --cfg recipes/nodepred_pubmed_gcn.yml` to train the model, or `dgl-enter export --cfg recipes/nodepred_pubmed_gcn.yml` to get the full training script.

## Use DGL-Enter on your own dataset

You can modify the generated script in any way you want. However, we also provide an end-to-end way to use your own dataset via our `CSVDataset`.

### Step 1: Prepare your CSV and metadata files

Following the tutorial at [Loading data from CSV files](https://docs.dgl.ai/en/latest/guide/data-loadcsv.html#guide-data-pipeline-loadcsv), prepare your own CSV dataset. It minimally includes three files: the node data CSV, the edge data CSV, and the metadata file (meta.yml).

```yml
dataset_name: my_csv_dataset
edge_data:
  - file_name: edges.csv
node_data:
  - file_name: nodes.csv
```

### Step 2: Choose the CSV dataset at the `dgl-enter config` stage

Try `dgl-enter config nodepred --data csv --model sage --cfg csv_sage.yml` to use the SAGE model on your dataset. You'll see that the data part is now the configuration related to the CSV dataset: `data_path` specifies the data folder, and `./` means the current folder.
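If you just want to smoke-test the pipeline before wiring up real data, you can generate placeholder files in that layout. Below is a hypothetical sketch using NumPy and pandas; the column names follow the CSV guide linked above, and all of the sizes and values are made up:

```python
# Hypothetical smoke-test data: a random graph saved in the CSVDataset
# layout (node_id/label/feat columns for nodes, src_id/dst_id for edges).
import numpy as np
import pandas as pd

num_nodes, num_edges, feat_dim, num_classes = 100, 500, 16, 7

pd.DataFrame({
    "node_id": np.arange(num_nodes),
    "label": np.random.randint(0, num_classes, num_nodes),
    # Each node's features are stored as one comma-separated string.
    "feat": [",".join(str(x) for x in np.random.rand(feat_dim))
             for _ in range(num_nodes)],
}).to_csv("nodes.csv", index=False)

pd.DataFrame({
    "src_id": np.random.randint(0, num_nodes, num_edges),
    "dst_id": np.random.randint(0, num_nodes, num_edges),
}).to_csv("edges.csv", index=False)
```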
If your dataset doesn't have a builtin train/val/test split on the nodes, you need to manually set the split ratio in the config file, and DGL will randomly generate the split for you.

```yml
data:
  name: csv
  split_ratio: # Ratio to generate split masks, for example set to [0.8, 0.1, 0.1] for 80% train/10% val/10% test. Leave blank to use the builtin split in the original dataset
  data_path: ./ # metadata.yaml, nodes.csv and edges.csv should be in this folder
```

### Step 3: `train` the model or `export` the script

Then you can do the same as in the tutorial above: either train the model with `dgl-enter train --cfg csv_sage.yml` or use `dgl-enter export --cfg csv_sage.yml --output my_dataset.py` to get the training script.

## API Reference

DGL-Enter is a new tool for users to bootstrap datasets and common models. Its entry point is `dgl-enter`, which has three subcommands: `config`, `train` and `export`.

### Config

The config stage generates a configuration file for a specific pipeline. `dgl-enter` currently provides 3 pipelines:

- nodepred (node prediction tasks; suitable for small datasets and prototyping)
- nodepred-ns (node prediction tasks with a sampling method; suitable for medium and large datasets)
- linkpred (link prediction tasks; predicts whether an edge exists between node pairs based on node features)

You can get the full list with `dgl-enter config --help`:

```
Usage: dgl-enter config [OPTIONS] COMMAND [ARGS]...

  Generate the config files

Options:
  --help  Show this message and exit.

Commands:
  linkpred     Link prediction pipeline
  nodepred     Node classification pipeline
  nodepred-ns  Node classification sampling pipeline
```

Each pipeline has different options to specify. For example, for the node prediction pipeline, you can run `dgl-enter config nodepred --help` and you'll get:

```
Usage: dgl-enter config nodepred [OPTIONS]

  Node classification pipeline

Options:
  --data [cora|citeseer|ogbl-collab|csv|reddit|co-buy-computer]
                                  input data name  [required]
  --cfg TEXT                      output configuration path  [default: cfg.yml]
  --model [gcn|gat|sage|sgc|gin]  Model name  [required]
  --device [cpu|cuda]             Device, cpu or cuda  [default: cpu]
  --help                          Show this message and exit.
```

You can always get detailed help information by adding `--help` to the command line.

### Train

You can train a model on the dataset based on the configuration file generated by `dgl-enter config`, using `dgl-enter train`.

```
Usage: dgl-enter train [OPTIONS]

  Train the model

Options:
  --cfg TEXT  yaml file name  [default: cfg.yml]
  --help      Show this message and exit.
```

### Export

Get the self-contained, runnable Python script derived from the configuration file with `dgl-enter export`.
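For example, reusing the configuration from the Cora tutorial above, you can export the script and then run it directly, since it is self-contained:

```
dgl-enter export --cfg cora_sage.yml --output script.py
python script.py
```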