"examples/git@developer.sourcefind.cn:OpenDAS/nni.git" did not exist on "3394f2c399bf8248931eb7b9f559a8be52ad9a07"
Commit 055885d9 authored by SparkSnail, committed by GitHub

Merge dev-adl2 into Master (#3117)

parent 2c5d89a7
...@@ -51,6 +51,7 @@ build/Release
# Dependency directories
node_modules/
jspm_packages/
**/package-lock.json

# TypeScript v1 declaration files
typings/
...@@ -81,6 +82,8 @@ __pycache__
build
*.egg-info
setup.pye
**/__init__.pye
**/.ipynb_checkpoints

# Environments
.env
......
# Run an Experiment on AdaptDL
NNI now supports running an experiment on [AdaptDL](https://github.com/petuum/adaptdl). Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or on [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. In AdaptDL mode, your trial program runs as an AdaptDL job in the Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
## Prerequisites for Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes [on Azure](https://azure.microsoft.com/en-us/services/kubernetes-service/), or [on-premises](https://kubernetes.io/docs/setup/) with [cephfs](https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd), or [microk8s with the storage add-on enabled](https://microk8s.io/docs/addons).
2. Install the **AdaptDL scheduler** into your Kubernetes cluster with Helm. Follow this [guideline](https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html) to set up the AdaptDL scheduler.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, the NNI manager uses $(HOME)/.kube/config as the kubeconfig path. You can also point to another kubeconfig file by setting the **KUBECONFIG** environment variable (see the example after this list). Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
5. (Optional) Prepare an **NFS server** and export a general-purpose mount as external storage.
6. Install **NNI**, following the install guide [here](../Tutorial/QuickStart.md).
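For example, if your kubeconfig lives somewhere other than the default location, you can point NNI to it before launching the experiment (the path below is a placeholder):
```bash
# Point NNI (and kubectl) at a non-default kubeconfig file.
export KUBECONFIG=/path/to/your/kubeconfig
```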
### Verify Prerequisites
```bash
nnictl --version
# Expected: <version_number>
```
```bash
kubectl version
# Expected that the kubectl client version matches the server version.
```
```bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
```
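Optionally, you can also confirm that the AdaptDL scheduler pods are up; the exact namespace depends on how you installed the scheduler:
```bash
kubectl get pods --all-namespaces | grep adaptdl
# Expected: the AdaptDL scheduler pod(s) in Running state.
```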
## Run an experiment
We provide a CIFAR10 example that fully leverages the AdaptDL scheduler in the `examples/trials/cifar10_pytorch` folder (`main_adl.py` and `config_adl.yaml`).
Here is a template configuration specification to use AdaptDL as a training service.
```yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
  builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
  adaptive: false # optional.
  image: <image_tag>
  imagePullSecrets: # optional
    - name: stagingsecret
  codeDir: .
  command: python main.py
  gpuNum: 1
  cpuNum: 1 # optional
  memorySize: 8Gi # optional
  nfs: # optional
    server: 10.20.41.55
    path: /
    containerMountPath: /nfs
  checkpoint: # optional
    storageClass: microk8s-hostpath
    storageSize: 1Gi
```
Configurations not mentioned below follow the
[default specs defined in the NNI doc](https://nni.readthedocs.io/en/latest/Tutorial/ExperimentConfig.html#configuration-spec).
* **trainingServicePlatform**: Choose `adl` to use the Kubernetes cluster with the AdaptDL scheduler.
* **nniManagerIp**: *Required* for the `adl` training service to send the correct info and metrics back from the cluster.
It is the IP address of the machine running the NNI manager (nnictl) that launches the NNI experiment.
* **logCollection**: *Recommended* to set as `http`. It collects the trial logs from the cluster back to your machine via HTTP.
* **tuner**: It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuner).
* **trial**: It defines the specs of an `adl` trial.
  * **adaptive**: (*Optional*) Boolean for the AdaptDL trainer. When `true`, the job is preemptible and adaptive.
  * **image**: Docker image for the trial.
  * **imagePullSecrets**: (*Optional*) If you are using a private registry,
  you need to provide the secret to successfully pull the image.
  * **codeDir**: the working directory of the container. `.` means the default working directory defined by the image.
  * **command**: the bash command that starts the trial.
  * **gpuNum**: the number of GPUs requested for this trial. It must be a non-negative integer.
  * **cpuNum**: (*Optional*) the number of CPUs requested for this trial. It must be a non-negative integer.
  * **memorySize**: (*Optional*) the amount of memory requested for this trial. It must follow the Kubernetes
  [default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
  * **nfs**: (*Optional*) mounts external storage. For more information about using NFS, please check the paragraph below.
  * **checkpoint**: (*Optional*) [storage settings](https://kubernetes.io/docs/concepts/storage/storage-classes/) for AdaptDL's internal checkpoints. You can leave it out unless you need to customize the checkpoint storage.
### NFS Storage
As you may have noticed in the configuration spec above,
an *optional* section is available to configure NFS external storage. It can be omitted when no external storage is required, for example when the Docker image already contains the code and data.
Note that the `adl` training service does NOT mount the NFS onto your local dev machine; you can mount it locally yourself to manage the filesystem, copy data or code, and so on.
The `adl` training service can then mount it into Kubernetes for every trial, given the proper configuration:
* **server**: NFS server address, e.g. an IP address or domain name
* **path**: NFS server export path, i.e. the absolute path on the NFS that can be mounted into trials
* **containerMountPath**: the absolute path inside the container at which the NFS **path** above is mounted,
so that every trial has access to the NFS.
Inside the trial containers, you access the NFS via this path.
Use cases:
* If your training trials depend on a large dataset, you may want to download it onto the NFS first,
and mount it so that it can be shared across multiple trials.
* Container storage is ephemeral and trial containers are deleted once a trial's lifecycle is over.
So if you want to export your trained models,
you can mount the NFS to the trial to persist and export them.
In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly per your needs; a minimal sketch of the workflow follows.
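For illustration, a sketch of the manual workflow described above, assuming the NFS server and export path from the example config, a local mount point of `/mnt/nfs`, and a hypothetical `datasets/cifar10` directory:
```bash
# On your local dev machine: mount the NFS export manually, then copy the dataset onto it.
sudo mkdir -p /mnt/nfs
sudo mount -t nfs 10.20.41.55:/ /mnt/nfs
cp -r ./cifar10 /mnt/nfs/datasets/
# Inside a trial container (with containerMountPath: /nfs), the same data is visible at:
#   /nfs/datasets/cifar10
```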
## Monitor via Log Stream
To follow the log stream of a certain trial:
```bash
nnictl log trial --trial_id=<trial_id>
```
or, for a specific experiment:
```bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
```
Note that *after* a trial has finished and its pod has been deleted,
no logs can be retrieved via this command.
However, you may still be able to access past trial logs: if `logCollection: http` is set in the experiment configuration,
the collected stdout of a finished trial remains available on the NNI manager machine, as sketched below.
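A sketch of where to find the collected log, based on the path printed by the `nnictl` helper added in this commit (the experiment and trial IDs are placeholders):
```bash
# Assumes logCollection: http was set when the experiment was started.
cat ~/nni-experiments/<experiment_id>/trials/<trial_id>/stdout_log_collection.log
```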
## Monitor via TensorBoard
In the context of NNI, an experiment has multiple trials.
To make it easy to compare trials during a model-tuning process,
we support TensorBoard integration: each experiment has
its own TensorBoard logging directory, and thus its own dashboard.
You can only use TensorBoard while the monitored experiment is running;
monitoring stopped experiments is not supported.
Inside the trial container, you have access to two environment variables:
* `ADAPTDL_TENSORBOARD_LOGDIR`: the TensorBoard logging directory for the current experiment,
* `NNI_TRIAL_JOB_ID`: the `trial` job id for the current trial.
It is recommended to join them to form the per-trial logging directory,
for example in Python:
```python
import os
tensorboard_logdir = os.path.join(
    os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
    os.getenv("NNI_TRIAL_JOB_ID")
)
```
If an experiment is stopped, the data logged here
(under the directory defined by *the above envs*, and monitored with the following commands)
will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS)
to export it and view TensorBoard locally.
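For example, a sketch that copies the per-trial logs onto the NFS at the end of a trial so they can be viewed locally afterwards. It assumes the NFS is mounted at `/nfs` inside the container (as in the example config) and at `/mnt/nfs` on your local machine; the `tensorboard-exports` directory is a placeholder:
```bash
# Inside the trial, at the end of the run: copy the TensorBoard logs onto the NFS mount.
mkdir -p /nfs/tensorboard-exports
cp -r "${ADAPTDL_TENSORBOARD_LOGDIR}/${NNI_TRIAL_JOB_ID}" /nfs/tensorboard-exports/
# Later, on your local machine (with the same NFS mounted at /mnt/nfs):
tensorboard --logdir /mnt/nfs/tensorboard-exports
```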
With the above setting, you can easily monitor the experiment
via TensorBoard by running
```bash
nnictl tensorboard start
```
If you have multiple experiments running at the same time, you may use
```bash
nnictl tensorboard start <experiment_id>
```
The command prints the web URL for accessing TensorBoard.
Note that you have the flexibility to choose the local `--port`
used for TensorBoard.
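For instance, to serve the dashboard on a non-default local port (the port number below is arbitrary):
```bash
nnictl tensorboard start <experiment_id> --port 6007
```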
...@@ -4,7 +4,7 @@
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use training service provided by NNI, to run trial jobs on [local machine](./LocalMode.md), [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [AdaptDL](./AdaptDLMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*.
If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to "[how to implement training service](./HowToImplementTrainingService)" for details.
...@@ -24,6 +24,7 @@ In case users intend to use large files in their experiment (like large-scaled d
|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.|
|[__PAI__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.|
|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
|[__AdaptDL__](./AdaptDLMode.md)|NNI supports running experiment on [AdaptDL](https://github.com/petuum/adaptdl), called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.|
|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
|[__DLTS__](./DLTSMode.md)|NNI supports running experiment using [DLTS](https://github.com/microsoft/DLWorkspace.git), which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.|
|[__AML__](./AMLMode.md)|NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/) , called aml mode.
......
...@@ -260,6 +260,8 @@ Specifies the platform to run the experiment, including __local__, __remote__, _
* __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), NNI support kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For detail please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md)
* __adl__ submit trial jobs to [AdaptDL](https://github.com/petuum/adaptdl), NNI supports AdaptDL on a Kubernetes cluster. For detail please refer to [AdaptDL Docs](../TrainingService/AdaptDLMode.md)
* TODO: explain frameworkcontroller.
### searchSpacePath
......
...@@ -118,3 +118,4 @@ Due to potential programming changes, the minimum system requirements of NNI may
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
\ No newline at end of file
...@@ -281,3 +281,4 @@ Below is the status of all trials. Specifically:
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
\ No newline at end of file
...@@ -8,6 +8,7 @@ Introduction to NNI Training Services
OpenPAI<./TrainingService/PaiMode>
OpenPAI Yarn Mode<./TrainingService/PaiYarnMode>
Kubeflow<./TrainingService/KubeflowMode>
AdaptDL<./TrainingService/AdaptDLMode>
FrameworkController<./TrainingService/FrameworkControllerMode>
DLTS<./TrainingService/DLTSMode>
AML<./TrainingService/AMLMode>
authorName: default
experimentName: example_pytorch_cifar10
trialConcurrency: 1
maxExecDuration: 100h
maxTrialNum: 10
nniManagerIp: {replace_with_your_ip}
trainingServicePlatform: adl
searchSpacePath: search_space_adl.json
logCollection: http
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 main_adl.py
  codeDir: .
  gpuNum: 1
  image: {replace_with_the_image_that_has_adaptdl_installed}
  adaptive: true
  checkpoint:
    storageClass: dfs
    storageSize: 1Gi
  cpuNum: 1
  memorySize: 1Gi
# Copyright 2020 Petuum, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Train CIFAR10 with PyTorch and AdaptDL. This example is based on:
https://github.com/petuum/adaptdl/blob/master/examples/pytorch-cifar/main.py
'''
import torch
import torch.nn as nn
import torch.optim as optim
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torchvision
import torchvision.transforms as transforms
import os
import argparse
from models import *
import adaptdl
import adaptdl.torch as adl
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.tensorboard import SummaryWriter
import nni
parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training')
parser.add_argument('--bs', default=128, type=int, help='batch size')
parser.add_argument('--lr', default=0.1, type=float, help='learning rate')
parser.add_argument('--epochs', default=30, type=int, help='number of epochs')
parser.add_argument('--model', default='ResNet18', type=str, help='model')
parser.add_argument('--autoscale-bsz', dest='autoscale_bsz', default=True, action='store_true', help='autoscale batchsize')
args = parser.parse_args()
# load the parameters from nni
RCV_CONFIG = nni.get_next_parameter()
args.lr = RCV_CONFIG["lr"]
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Data
print('==> Preparing data..')
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

adaptdl.torch.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

if adaptdl.env.replica_rank() == 0:
    trainset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=True, download=True, transform=transform_train)
    trainloader = adl.AdaptiveDataLoader(trainset, batch_size=args.bs, shuffle=True, num_workers=2, drop_last=True)
    dist.barrier()  # We use a barrier here so that non-master replicas would wait for master to download the data
else:
    dist.barrier()
    trainset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=True, download=False, transform=transform_train)
    trainloader = adl.AdaptiveDataLoader(trainset, batch_size=args.bs, shuffle=True, num_workers=2, drop_last=True)

if args.autoscale_bsz:
    trainloader.autoscale_batch_size(4096, local_bsz_bounds=(32, 1024), gradient_accumulation=True)

validset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=False, download=False, transform=transform_test)
validloader = adl.AdaptiveDataLoader(validset, batch_size=100, shuffle=False, num_workers=2)

# Model
print('==> Building model..')
net = eval(args.model)()
net = net.to(device)
if device == 'cuda':
    cudnn.benchmark = True

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD([{"params": [param]} for param in net.parameters()],
                      lr=args.lr, momentum=0.9, weight_decay=5e-4)
lr_scheduler = MultiStepLR(optimizer, [30, 45], 0.1)
net = adl.AdaptiveDataParallel(net, optimizer, lr_scheduler)


# Training
def train(epoch):
    print('\nEpoch: %d' % epoch)
    net.train()
    stats = adl.Accumulator()
    for inputs, targets in trainloader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        stats["loss_sum"] += loss.item() * targets.size(0)
        _, predicted = outputs.max(1)
        stats["total"] += targets.size(0)
        stats["correct"] += predicted.eq(targets).sum().item()

    trainloader.to_tensorboard(writer, epoch, tag_prefix="AdaptDL/Data/")
    net.to_tensorboard(writer, epoch, tag_prefix="AdaptDL/Model/")
    with stats.synchronized():
        stats["loss_avg"] = stats["loss_sum"] / stats["total"]
        stats["accuracy"] = stats["correct"] / stats["total"]
        writer.add_scalar("Loss/Train", stats["loss_avg"], epoch)
        writer.add_scalar("Accuracy/Train", stats["accuracy"], epoch)
        print("Train:", stats)


def valid(epoch):
    net.eval()
    stats = adl.Accumulator()
    with torch.no_grad():
        for inputs, targets in validloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = net(inputs)
            loss = criterion(outputs, targets)

            stats["loss_sum"] += loss.item() * targets.size(0)
            _, predicted = outputs.max(1)
            stats["total"] += targets.size(0)
            stats["correct"] += predicted.eq(targets).sum().item()

    with stats.synchronized():
        stats["loss_avg"] = stats["loss_sum"] / stats["total"]
        stats["accuracy"] = stats["correct"] / stats["total"]
        writer.add_scalar("Loss/Valid", stats["loss_avg"], epoch)
        writer.add_scalar("Accuracy/Valid", stats["accuracy"], epoch)
        if adaptdl.env.replica_rank() == 0:
            nni.report_intermediate_result(stats["accuracy"], accum=stats)
        print("Valid:", stats)
    return stats["accuracy"]


tensorboard_dir = os.path.join(
    os.getenv("ADAPTDL_TENSORBOARD_LOGDIR", "/adaptdl/tensorboard"),
    os.getenv("NNI_TRIAL_JOB_ID", "cifar-adaptdl")
)
if not os.path.exists(tensorboard_dir):
    os.makedirs(tensorboard_dir)
with SummaryWriter(tensorboard_dir) as writer:
    acc = 0
    for epoch in adl.remaining_epochs_until(args.epochs):
        train(epoch)
        acc = valid(epoch)
        lr_scheduler.step()

if adaptdl.env.replica_rank() == 0:
    nni.report_final_result(acc)
{
"lr":{"_type":"choice", "_value":[0.1, 0.01, 0.001]},
"bs":{"_type":"choice","_value":[64, 96, 128]},
"model":{"_type":"choice", "_value":["ResNet18", "SENet18", "MobileNet"]}
}
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
logCollection: http
trainingServicePlatform: adl
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  image: {replace_to_your_image_tag}
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
...@@ -9,7 +9,7 @@ if trial_env_vars.NNI_PLATFORM is None:
    from .standalone import *
elif trial_env_vars.NNI_PLATFORM == 'unittest':
    from .test import *
elif trial_env_vars.NNI_PLATFORM in ('adl', 'local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'):
    from .local import *
else:
    raise RuntimeError('Unknown platform %s' % trial_env_vars.NNI_PLATFORM)
...@@ -124,7 +124,7 @@ common_schema = {
    Optional('maxExecDuration'): And(Regex(r'^[1-9][0-9]*[s|m|h|d]$', error='ERROR: maxExecDuration format is [digit]{s,m,h,d}')),
    Optional('maxTrialNum'): setNumberRange('maxTrialNum', int, 1, 99999),
    'trainingServicePlatform': setChoice(
        'trainingServicePlatform', 'adl', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'),
    Optional('searchSpacePath'): And(os.path.exists, error=SCHEMA_PATH_ERROR % 'searchSpacePath'),
    Optional('multiPhase'): setType('multiPhase', bool),
    Optional('multiThread'): setType('multiThread', bool),
...@@ -262,6 +262,30 @@ aml_config_schema = {
    }
}
adl_trial_schema = {
    'trial':{
        'codeDir': setType('codeDir', str),
        'command': setType('command', str),
        'gpuNum': setNumberRange('gpuNum', int, 0, 99999),
        'image': setType('image', str),
        Optional('imagePullSecrets'): [{
            'name': setType('name', str)
        }],
        Optional('nfs'): {
            'server': setType('server', str),
            'path': setType('path', str),
            'containerMountPath': setType('containerMountPath', str)
        },
        Optional('adaptive'): setType('adaptive', bool),
        Optional('checkpoint'): {
            'storageClass': setType('storageClass', str),
            'storageSize': setType('storageSize', str)
        },
        Optional('cpuNum'): setNumberRange('cpuNum', int, 0, 99999),
        Optional('memorySize'): setType('memorySize', str)
    }
}
kubeflow_trial_schema = {
    'trial': {
        'codeDir': setPathCheck('codeDir'),
...@@ -404,6 +428,7 @@ machine_list_schema = {
}

training_service_schema_dict = {
    'adl': Schema({**common_schema, **adl_trial_schema}),
    'local': Schema({**common_schema, **common_trial_schema}),
    'remote': Schema({**common_schema, **common_trial_schema, **machine_list_schema, **remote_config_schema}),
    'pai': Schema({**common_schema, **pai_trial_schema, **pai_config_schema}),
......
...@@ -136,6 +136,14 @@ def set_local_config(experiment_config, port, config_file_name):
    return set_trial_config(experiment_config, port, config_file_name), None

def set_adl_config(experiment_config, port, config_file_name):
    '''set adl configuration'''
    result, message = setNNIManagerIp(experiment_config, port, config_file_name)
    if not result:
        return result, message
    #set trial_config
    return set_trial_config(experiment_config, port, config_file_name), None

def set_remote_config(experiment_config, port, config_file_name):
    '''Call setClusterMetadata to pass trial'''
    #set machine_list
...@@ -393,7 +401,9 @@ def set_platform_config(platform, experiment_config, port, config_file_name, res
    '''call set_cluster_metadata for specific platform'''
    print_normal('Setting {0} config...'.format(platform))
    config_result, err_msg = None, None
    if platform == 'adl':
        config_result, err_msg = set_adl_config(experiment_config, port, config_file_name)
    elif platform == 'local':
        config_result, err_msg = set_local_config(experiment_config, port, config_file_name)
    elif platform == 'remote':
        config_result, err_msg = set_remote_config(experiment_config, port, config_file_name)
......
...@@ -10,6 +10,7 @@ import re
import shutil
import subprocess
from functools import cmp_to_key
import traceback
from datetime import datetime, timezone
from subprocess import Popen
from pyhdfs import HdfsClient
...@@ -21,6 +22,7 @@ from .config_utils import Config, Experiments
from .constants import NNICTL_HOME_DIR, NNI_HOME_DIR, EXPERIMENT_INFORMATION_FORMAT, EXPERIMENT_DETAIL_FORMAT, \
    EXPERIMENT_MONITOR_INFO, TRIAL_MONITOR_HEAD, TRIAL_MONITOR_CONTENT, TRIAL_MONITOR_TAIL, REST_TIME_OUT
from .common_utils import print_normal, print_error, print_warning, detect_process, get_yml_content, generate_temp_dir
from .common_utils import print_green
from .command_utils import check_output_command, kill_command
from .ssh_utils import create_ssh_sftp_client, remove_remote_directory
...@@ -372,6 +374,40 @@ def log_stderr(args):
    '''get stderr log'''
    log_internal(args, 'stderr')

def log_trial_adl_helper(args, experiment_id):
    # adljob_id format should be consistent to the one in "adlTrainingService.ts":
    # const adlJobName: string = `nni-exp-${this.experimentId}-trial-${trialJobId}`.toLowerCase();
    adlJobName = "nni-exp-{}-trial-{}".format(experiment_id, args.trial_id).lower()
    print_warning('Note that no log will show when trial is pending or done (succeeded or failed). '
                  'You can retry the command.')
    print_green('>>> Trial log streaming:')
    try:
        subprocess.run(
            [
                "kubectl", "logs",
                "-l", "adaptdl/job=%s" % adlJobName,
                "-f"  # Follow the stream
            ],  # TODO: support remaining argument, uncomment the lines in nnictl.py
        )  # TODO: emulate tee behaviors, not necessary tho.
    except KeyboardInterrupt:
        pass
    except Exception:
        print_error('Error! Please check kubectl:')
        traceback.print_exc()
        exit(1)
    finally:
        print_green('<<< [adlJobName:%s]' % adlJobName)
        nni_manager_collection_path = os.path.expanduser('~/nni-experiments/%s/trials/%s/stdout_log_collection.log' %
                                                         (experiment_id, args.trial_id))
        print_green('>>> (Optional) How to persist the complete trial log locally:')
        print(
            'Please ensure `logCollection: http` '
            'exists in the experiment configuration yaml. '
            'After trial done, you can check it from the file below: \n %s'
            % nni_manager_collection_path
        )

def log_trial(args):
    ''''get trial log path'''
    trial_id_path_dict = {}
...@@ -394,10 +430,18 @@ def log_trial(args):
    else:
        print_error('Restful server is not running...')
        exit(1)
    is_adl = nni_config.get_config('experimentConfig').get('trainingServicePlatform') == 'adl'
    if is_adl and not args.trial_id:
        print_error('Trial ID is required to retrieve the log for adl. Please specify it with "--trial_id".')
        exit(1)
    if args.trial_id:
        if args.trial_id not in trial_id_list:
            print_error('Trial id {0} not correct, please check your command!'.format(args.trial_id))
            exit(1)
        if is_adl:
            log_trial_adl_helper(args, nni_config.get_config('experimentId'))
            # adl has its own way to log trial, and it thus returns right after the helper returns
            return
        if trial_id_path_dict.get(args.trial_id):
            print_normal('id:' + args.trial_id + ' path:' + trial_id_path_dict[args.trial_id])
        else:
......
...@@ -10,7 +10,7 @@ from .rest_utils import rest_get, check_rest_server_quick, check_response
from .config_utils import Config, Experiments
from .url_utils import trial_jobs_url, get_local_urls
from .constants import REST_TIME_OUT
from .common_utils import print_normal, print_warning, print_error, print_green, detect_process, detect_port, check_tensorboard_version
from .nnictl_utils import check_experiment_id, check_experiment_id
from .ssh_utils import create_ssh_sftp_client, copy_remote_directory_to_local
...@@ -110,14 +110,36 @@ def stop_tensorboard(args):
    else:
        print_error('No tensorboard configuration!')

def adl_tensorboard_helper(args):
    '''start tensorboard on adl'''
    import subprocess
    if args.trial_id is not None:
        print_warning('Tensorboard on adl platform will show all trials. No trial ids needed.')
    cmd = "kubectl port-forward --address 0.0.0.0 deployment/{} {}:{}".format(
        "adaptdl-tensorboard" + "-" + args.id.lower(),
        args.port,
        6006
    )
    print_green('Tensorboard is accessible at 0.0.0.0:{port} or localhost:{port}'.format(port=args.port))
    subprocess.run(args=cmd, shell=True)

def start_tensorboard(args):
    '''start tensorboard'''
    experiment_id = check_experiment_id(args)
    if not experiment_id:
        return
    if args.id is None:
        args.id = experiment_id
    experiment_config = Experiments()
    experiment_dict = experiment_config.get_all_experiments()
    if experiment_dict[args.id]["status"] == "STOPPED":
        print_error("Experiment {} is stopped...".format(args.id))
        return
    config_file_name = experiment_dict[experiment_id]['fileName']
    nni_config = Config(config_file_name)
    if nni_config.get_config('experimentConfig').get('trainingServicePlatform') == 'adl':
        adl_tensorboard_helper(args)
        return
    rest_port = nni_config.get_config('restServerPort')
    rest_pid = nni_config.get_config('restServerPid')
    if not detect_process(rest_pid):
...@@ -144,4 +166,4 @@ def start_tensorboard(args):
    os.makedirs(temp_nni_path, exist_ok=True)
    path_list = get_path_list(args, nni_config, trial_content, temp_nni_path)
    start_tensorboard_process(args, nni_config, path_list, temp_nni_path)
\ No newline at end of file
...@@ -28,6 +28,7 @@ logger = logging.getLogger('trial_keeper')
regular = re.compile('v?(?P<version>[0-9](\.[0-9]){0,1}).*')

_hdfs_client = None
_trial_process = None


def get_hdfs_client(args):
...@@ -62,6 +63,7 @@ def get_hdfs_client(args):

def main_loop(args):
    '''main loop logic for trial keeper'''
    global _trial_process

    if not os.path.exists(LOG_DIR):
        os.makedirs(LOG_DIR)
...@@ -90,13 +92,13 @@ def main_loop(args):
    # Notice: We don't appoint env, which means subprocess wil inherit current environment and that is expected behavior
    log_pipe_stdout = trial_syslogger_stdout.get_pipelog_reader()
    _trial_process = Popen(args.trial_command, shell=True, stdout=log_pipe_stdout, stderr=log_pipe_stdout, preexec_fn=os.setsid)
    nni_log(LogType.Info, 'Trial keeper spawns a subprocess (pid {0}) to run command: {1}'.format(_trial_process.pid,
                                                                                                  shlex.split(
                                                                                                      args.trial_command)))

    while True:
        retCode = _trial_process.poll()
        # child worker process exits and all stdout data is read
        if retCode is not None and log_pipe_stdout.set_process_exit() and log_pipe_stdout.is_read_completed == True:
            # In Windows, the retCode -1 is 4294967295. It's larger than c_long, and raise OverflowError.
...@@ -213,6 +215,20 @@ def fetch_parameter_file(args):
    fetch_file_thread.start()


def _set_adaptdl_signal_handler():
    import signal
    global _trial_process
    def _handler(signum, frame):
        nni_log(LogType.Info, "RECEIVED SIGNAL {}".format(signum))
        nni_log(LogType.Debug, "TRIAL PROCESS ID {}".format(_trial_process.pid))
        if _trial_process and (signum == signal.SIGTERM or signum == signal.SIGINT):
            os.killpg(os.getpgid(_trial_process.pid), signal.SIGINT)
            os.waitpid(_trial_process.pid, 0)
            exit(1)
    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)


if __name__ == '__main__':
    '''NNI Trial Keeper main function'''
    PARSER = argparse.ArgumentParser()
...@@ -237,6 +253,8 @@ if __name__ == '__main__':
    try:
        if NNI_PLATFORM == 'paiYarn' and is_multi_phase():
            fetch_parameter_file(args)
        if NNI_PLATFORM == 'adl':
            _set_adaptdl_signal_handler()
        main_loop(args)
    except SystemExit as se:
        nni_log(LogType.Info, 'NNI trial keeper exit with code {}'.format(se.code))
......
...@@ -97,6 +97,21 @@ def get_sequence_id():
_intermediate_seq = 0


def overwrite_intermediate_seq(value):
    """
    Overwrite intermediate sequence value.

    Parameters
    ----------
    value:
        int
    """
    assert isinstance(value, int)
    global _intermediate_seq
    _intermediate_seq = value


def report_intermediate_result(metric):
    """
    Reports intermediate result to NNI.
......
...@@ -46,6 +46,7 @@ interface TrialJobInfo {
    trialJobId: string;
    sequenceId?: number;
    status: TrialJobStatus;
    message?: string;
    startTime?: number;
    endTime?: number;
    hyperParameters?: string[];
......
...@@ -105,6 +105,7 @@ abstract class Manager {
    public abstract getTrialLog(trialJobId: string, logType: LogType): Promise<string>;
    public abstract getTrialJobStatistics(): Promise<TrialJobStatistics[]>;
    public abstract getTrialJobMessage(trialJobId: string): string | undefined;

    public abstract getStatus(): NNIManagerStatus;
}
......