"examples/git@developer.sourcefind.cn:OpenDAS/nni.git" did not exist on "3394f2c399bf8248931eb7b9f559a8be52ad9a07"
Commit 055885d9 authored by SparkSnail, committed by GitHub

Merge dev-adl2 into Master (#3117)

parent 2c5d89a7
...@@ -51,6 +51,7 @@ build/Release
# Dependency directories
node_modules/
jspm_packages/
**/package-lock.json

# TypeScript v1 declaration files
typings/
...@@ -81,6 +82,8 @@ __pycache__
build
*.egg-info
setup.pye
**/__init__.pye
**/.ipynb_checkpoints

# Environments
.env
......
# Run an Experiment on AdaptDL
NNI now supports running an experiment on [AdaptDL](https://github.com/petuum/adaptdl). Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or on [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. In AdaptDL mode, your trial program runs as an AdaptDL job in the Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
## Prerequisites for Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes [on Azure](https://azure.microsoft.com/en-us/services/kubernetes-service/), or [on-premises](https://kubernetes.io/docs/setup/) with [cephfs](https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd), or [microk8s with the storage add-on enabled](https://microk8s.io/docs/addons).
2. Install the **AdaptDL scheduler** into your Kubernetes cluster with Helm. Follow this [guideline](https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html) to set up the AdaptDL scheduler.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, the NNI manager uses $(HOME)/.kube/config as the kubeconfig path. You can also point to another kubeconfig file by setting the **KUBECONFIG** environment variable (see the example after this list). Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
5. (Optional) Prepare an **NFS server** and export a general-purpose mount as external storage.
6. Install **NNI**, following the install guide [here](../Tutorial/QuickStart.md).
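For example, if your kubeconfig lives somewhere other than the default location, you can point NNI to it before launching the experiment (the path below is a placeholder):
```bash
# Point NNI (and kubectl) at a non-default kubeconfig file.
export KUBECONFIG=/path/to/your/kubeconfig
```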
### Verify Prerequisites
```bash
nnictl --version
# Expected: <version_number>
```
```bash
kubectl version
# Expected that the kubectl client version matches the server version.
```
```bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
```
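Optionally, you can also confirm that the AdaptDL scheduler pods are up; the exact namespace depends on how you installed the scheduler:
```bash
kubectl get pods --all-namespaces | grep adaptdl
# Expected: the AdaptDL scheduler pod(s) in Running state.
```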
## Run an experiment
We provide a CIFAR10 example that fully leverages the AdaptDL scheduler in the `examples/trials/cifar10_pytorch` folder (`main_adl.py` and `config_adl.yaml`).
Here is a template configuration specification to use AdaptDL as a training service.
```yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
  builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
  adaptive: false # optional.
  image: <image_tag>
  imagePullSecrets: # optional
    - name: stagingsecret
  codeDir: .
  command: python main.py
  gpuNum: 1
  cpuNum: 1 # optional
  memorySize: 8Gi # optional
  nfs: # optional
    server: 10.20.41.55
    path: /
    containerMountPath: /nfs
  checkpoint: # optional
    storageClass: microk8s-hostpath
    storageSize: 1Gi
```
Configurations not mentioned below follow the
[default specs defined in the NNI doc](https://nni.readthedocs.io/en/latest/Tutorial/ExperimentConfig.html#configuration-spec).
* **trainingServicePlatform**: Choose `adl` to use the Kubernetes cluster with the AdaptDL scheduler.
* **nniManagerIp**: *Required* for the `adl` training service to send the correct info and metrics back from the cluster.
It is the IP address of the machine running the NNI manager (nnictl) that launches the NNI experiment.
* **logCollection**: *Recommended* to set as `http`. It collects the trial logs from the cluster back to your machine via HTTP.
* **tuner**: It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuner).
* **trial**: It defines the specs of an `adl` trial.
  * **adaptive**: (*Optional*) Boolean for the AdaptDL trainer. When `true`, the job is preemptible and adaptive.
  * **image**: Docker image for the trial.
  * **imagePullSecrets**: (*Optional*) If you are using a private registry,
  you need to provide the secret to successfully pull the image.
  * **codeDir**: the working directory of the container. `.` means the default working directory defined by the image.
  * **command**: the bash command that starts the trial.
  * **gpuNum**: the number of GPUs requested for this trial. It must be a non-negative integer.
  * **cpuNum**: (*Optional*) the number of CPUs requested for this trial. It must be a non-negative integer.
  * **memorySize**: (*Optional*) the amount of memory requested for this trial. It must follow the Kubernetes
  [default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
  * **nfs**: (*Optional*) mounts external storage. For more information about using NFS, please check the paragraph below.
  * **checkpoint**: (*Optional*) [storage settings](https://kubernetes.io/docs/concepts/storage/storage-classes/) for AdaptDL's internal checkpoints. You can leave it out unless you need to customize the checkpoint storage.
### NFS Storage
As you may have noticed in the configuration spec above,
an *optional* section is available to configure NFS external storage. It can be omitted when no external storage is required, for example when the Docker image already contains the code and data.
Note that the `adl` training service does NOT mount the NFS onto your local dev machine; you can mount it locally yourself to manage the filesystem, copy data or code, and so on.
The `adl` training service can then mount it into Kubernetes for every trial, given the proper configuration:
* **server**: NFS server address, e.g. an IP address or domain name
* **path**: NFS server export path, i.e. the absolute path on the NFS that can be mounted into trials
* **containerMountPath**: the absolute path inside the container at which the NFS **path** above is mounted,
so that every trial has access to the NFS.
Inside the trial containers, you access the NFS via this path.
Use cases:
* If your training trials depend on a large dataset, you may want to download it onto the NFS first,
and mount it so that it can be shared across multiple trials.
* Container storage is ephemeral and trial containers are deleted once a trial's lifecycle is over.
So if you want to export your trained models,
you can mount the NFS to the trial to persist and export them.
In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly per your needs; a minimal sketch of the workflow follows.
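For illustration, a sketch of the manual workflow described above, assuming the NFS server and export path from the example config, a local mount point of `/mnt/nfs`, and a hypothetical `datasets/cifar10` directory:
```bash
# On your local dev machine: mount the NFS export manually, then copy the dataset onto it.
sudo mkdir -p /mnt/nfs
sudo mount -t nfs 10.20.41.55:/ /mnt/nfs
cp -r ./cifar10 /mnt/nfs/datasets/
# Inside a trial container (with containerMountPath: /nfs), the same data is visible at:
#   /nfs/datasets/cifar10
```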
## Monitor via Log Stream
To follow the log stream of a certain trial:
```bash
nnictl log trial --trial_id=<trial_id>
```
or, for a specific experiment:
```bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
```
Note that *after* a trial has finished and its pod has been deleted,
no logs can be retrieved via this command.
However, you may still be able to access past trial logs: if `logCollection: http` is set in the experiment configuration,
the collected stdout of a finished trial remains available on the NNI manager machine, as sketched below.
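A sketch of where to find the collected log, based on the path printed by the `nnictl` helper added in this commit (the experiment and trial IDs are placeholders):
```bash
# Assumes logCollection: http was set when the experiment was started.
cat ~/nni-experiments/<experiment_id>/trials/<trial_id>/stdout_log_collection.log
```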
## Monitor via TensorBoard
In the context of NNI, an experiment has multiple trials.
To make it easy to compare trials during a model-tuning process,
we support TensorBoard integration: each experiment has
its own TensorBoard logging directory, and thus its own dashboard.
You can only use TensorBoard while the monitored experiment is running;
monitoring stopped experiments is not supported.
Inside the trial container, you have access to two environment variables:
* `ADAPTDL_TENSORBOARD_LOGDIR`: the TensorBoard logging directory for the current experiment,
* `NNI_TRIAL_JOB_ID`: the `trial` job id for the current trial.
It is recommended to join them to form the per-trial logging directory,
for example in Python:
```python
import os
tensorboard_logdir = os.path.join(
    os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
    os.getenv("NNI_TRIAL_JOB_ID")
)
```
If an experiment is stopped, the data logged here
(under the directory defined by *the above envs*, and monitored with the following commands)
will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS)
to export it and view TensorBoard locally.
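For example, a sketch that copies the per-trial logs onto the NFS at the end of a trial so they can be viewed locally afterwards. It assumes the NFS is mounted at `/nfs` inside the container (as in the example config) and at `/mnt/nfs` on your local machine; the `tensorboard-exports` directory is a placeholder:
```bash
# Inside the trial, at the end of the run: copy the TensorBoard logs onto the NFS mount.
mkdir -p /nfs/tensorboard-exports
cp -r "${ADAPTDL_TENSORBOARD_LOGDIR}/${NNI_TRIAL_JOB_ID}" /nfs/tensorboard-exports/
# Later, on your local machine (with the same NFS mounted at /mnt/nfs):
tensorboard --logdir /mnt/nfs/tensorboard-exports
```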
With the above setting, you can easily monitor the experiment
via TensorBoard by running
```bash
nnictl tensorboard start
```
If you have multiple experiments running at the same time, you may use
```bash
nnictl tensorboard start <experiment_id>
```
The command prints the web URL for accessing TensorBoard.
Note that you have the flexibility to choose the local `--port`
used for TensorBoard.
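For instance, to serve the dashboard on a non-default local port (the port number below is arbitrary):
```bash
nnictl tensorboard start <experiment_id> --port 6007
```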
...@@ -4,7 +4,7 @@
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use training service provided by NNI, to run trial jobs on [local machine](./LocalMode.md), [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [AdaptDL](./AdaptDLMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*.
If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to "[how to implement training service](./HowToImplementTrainingService)" for details.
...@@ -24,6 +24,7 @@ In case users intend to use large files in their experiment (like large-scaled d
|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.|
|[__PAI__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.|
|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
|[__AdaptDL__](./AdaptDLMode.md)|NNI supports running experiment on [AdaptDL](https://github.com/petuum/adaptdl), called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.|
|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
|[__DLTS__](./DLTSMode.md)|NNI supports running experiment using [DLTS](https://github.com/microsoft/DLWorkspace.git), which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.|
|[__AML__](./AMLMode.md)|NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/) , called aml mode.
......
...@@ -260,6 +260,8 @@ Specifies the platform to run the experiment, including __local__, __remote__, _
* __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), NNI support kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For detail please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md)
* __adl__ submit trial jobs to [AdaptDL](https://github.com/petuum/adaptdl), NNI supports AdaptDL on a Kubernetes cluster. For detail please refer to [AdaptDL Docs](../TrainingService/AdaptDLMode.md)
* TODO: explain frameworkcontroller.
### searchSpacePath
......
...@@ -118,3 +118,4 @@ Due to potential programming changes, the minimum system requirements of NNI may
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
\ No newline at end of file
...@@ -281,3 +281,4 @@ Below is the status of all trials. Specifically:
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
\ No newline at end of file
...@@ -8,6 +8,7 @@ Introduction to NNI Training Services
OpenPAI<./TrainingService/PaiMode>
OpenPAI Yarn Mode<./TrainingService/PaiYarnMode>
Kubeflow<./TrainingService/KubeflowMode>
AdaptDL<./TrainingService/AdaptDLMode>
FrameworkController<./TrainingService/FrameworkControllerMode>
DLTS<./TrainingService/DLTSMode>
AML<./TrainingService/AMLMode>
authorName: default
experimentName: example_pytorch_cifar10
trialConcurrency: 1
maxExecDuration: 100h
maxTrialNum: 10
nniManagerIp: {replace_with_your_ip}
trainingServicePlatform: adl
searchSpacePath: search_space_adl.json
logCollection: http
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 main_adl.py
  codeDir: .
  gpuNum: 1
  image: {replace_with_the_image_that_has_adaptdl_installed}
  adaptive: true
  checkpoint:
    storageClass: dfs
    storageSize: 1Gi
  cpuNum: 1
  memorySize: 1Gi
# Copyright 2020 Petuum, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Train CIFAR10 with PyTorch and AdaptDL. This example is based on:
https://github.com/petuum/adaptdl/blob/master/examples/pytorch-cifar/main.py
'''
import torch
import torch.nn as nn
import torch.optim as optim
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torchvision
import torchvision.transforms as transforms
import os
import argparse
from models import *
import adaptdl
import adaptdl.torch as adl
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.tensorboard import SummaryWriter
import nni
parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training')
parser.add_argument('--bs', default=128, type=int, help='batch size')
parser.add_argument('--lr', default=0.1, type=float, help='learning rate')
parser.add_argument('--epochs', default=30, type=int, help='number of epochs')
parser.add_argument('--model', default='ResNet18', type=str, help='model')
parser.add_argument('--autoscale-bsz', dest='autoscale_bsz', default=True, action='store_true', help='autoscale batchsize')
args = parser.parse_args()
# load the parameters from nni
RCV_CONFIG = nni.get_next_parameter()
args.lr = RCV_CONFIG["lr"]
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Data
print('==> Preparing data..')
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

adaptdl.torch.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

if adaptdl.env.replica_rank() == 0:
    trainset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=True, download=True, transform=transform_train)
    trainloader = adl.AdaptiveDataLoader(trainset, batch_size=args.bs, shuffle=True, num_workers=2, drop_last=True)
    dist.barrier()  # We use a barrier here so that non-master replicas would wait for master to download the data
else:
    dist.barrier()
    trainset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=True, download=False, transform=transform_train)
    trainloader = adl.AdaptiveDataLoader(trainset, batch_size=args.bs, shuffle=True, num_workers=2, drop_last=True)

if args.autoscale_bsz:
    trainloader.autoscale_batch_size(4096, local_bsz_bounds=(32, 1024), gradient_accumulation=True)

validset = torchvision.datasets.CIFAR10(root=adaptdl.env.share_path(), train=False, download=False, transform=transform_test)
validloader = adl.AdaptiveDataLoader(validset, batch_size=100, shuffle=False, num_workers=2)

# Model
print('==> Building model..')
net = eval(args.model)()
net = net.to(device)
if device == 'cuda':
    cudnn.benchmark = True

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD([{"params": [param]} for param in net.parameters()],
                      lr=args.lr, momentum=0.9, weight_decay=5e-4)
lr_scheduler = MultiStepLR(optimizer, [30, 45], 0.1)
net = adl.AdaptiveDataParallel(net, optimizer, lr_scheduler)


# Training
def train(epoch):
    print('\nEpoch: %d' % epoch)
    net.train()
    stats = adl.Accumulator()
    for inputs, targets in trainloader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        stats["loss_sum"] += loss.item() * targets.size(0)
        _, predicted = outputs.max(1)
        stats["total"] += targets.size(0)
        stats["correct"] += predicted.eq(targets).sum().item()

    trainloader.to_tensorboard(writer, epoch, tag_prefix="AdaptDL/Data/")
    net.to_tensorboard(writer, epoch, tag_prefix="AdaptDL/Model/")
    with stats.synchronized():
        stats["loss_avg"] = stats["loss_sum"] / stats["total"]
        stats["accuracy"] = stats["correct"] / stats["total"]
        writer.add_scalar("Loss/Train", stats["loss_avg"], epoch)
        writer.add_scalar("Accuracy/Train", stats["accuracy"], epoch)
        print("Train:", stats)


def valid(epoch):
    net.eval()
    stats = adl.Accumulator()
    with torch.no_grad():
        for inputs, targets in validloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = net(inputs)
            loss = criterion(outputs, targets)

            stats["loss_sum"] += loss.item() * targets.size(0)
            _, predicted = outputs.max(1)
            stats["total"] += targets.size(0)
            stats["correct"] += predicted.eq(targets).sum().item()

    with stats.synchronized():
        stats["loss_avg"] = stats["loss_sum"] / stats["total"]
        stats["accuracy"] = stats["correct"] / stats["total"]
        writer.add_scalar("Loss/Valid", stats["loss_avg"], epoch)
        writer.add_scalar("Accuracy/Valid", stats["accuracy"], epoch)
        if adaptdl.env.replica_rank() == 0:
            nni.report_intermediate_result(stats["accuracy"], accum=stats)
        print("Valid:", stats)
    return stats["accuracy"]


tensorboard_dir = os.path.join(
    os.getenv("ADAPTDL_TENSORBOARD_LOGDIR", "/adaptdl/tensorboard"),
    os.getenv("NNI_TRIAL_JOB_ID", "cifar-adaptdl")
)
if not os.path.exists(tensorboard_dir):
    os.makedirs(tensorboard_dir)
with SummaryWriter(tensorboard_dir) as writer:
    acc = 0
    for epoch in adl.remaining_epochs_until(args.epochs):
        train(epoch)
        acc = valid(epoch)
        lr_scheduler.step()

if adaptdl.env.replica_rank() == 0:
    nni.report_final_result(acc)
{
"lr":{"_type":"choice", "_value":[0.1, 0.01, 0.001]},
"bs":{"_type":"choice","_value":[64, 96, 128]},
"model":{"_type":"choice", "_value":["ResNet18", "SENet18", "MobileNet"]}
}
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
logCollection: http
trainingServicePlatform: adl
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  image: {replace_to_your_image_tag}
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
...@@ -9,7 +9,7 @@ if trial_env_vars.NNI_PLATFORM is None:
    from .standalone import *
elif trial_env_vars.NNI_PLATFORM == 'unittest':
    from .test import *
elif trial_env_vars.NNI_PLATFORM in ('adl', 'local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'):
    from .local import *
else:
    raise RuntimeError('Unknown platform %s' % trial_env_vars.NNI_PLATFORM)
...@@ -124,7 +124,7 @@ common_schema = {
    Optional('maxExecDuration'): And(Regex(r'^[1-9][0-9]*[s|m|h|d]$', error='ERROR: maxExecDuration format is [digit]{s,m,h,d}')),
    Optional('maxTrialNum'): setNumberRange('maxTrialNum', int, 1, 99999),
    'trainingServicePlatform': setChoice(
        'trainingServicePlatform', 'adl', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts', 'aml'),
    Optional('searchSpacePath'): And(os.path.exists, error=SCHEMA_PATH_ERROR % 'searchSpacePath'),
    Optional('multiPhase'): setType('multiPhase', bool),
    Optional('multiThread'): setType('multiThread', bool),
...@@ -262,6 +262,30 @@ aml_config_schema = {
    }
}
adl_trial_schema = {
    'trial':{
        'codeDir': setType('codeDir', str),
        'command': setType('command', str),
        'gpuNum': setNumberRange('gpuNum', int, 0, 99999),
        'image': setType('image', str),
        Optional('imagePullSecrets'): [{
            'name': setType('name', str)
        }],
        Optional('nfs'): {
            'server': setType('server', str),
            'path': setType('path', str),
            'containerMountPath': setType('containerMountPath', str)
        },
        Optional('adaptive'): setType('adaptive', bool),
        Optional('checkpoint'): {
            'storageClass': setType('storageClass', str),
            'storageSize': setType('storageSize', str)
        },
        Optional('cpuNum'): setNumberRange('cpuNum', int, 0, 99999),
        Optional('memorySize'): setType('memorySize', str)
    }
}
kubeflow_trial_schema = {
    'trial': {
        'codeDir': setPathCheck('codeDir'),
...@@ -404,6 +428,7 @@ machine_list_schema = {
}

training_service_schema_dict = {
    'adl': Schema({**common_schema, **adl_trial_schema}),
    'local': Schema({**common_schema, **common_trial_schema}),
    'remote': Schema({**common_schema, **common_trial_schema, **machine_list_schema, **remote_config_schema}),
    'pai': Schema({**common_schema, **pai_trial_schema, **pai_config_schema}),
......
...@@ -136,6 +136,14 @@ def set_local_config(experiment_config, port, config_file_name):
    return set_trial_config(experiment_config, port, config_file_name), None

def set_adl_config(experiment_config, port, config_file_name):
    '''set adl configuration'''
    result, message = setNNIManagerIp(experiment_config, port, config_file_name)
    if not result:
        return result, message
    #set trial_config
    return set_trial_config(experiment_config, port, config_file_name), None

def set_remote_config(experiment_config, port, config_file_name):
    '''Call setClusterMetadata to pass trial'''
    #set machine_list
...@@ -393,7 +401,9 @@ def set_platform_config(platform, experiment_config, port, config_file_name, res
    '''call set_cluster_metadata for specific platform'''
    print_normal('Setting {0} config...'.format(platform))
    config_result, err_msg = None, None
    if platform == 'adl':
        config_result, err_msg = set_adl_config(experiment_config, port, config_file_name)
    elif platform == 'local':
        config_result, err_msg = set_local_config(experiment_config, port, config_file_name)
    elif platform == 'remote':
        config_result, err_msg = set_remote_config(experiment_config, port, config_file_name)
......
...@@ -10,6 +10,7 @@ import re
import shutil
import subprocess
from functools import cmp_to_key
import traceback
from datetime import datetime, timezone
from subprocess import Popen
from pyhdfs import HdfsClient
...@@ -21,6 +22,7 @@ from .config_utils import Config, Experiments
from .constants import NNICTL_HOME_DIR, NNI_HOME_DIR, EXPERIMENT_INFORMATION_FORMAT, EXPERIMENT_DETAIL_FORMAT, \
    EXPERIMENT_MONITOR_INFO, TRIAL_MONITOR_HEAD, TRIAL_MONITOR_CONTENT, TRIAL_MONITOR_TAIL, REST_TIME_OUT
from .common_utils import print_normal, print_error, print_warning, detect_process, get_yml_content, generate_temp_dir
from .common_utils import print_green
from .command_utils import check_output_command, kill_command
from .ssh_utils import create_ssh_sftp_client, remove_remote_directory
...@@ -372,6 +374,40 @@ def log_stderr(args):
    '''get stderr log'''
    log_internal(args, 'stderr')

def log_trial_adl_helper(args, experiment_id):
    # adljob_id format should be consistent to the one in "adlTrainingService.ts":
    # const adlJobName: string = `nni-exp-${this.experimentId}-trial-${trialJobId}`.toLowerCase();
    adlJobName = "nni-exp-{}-trial-{}".format(experiment_id, args.trial_id).lower()
    print_warning('Note that no log will show when trial is pending or done (succeeded or failed). '
                  'You can retry the command.')
    print_green('>>> Trial log streaming:')
    try:
        subprocess.run(
            [
                "kubectl", "logs",
                "-l", "adaptdl/job=%s" % adlJobName,
                "-f"  # Follow the stream
            ],  # TODO: support remaining argument, uncomment the lines in nnictl.py
        )  # TODO: emulate tee behaviors, not necessary tho.
    except KeyboardInterrupt:
        pass
    except Exception:
        print_error('Error! Please check kubectl:')
        traceback.print_exc()
        exit(1)
    finally:
        print_green('<<< [adlJobName:%s]' % adlJobName)
        nni_manager_collection_path = os.path.expanduser('~/nni-experiments/%s/trials/%s/stdout_log_collection.log' %
                                                         (experiment_id, args.trial_id))
        print_green('>>> (Optional) How to persist the complete trial log locally:')
        print(
            'Please ensure `logCollection: http` '
            'exists in the experiment configuration yaml. '
            'After trial done, you can check it from the file below: \n %s'
            % nni_manager_collection_path
        )

def log_trial(args):
    ''''get trial log path'''
    trial_id_path_dict = {}
...@@ -394,10 +430,18 @@ def log_trial(args):
    else:
        print_error('Restful server is not running...')
        exit(1)
    is_adl = nni_config.get_config('experimentConfig').get('trainingServicePlatform') == 'adl'
    if is_adl and not args.trial_id:
        print_error('Trial ID is required to retrieve the log for adl. Please specify it with "--trial_id".')
        exit(1)
    if args.trial_id:
        if args.trial_id not in trial_id_list:
            print_error('Trial id {0} not correct, please check your command!'.format(args.trial_id))
            exit(1)
        if is_adl:
            log_trial_adl_helper(args, nni_config.get_config('experimentId'))
            # adl has its own way to log trial, and it thus returns right after the helper returns
            return
        if trial_id_path_dict.get(args.trial_id):
            print_normal('id:' + args.trial_id + ' path:' + trial_id_path_dict[args.trial_id])
        else:
......
...@@ -10,7 +10,7 @@ from .rest_utils import rest_get, check_rest_server_quick, check_response
from .config_utils import Config, Experiments
from .url_utils import trial_jobs_url, get_local_urls
from .constants import REST_TIME_OUT
from .common_utils import print_normal, print_warning, print_error, print_green, detect_process, detect_port, check_tensorboard_version
from .nnictl_utils import check_experiment_id, check_experiment_id
from .ssh_utils import create_ssh_sftp_client, copy_remote_directory_to_local
...@@ -110,14 +110,36 @@ def stop_tensorboard(args):
    else:
        print_error('No tensorboard configuration!')

def adl_tensorboard_helper(args):
    '''start tensorboard on adl'''
    import subprocess
    if args.trial_id is not None:
        print_warning('Tensorboard on adl platform will show all trials. No trial ids needed.')
    cmd = "kubectl port-forward --address 0.0.0.0 deployment/{} {}:{}".format(
        "adaptdl-tensorboard" + "-" + args.id.lower(),
        args.port,
        6006
    )
    print_green('Tensorboard is accessible at 0.0.0.0:{port} or localhost:{port}'.format(port=args.port))
    subprocess.run(args=cmd, shell=True)

def start_tensorboard(args):
    '''start tensorboard'''
    experiment_id = check_experiment_id(args)
    if not experiment_id:
        return
    if args.id is None:
        args.id = experiment_id
    experiment_config = Experiments()
    experiment_dict = experiment_config.get_all_experiments()
    if experiment_dict[args.id]["status"] == "STOPPED":
        print_error("Experiment {} is stopped...".format(args.id))
        return
    config_file_name = experiment_dict[experiment_id]['fileName']
    nni_config = Config(config_file_name)
    if nni_config.get_config('experimentConfig').get('trainingServicePlatform') == 'adl':
        adl_tensorboard_helper(args)
        return
    rest_port = nni_config.get_config('restServerPort')
    rest_pid = nni_config.get_config('restServerPid')
    if not detect_process(rest_pid):
...@@ -144,4 +166,4 @@ def start_tensorboard(args):
    os.makedirs(temp_nni_path, exist_ok=True)
    path_list = get_path_list(args, nni_config, trial_content, temp_nni_path)
    start_tensorboard_process(args, nni_config, path_list, temp_nni_path)
\ No newline at end of file
...@@ -28,6 +28,7 @@ logger = logging.getLogger('trial_keeper')
regular = re.compile('v?(?P<version>[0-9](\.[0-9]){0,1}).*')

_hdfs_client = None
_trial_process = None


def get_hdfs_client(args):
...@@ -62,6 +63,7 @@ def get_hdfs_client(args):

def main_loop(args):
    '''main loop logic for trial keeper'''
    global _trial_process

    if not os.path.exists(LOG_DIR):
        os.makedirs(LOG_DIR)
...@@ -90,13 +92,13 @@ def main_loop(args):
    # Notice: We don't appoint env, which means subprocess wil inherit current environment and that is expected behavior
    log_pipe_stdout = trial_syslogger_stdout.get_pipelog_reader()
    _trial_process = Popen(args.trial_command, shell=True, stdout=log_pipe_stdout, stderr=log_pipe_stdout, preexec_fn=os.setsid)
    nni_log(LogType.Info, 'Trial keeper spawns a subprocess (pid {0}) to run command: {1}'.format(_trial_process.pid,
                                                                                                  shlex.split(
                                                                                                      args.trial_command)))

    while True:
        retCode = _trial_process.poll()
        # child worker process exits and all stdout data is read
        if retCode is not None and log_pipe_stdout.set_process_exit() and log_pipe_stdout.is_read_completed == True:
            # In Windows, the retCode -1 is 4294967295. It's larger than c_long, and raise OverflowError.
...@@ -213,6 +215,20 @@ def fetch_parameter_file(args):
    fetch_file_thread.start()


def _set_adaptdl_signal_handler():
    import signal
    global _trial_process
    def _handler(signum, frame):
        nni_log(LogType.Info, "RECEIVED SIGNAL {}".format(signum))
        nni_log(LogType.Debug, "TRIAL PROCESS ID {}".format(_trial_process.pid))
        if _trial_process and (signum == signal.SIGTERM or signum == signal.SIGINT):
            os.killpg(os.getpgid(_trial_process.pid), signal.SIGINT)
            os.waitpid(_trial_process.pid, 0)
            exit(1)
    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)


if __name__ == '__main__':
    '''NNI Trial Keeper main function'''
    PARSER = argparse.ArgumentParser()
...@@ -237,6 +253,8 @@ if __name__ == '__main__':
    try:
        if NNI_PLATFORM == 'paiYarn' and is_multi_phase():
            fetch_parameter_file(args)
        if NNI_PLATFORM == 'adl':
            _set_adaptdl_signal_handler()
        main_loop(args)
    except SystemExit as se:
        nni_log(LogType.Info, 'NNI trial keeper exit with code {}'.format(se.code))
......
...@@ -97,6 +97,21 @@ def get_sequence_id():
_intermediate_seq = 0


def overwrite_intermediate_seq(value):
    """
    Overwrite intermediate sequence value.

    Parameters
    ----------
    value:
        int
    """
    assert isinstance(value, int)
    global _intermediate_seq
    _intermediate_seq = value


def report_intermediate_result(metric):
    """
    Reports intermediate result to NNI.
......
...@@ -46,6 +46,7 @@ interface TrialJobInfo {
    trialJobId: string;
    sequenceId?: number;
    status: TrialJobStatus;
    message?: string;
    startTime?: number;
    endTime?: number;
    hyperParameters?: string[];
......
...@@ -105,6 +105,7 @@ abstract class Manager {
    public abstract getTrialLog(trialJobId: string, logType: LogType): Promise<string>;
    public abstract getTrialJobStatistics(): Promise<TrialJobStatistics[]>;
    public abstract getTrialJobMessage(trialJobId: string): string | undefined;

    public abstract getStatus(): NNIManagerStatus;
}
......