Unverified commit d6530f07 authored by xuehui, committed by GitHub

Merge pull request #611 from Microsoft/v0.5

Merge v0.5 back to master
parents efa479b0 1d6db235
......@@ -46,7 +46,7 @@ We encourage researchers and students to leverage these projects to accelerate the
* We currently support Linux (Ubuntu 16.04 or higher) and macOS (10.14.1).
* Run the following commands in an environment that has `python >= 3.5`, `git` and `wget`.
```bash
git clone -b v0.4.1 https://github.com/Microsoft/nni.git
git clone -b v0.5 https://github.com/Microsoft/nni.git
cd nni
source install.sh
```
......@@ -58,7 +58,7 @@ For the system requirements of NNI, please refer to [Install NNI](docs/Installat
The following example is an experiment built on TensorFlow. Make sure you have **TensorFlow installed** before running it.
* Download the examples by cloning the source code.
```bash
git clone -b v0.4.1 https://github.com/Microsoft/nni.git
git clone -b v0.5 https://github.com/Microsoft/nni.git
```
* Run the MNIST example.
```bash
......
......@@ -46,33 +46,43 @@ RUN DEBIAN_FRONTEND=noninteractive && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
#
# update pip
#
RUN python3 -m pip install --upgrade pip
# numpy 1.14.3 scipy 1.1.0
RUN python3 -m pip --no-cache-dir install \
numpy==1.14.3 scipy==1.1.0
#
#Tensorflow 1.10.0
# Tensorflow 1.10.0
#
RUN python3 -m pip --no-cache-dir install tensorflow-gpu==1.10.0
#
#Keras 2.1.6
# Keras 2.1.6
#
RUN python3 -m pip --no-cache-dir install Keras==2.1.6
#
#PyTorch
# PyTorch
#
RUN python3 -m pip --no-cache-dir install torch==0.4.1
RUN python3 -m pip install torchvision==0.2.1
#
#sklearn 0.20.0
# sklearn 0.20.0
#
RUN python3 -m pip --no-cache-dir install scikit-learn==0.20.0
#
#Install NNI
# pandas==0.23.4 lightgbm==2.2.2
#
RUN python3 -m pip --no-cache-dir install pandas==0.23.4 lightgbm==2.2.2
#
# Install NNI
#
RUN python3 -m pip --no-cache-dir install nni
......
......@@ -6,11 +6,13 @@ This is the Dockerfile of the nni project. It includes several popular deep learning
```
CUDA 9.0, CuDNN 7.0
numpy 1.14.3, scipy 1.1.0
TensorFlow 1.5.0
TensorFlow 1.10.0
Keras 2.1.6
PyTorch 0.4.1
scikit-learn 0.20.0
NNI v0.3
pandas 0.23.4
lightgbm 2.2.2
NNI v0.5
```
You can take this Dockerfile as a reference for your own customized Dockerfile.
......
......@@ -32,29 +32,31 @@ tf.init_from_checkpoint(params['restore_path'])
where `'save_path'` and `'restore_path'` in hyper-parameter can be managed by the tuner.
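As an illustrative sketch of how a trial might wire these paths up (assuming the `'save_path'`/`'restore_path'` keys arrive through `nni.get_next_parameter()`, and using `tf.train.Saver` for both saving and restoring):
```python
import nni
import tensorflow as tf

params = nni.get_next_parameter()

# Stand-in for the real model: a single trainable variable.
weight = tf.get_variable('weight', shape=[1], initializer=tf.zeros_initializer())

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    if params.get('restore_path'):
        # Warm-start from the weights shared (e.g. over NFS) by a previous trial.
        saver.restore(sess, params['restore_path'])
    # ... train the model and report metrics here ...
    # Export this trial's weights so that later trials can reuse them.
    saver.save(sess, params['save_path'])
```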
### NFS Setup
In NFS, files are physically stored on a server machine, and trials on the client machine can read/write those files in the same way that they access local files.
NFS follows a client-server architecture: an NFS server provides physical storage, and trials on remote machines with an NFS client can read/write those files in the same way that they access local files.
#### Install NFS on server machine
First, install NFS server:
#### NFS Server
An NFS server can be any machine with enough physical storage and a network connection to the **remote machines** that run NNI trials. Usually you can choose one of the remote machines as the NFS server.
On Ubuntu, install the NFS server through `apt-get`:
```bash
sudo apt-get install nfs-kernel-server
```
Suppose `/tmp/nni/shared` is used as the physical storage, then run:
```bash
sudo mkdir -p /tmp/nni/shared
mkdir -p /tmp/nni/shared
echo "/tmp/nni/shared *(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
```
You can check whether the above directory is successfully exported by NFS using `sudo showmount -e localhost`.
#### Install NFS on client machine
First, install NFS client:
#### NFS Client
For a trial on a remote machine to access the shared files through NFS, an NFS client needs to be installed. For example, on Ubuntu:
```bash
sudo apt-get install nfs-common
```
Then create and mount the directory for the shared files:
```bash
sudo mkdir -p /mnt/nfs/nni/
mkdir -p /mnt/nfs/nni/
sudo mount -t nfs 10.10.10.10:/tmp/nni/shared /mnt/nfs/nni
```
where `10.10.10.10` should be replaced with the real IP address of the NFS server machine.
......
......@@ -9,12 +9,12 @@ When you encounter errors like the ones below, try cleaning up the **tmp** folder first.
### Cannot get trials' metrics in OpenPAI mode
In OpenPAI training mode, we start a REST server in nniManager that listens on port 51189 to receive metrics reported from trials running in the OpenPAI cluster. If you don't see any metrics on the WebUI in OpenPAI mode, check the machine where nniManager runs and make sure port 51189 is open in the firewall rules.
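A quick reachability sketch (replace `10.10.10.10` with the IP of the machine where nniManager runs):
```python
import socket

# Try to open a TCP connection to the nniManager metrics port; this only
# succeeds if port 51189 is reachable through the firewall.
with socket.create_connection(('10.10.10.10', 51189), timeout=5):
    print('port 51189 is reachable')
```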
### Segmentation Fault (core dumped) when installing from source code
### Segmentation Fault (core dumped) when installing
> make: *** [install-XXX] Segmentation fault (core dumped)
There are two options:
Please try the following solutions in turn:
* Update or reinstall your current Python's pip, e.g. `python3 -m pip install -U pip`
* Install nni with --no-cache-dir flag like `python3 -m pip install nni --no-cache-dir`
* Install nni with `--no-cache-dir` flag like `python3 -m pip install nni --no-cache-dir`
### Job management error: getIPV4Address() failed because os.networkInterfaces().eth0 is undefined.
Your machine doesn't have an eth0 device. Please set nniManagerIp in your config file manually ([reference](https://github.com/Microsoft/nni/blob/master/docs/ExperimentConfig.md)).
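If you are unsure which address to put into `nniManagerIp`, here is a small sketch that discovers the machine's outbound IPv4 address (assuming it can route to an external host such as `8.8.8.8`):
```python
import socket

# Connecting a UDP socket sends no packets, but it lets the OS pick the
# outbound interface; the local endpoint is the address to use.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.connect(('8.8.8.8', 80))
    print(s.getsockname()[0])  # use this value for nniManagerIp
```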
......
......@@ -24,7 +24,7 @@
* __Install NNI through source code__
```bash
git clone -b v0.4.1 https://github.com/Microsoft/nni.git
git clone -b v0.5 https://github.com/Microsoft/nni.git
cd nni
source install.sh
```
......
......@@ -221,7 +221,8 @@ Metis belongs to the class of sequential model-based optimization (SMBO), and it
* It finds the global optimal point in the Gaussian Process space. This point represents the optimal configuration.
* It identifies the next hyper-parameter candidate. This is achieved by inferring the potential information gain of exploration, exploitation, and re-sampling.
Note that the only acceptable types of search space are `choice`, `quniform`, `uniform` and `randint`.
Note that the only acceptable types of search space are `choice`, `quniform`, `uniform` and `randint`. Only numerical `choice` is supported for now; more types will be supported later.
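For illustration, a minimal sketch of a search space that satisfies these constraints, written from Python to a hypothetical `search_space.json` (note that every `choice` value is numerical):
```python
import json

# A Metis-compatible search space: only choice/quniform/uniform/randint
# types, and 'choice' entries must be numbers.
search_space = {
    'batch_size': {'_type': 'choice', '_value': [16, 32, 64]},
    'learning_rate': {'_type': 'uniform', '_value': [0.0001, 0.1]},
    'hidden_size': {'_type': 'quniform', '_value': [128, 1024, 64]},
}

with open('search_space.json', 'w') as f:
    json.dump(search_space, f, indent=4)
```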
More details can be found in our paper: https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/
......
......@@ -23,7 +23,7 @@ Currently we only support installation on Linux & Mac.
* __Install NNI through source code__
```bash
git clone -b v0.4.1 https://github.com/Microsoft/nni.git
git clone -b v0.5 https://github.com/Microsoft/nni.git
cd nni
source install.sh
```
......
# Release 0.5.0 - 01/14/2019
## Major Features
### New tuner and assessor supports
* Support [Metis tuner](./HowToChooseTuner.md#MetisTuner) as a new NNI tuner. The Metis algorithm has been proven to perform well for **online** hyper-parameter tuning.
* Support [ENAS customized tuner](https://github.com/countif/enas_nni), contributed by a GitHub community user. It is an algorithm for neural architecture search that learns a network architecture via reinforcement learning and performs better than NAS.
* Support [Curve fitting assessor](./HowToChooseTuner.md#Curvefitting) for an early-stopping policy using learning curve extrapolation.
* Advanced support of [Weight Sharing](./AdvancedNAS.md): enable weight sharing for NAS tuners, currently through NFS.
### Training Service Enhancement
* [FrameworkController Training service](./FrameworkControllerMode.md): support running experiments using FrameworkController on Kubernetes
* FrameworkController is a controller on Kubernetes that is general enough to run (distributed) jobs with various machine learning frameworks, such as TensorFlow, PyTorch and MXNet.
* NNI provides a unified and simple specification for job definition.
* An MNIST example shows how to use FrameworkController.
### User Experience Improvements
* Better trial logging support for NNI experiments in PAI, Kubeflow and FrameworkController modes:
* An improved logging architecture sends the stdout/stderr of trials to the NNI manager via HTTP POST; the NNI manager stores each trial's stdout/stderr messages in a local log file.
* Show a link to the trial's log file on the WebUI.
* Support showing all key-value pairs of the final result.
# Release 0.4.1 - 12/14/2018
## Major Features
### New tuner supports
* Support [network morphism](./HowToChooseTuner.md#NetworkMorphism) as a new tuner
### Training Service improvements
* Migrate [Kubeflow training service](https://github.com/Microsoft/nni/blob/master/docs/KubeflowMode.md)'s dependency from kubectl CLI to [Kubernetes API](https://kubernetes.io/docs/concepts/overview/kubernetes-api/) client
* [Pytorch-operator](https://github.com/kubeflow/pytorch-operator) support for Kubeflow training service
* Improved uploading of local code files to OpenPAI HDFS
* Fixed an OpenPAI WebUI integration bug: the WebUI didn't show the latest trial job status, which was caused by OpenPAI token expiration
### NNICTL improvements
* Show version information in both nnictl and the WebUI. You can run **nnictl -v** to show your currently installed NNI version
### WebUI improvements
* Enable modifying the concurrency number during an experiment
* Add a feedback link to NNI's GitHub 'create issue' page
* Enable customizing the top 10 trials by metric value (largest or smallest)
* Enable downloading logs for the dispatcher & nnimanager
* Enable automatic scaling of axes for metric values
* Update annotation to support displaying real choices in the search space
## New examples
* [FashionMnist](https://github.com/Microsoft/nni/tree/master/examples/trials/network_morphism), which works together with the network morphism tuner
* [Distributed MNIST example](https://github.com/Microsoft/nni/tree/master/examples/trials/mnist-distributed-pytorch) written in PyTorch
# Release 0.4 - 12/6/2018
## Major Features
......
......@@ -7,10 +7,10 @@ Click the tab "Overview".
* See the experiment trial profile and search space.
* Support downloading the experiment result.
![](./img/over1.png)
![](./img/webui-img/over1.png)
* See the trials with good performance.
![](./img/over2.png)
![](./img/webui-img/over2.png)
## View job default metric
......@@ -38,9 +38,14 @@ Click the tab "Trial Duration" to see the bar graph.
Click the tab "Trials Detail" to see the status of the all trials. Specifically:
* Trial detail: the trial's id, duration, start time, end time, status, accuracy and search space file.
* If you run a pai experiment, you can also see the hdfsLogPath.
![](./img/table_openrow.png)
![](./img/webui-img/detail-local.png)
* If you run a pai or kubeflow experiment, you can also see the hdfsLog.
![](./img/webui-img/detail-pai.png)
![](./img/webui-img/trialog.png)
* Kill: you can kill a job whose status is running.
* Support searching for a specific trial.
......
authorName: default
experimentName: example_mnist-smartparam
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: local
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
authorName: default
experimentName: example_dist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
codeDir: .
worker:
replicas: 1
command: python3 mnist.py
gpuNum: 0
cpuNum: 1
memoryMB: 8192
image: msranni/nni:latest
kubeflowConfig:
operator: tf-operator
apiVersion: v1alpha2
storage: nfs
nfs:
server: 10.10.10.10
path: /var/nfs/general
\ No newline at end of file
authorName: default
experimentName: example_mnist-smartparam
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: pai
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
cpuNum: 1
memoryMB: 8196
#The docker image to run nni job on pai
image: msranni/nni:latest
#The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
dataDir: hdfs://10.10.10.10:9000/username/nni
#The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
outputDir: hdfs://10.10.10.10:9000/username/nni
paiConfig:
#The username to login pai
userName: username
#The password to login pai
passWord: password
#The host of restful server of pai
host: 10.10.10.10
\ No newline at end of file
"""A deep MNIST classifier using convolutional layers."""
import logging
import math
import tempfile
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import nni
FLAGS = None
logger = logging.getLogger('mnist_AutoML')
class MnistNetwork(object):
'''
MnistNetwork is for initializing and building a basic network for MNIST.
'''
def __init__(self,
channel_1_num,
channel_2_num,
pool_size,
x_dim=784,
y_dim=10):
self.channel_1_num = channel_1_num
self.channel_2_num = channel_2_num
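# NNI annotations: when the experiment runs with useAnnotation enabled,
# the tuner's chosen values replace the defaults of the calls below.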
self.conv_size = nni.choice(2, 3, 5, 7, name='conv-size')
self.hidden_size = nni.choice(124, 512, 1024) # example: without name
self.pool_size = pool_size
self.learning_rate = nni.uniform(0.0001, 0.1, name='learning_rate')
self.x_dim = x_dim
self.y_dim = y_dim
self.images = tf.placeholder(tf.float32, [None, self.x_dim], name='input_x')
self.labels = tf.placeholder(tf.float32, [None, self.y_dim], name='input_y')
self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')
self.train_step = None
self.accuracy = None
def build_network(self):
'''
Building network for mnist
'''
# Reshape to use within a convolutional neural net.
# Last dimension is for "features" - there is only one here, since images are
# grayscale -- it would be 3 for an RGB image, 4 for RGBA, etc.
with tf.name_scope('reshape'):
try:
input_dim = int(math.sqrt(self.x_dim))
except:
print(
'input dim cannot be sqrt and reshape. input dim: ' + str(self.x_dim))
logger.debug(
'input dim cannot be sqrt and reshape. input dim: %s', str(self.x_dim))
raise
x_image = tf.reshape(self.images, [-1, input_dim, input_dim, 1])
# First convolutional layer - maps one grayscale image to 32 feature maps.
with tf.name_scope('conv1'):
w_conv1 = weight_variable(
[self.conv_size, self.conv_size, 1, self.channel_1_num])
b_conv1 = bias_variable([self.channel_1_num])
h_conv1 = nni.function_choice(
lambda: tf.nn.relu(conv2d(x_image, w_conv1) + b_conv1),
lambda: tf.nn.sigmoid(conv2d(x_image, w_conv1) + b_conv1),
lambda: tf.nn.tanh(conv2d(x_image, w_conv1) + b_conv1)
) # example: without name
# Pooling layer - downsamples by 2X.
with tf.name_scope('pool1'):
h_pool1 = max_pool(h_conv1, self.pool_size)
h_pool1 = nni.function_choice(
lambda: max_pool(h_conv1, self.pool_size),
lambda: avg_pool(h_conv1, self.pool_size),
name='h_pool1')
# Second convolutional layer -- maps 32 feature maps to 64.
with tf.name_scope('conv2'):
w_conv2 = weight_variable([self.conv_size, self.conv_size,
self.channel_1_num, self.channel_2_num])
b_conv2 = bias_variable([self.channel_2_num])
h_conv2 = tf.nn.relu(conv2d(h_pool1, w_conv2) + b_conv2)
# Second pooling layer.
with tf.name_scope('pool2'): # example: another style
h_pool2 = max_pool(h_conv2, self.pool_size)
# Fully connected layer 1 -- after 2 round of downsampling, our 28x28 image
# is down to 7x7x64 feature maps -- maps this to 1024 features.
last_dim = int(input_dim / (self.pool_size * self.pool_size))
with tf.name_scope('fc1'):
w_fc1 = weight_variable(
[last_dim * last_dim * self.channel_2_num, self.hidden_size])
b_fc1 = bias_variable([self.hidden_size])
h_pool2_flat = tf.reshape(
h_pool2, [-1, last_dim * last_dim * self.channel_2_num])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, w_fc1) + b_fc1)
# Dropout - controls the complexity of the model, prevents co-adaptation of features.
with tf.name_scope('dropout'):
h_fc1_drop = tf.nn.dropout(h_fc1, self.keep_prob)
# Map the 1024 features to 10 classes, one for each digit
with tf.name_scope('fc2'):
w_fc2 = weight_variable([self.hidden_size, self.y_dim])
b_fc2 = bias_variable([self.y_dim])
y_conv = tf.matmul(h_fc1_drop, w_fc2) + b_fc2
with tf.name_scope('loss'):
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=self.labels, logits=y_conv))
with tf.name_scope('adam_optimizer'):
self.train_step = tf.train.AdamOptimizer(
self.learning_rate).minimize(cross_entropy)
with tf.name_scope('accuracy'):
correct_prediction = tf.equal(
tf.argmax(y_conv, 1), tf.argmax(self.labels, 1))
self.accuracy = tf.reduce_mean(
tf.cast(correct_prediction, tf.float32))
def conv2d(x_input, w_matrix):
"""conv2d returns a 2d convolution layer with full stride."""
return tf.nn.conv2d(x_input, w_matrix, strides=[1, 1, 1, 1], padding='SAME')
def max_pool(x_input, pool_size):
"""max_pool downsamples a feature map by 2X."""
return tf.nn.max_pool(x_input, ksize=[1, pool_size, pool_size, 1],
strides=[1, pool_size, pool_size, 1], padding='SAME')
def avg_pool(x_input, pool_size):
return tf.nn.avg_pool(x_input, ksize=[1, pool_size, pool_size, 1],
strides=[1, pool_size, pool_size, 1], padding='SAME')
def weight_variable(shape):
"""weight_variable generates a weight variable of a given shape."""
initial = tf.truncated_normal(shape, stddev=0.1)
return tf.Variable(initial)
def bias_variable(shape):
"""bias_variable generates a bias variable of a given shape."""
initial = tf.constant(0.1, shape=shape)
return tf.Variable(initial)
def main(params):
'''
Main function, build mnist network, run and send result to NNI.
'''
# Import data
mnist = input_data.read_data_sets(params['data_dir'], one_hot=True)
print('MNIST data download done.')
logger.debug('MNIST data download done.')
# Create the model
# Build the graph for the deep net
mnist_network = MnistNetwork(channel_1_num=params['channel_1_num'],
channel_2_num=params['channel_2_num'],
pool_size=params['pool_size'])
mnist_network.build_network()
logger.debug('Mnist build network done.')
# Write log
graph_location = tempfile.mkdtemp()
logger.debug('Saving graph to: %s', graph_location)
train_writer = tf.summary.FileWriter(graph_location)
train_writer.add_graph(tf.get_default_graph())
test_acc = 0.0
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
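# NNI annotation: the tuner picks the batch size from the listed values.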
batch_size = nni.choice(1, 4, 8, 16, 32, name='batch_size')
for i in range(2000):
batch = mnist.train.next_batch(batch_size)
dropout_rate = nni.choice(0.5, 0.9, name='dropout_rate')
mnist_network.train_step.run(feed_dict={mnist_network.images: batch[0],
mnist_network.labels: batch[1],
mnist_network.keep_prob: 1 - dropout_rate}
)
if i % 100 == 0:
test_acc = mnist_network.accuracy.eval(
feed_dict={mnist_network.images: mnist.test.images,
mnist_network.labels: mnist.test.labels,
mnist_network.keep_prob: 1.0})
nni.report_intermediate_result(test_acc)
logger.debug('test accuracy %g', test_acc)
logger.debug('Pipe send intermediate result done.')
test_acc = mnist_network.accuracy.eval(
feed_dict={mnist_network.images: mnist.test.images,
mnist_network.labels: mnist.test.labels,
mnist_network.keep_prob: 1.0})
nni.report_final_result(test_acc)
logger.debug('Final result is %g', test_acc)
logger.debug('Send final result done.')
def generate_default_params():
'''
Generate default parameters for mnist network.
'''
params = {
'data_dir': '/tmp/tensorflow/mnist/input_data',
'channel_1_num': 32,
'channel_2_num': 64,
'pool_size': 2}
return params
if __name__ == '__main__':
try:
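# Fetch the next set of hyper-parameters from the tuner; this must run
# before the annotated nni.* calls are evaluated.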
nni.get_next_parameter()
main(generate_default_params())
except Exception as exception:
logger.exception(exception)
raise
......@@ -132,10 +132,15 @@ class MetisTuner(Tuner):
self.x_types[idx] = 'range_continuous'
elif key_type == 'choice':
self.x_bounds[idx] = key_range
for key_value in key_range:
if not isinstance(key_value, (int, float)):
raise RuntimeError("Metis Tuner only support numerical choice.")
self.x_types[idx] = 'discrete_int'
else:
logger.info("Metis Tuner doesn't support this kind of variable.")
raise RuntimeError("Metis Tuner doesn't support this kind of variable.")
logger.info("Metis Tuner doesn't support this kind of variable: " + str(key_type))
raise RuntimeError("Metis Tuner doesn't support this kind of variable: " + str(key_type))
else:
logger.info("The format of search space is not a dict.")
raise RuntimeError("The format of search space is not a dict.")
......