Refactor nnictl and add config_pai.yml (#144)

* fix nnictl bug
* add hdfs host validation
* fix bugs
* fix dockerfile
* fix install.sh
* update install.sh
* fix dockerfile
* Set timeout for HDFSUtility exists function
* remove unused TODO
* fix sdk
* add optional for outputDir and dataDir
* refactor dockerfile.base
* Remove unused import in hdfsclientUtility
* add config_pai.yml
* refactor nnictl create logic and add colorful print
* fix nnictl stop logic
* add annotation for config_pai.yml
* add document for start experiment
* fix config.yml
* fix document

Commit d6bfe2a9 authored Sep 30, 2018 by SparkSnail, committed by GitHub Sep 30, 2018
20 changed files
--- a/docs/StartExperiment.md
+++ b/docs/StartExperiment.md
+How to start an experiment
+===
+## 1. Introduction
+Starting a new NNI experiment takes several steps. The process is shown below.
+<img src="./img/experiment_process.jpg" width="50%" height="50%" />
+## 2. Details
+### 2.1 Check environment
+The first step in starting an experiment is to check whether the environment is ready: nnictl checks whether an old experiment is still running and whether the RESTful server port is already occupied.
+nnictl also validates the content of the config YAML file to ensure the experiment config is in the correct format.
+### 2.2 Start RESTful server
+After checking the environment, nnictl starts a RESTful server process to manage the NNI experiment. The default port is 51188.
+### 2.3 Check RESTful server
+Before continuing, nnictl checks whether the RESTful server started successfully. If it did not, the startup process stops and shows an error message.
+### 2.4 Set experiment config
+nnictl needs to set the experiment config before starting an experiment. The experiment config contains the values from the config YAML file.
+### 2.5 Check experiment config
+nnictl ensures that the request to set the config was executed successfully.
+### 2.6 Start Web UI
+nnictl starts a Web UI process to serve the Web UI. The default Web UI port is 8080.
+### 2.7 Check Web UI
+If the Web UI does not start successfully, nnictl prints a warning and continues starting the experiment.
+### 2.8 Start Experiment
+This is the most important step in starting an NNI experiment: nnictl calls the RESTful server process to set up the experiment.
+### 2.9 Check experiment
+After starting the experiment, nnictl checks whether it was created correctly and shows more information about it to the user.
\ No newline at end of file
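
Section 2.1 of the new document says nnictl checks whether the RESTful server port is occupied before starting. The following is a minimal sketch of one way such a check could be written with the standard library only. The function name `is_port_in_use` and the loopback host are assumptions made for illustration and are not taken from nnictl's code; the two port numbers are the defaults named in the document.

```python
import socket

REST_PORT = 51188   # default RESTful server port, per the document
WEBUI_PORT = 8080   # default Web UI port, per the document

def is_port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP listener already accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0

if __name__ == "__main__":
    for name, port in (("RESTful server", REST_PORT), ("Web UI", WEBUI_PORT)):
        state = "occupied" if is_port_in_use(port) else "free"
        print("%s port %d is %s" % (name, port, state))
```

A connection probe like this only detects listeners that accept connections on the given host; trying to bind the port would be a stricter alternative.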
--- a/docs/img/experiment_process.jpg
+++ b/docs/img/experiment_process.jpg
--- a/examples/trials/auto-gbdt/config.yml
+++ b/examples/trials/auto-gbdt/config.yml
@@ -3,7 +3,7 @@ experimentName: example_auto-gbdt
 trialConcurrency: 1
 maxExecDuration: 10h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

--- a/examples/trials/auto-gbdt/config_pai.yml
+++ b/examples/trials/auto-gbdt/config_pai.yml
+authorName: default
+experimentName: example_auto-gbdt
+trialConcurrency: 1
+maxExecDuration: 10h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: minimize
+trial:
+  command: python3 main.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
\ No newline at end of file
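
The comments in config_pai.yml give the expected format of `hdfsDataDir` and `hdfsOutputDir` as 'hdfs://host:port/directory'. As a rough sketch, assuming that format is the only requirement, a value could be checked and split with `urllib.parse`. The helper name `parse_hdfs_dir` is hypothetical, and this is not necessarily how the hdfs host validation added in this commit works.

```python
from urllib.parse import urlsplit

def parse_hdfs_dir(url):
    """Split 'hdfs://host:port/directory' into (host, port, directory).

    Raises ValueError when the value does not follow that format.
    """
    parts = urlsplit(url)
    if parts.scheme != "hdfs":
        raise ValueError("expected an hdfs:// URL, got %r" % url)
    # parts.port is None when no port is given
    if not parts.hostname or parts.port is None:
        raise ValueError("missing host or port in %r" % url)
    if parts.path in ("", "/"):
        raise ValueError("missing directory in %r" % url)
    return parts.hostname, parts.port, parts.path

if __name__ == "__main__":
    # Placeholder value copied from the example config above
    print(parse_hdfs_dir("hdfs://10.10.10.10:9000/username/nni"))
```

Returning the host separately makes it possible to compare it against other settings, such as `paiConfig.host`, if a deployment requires that.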
--- a/examples/trials/ga_squad/config.yml
+++ b/examples/trials/ga_squad/config.yml
@@ -3,7 +3,7 @@ experimentName: example_ga_squad
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 #choice: true, false
 useAnnotation: false

--- a/examples/trials/ga_squad/config_pai.yml
+++ b/examples/trials/ga_squad/config_pai.yml
+authorName: default
+experimentName: example_ga_squad
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+#choice: true, false
+useAnnotation: false
+tuner:
+  codeDir: ../tuners/ga_customer_tuner
+  classFileName: customer_tuner.py
+  className: CustomerTuner
+  classArgs:
+    optimize_mode: maximize
+trial:
+  command: python3 trial.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
\ No newline at end of file
--- a/examples/trials/mnist-annotation/config.yml
+++ b/examples/trials/mnist-annotation/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 #choice: true, false
 useAnnotation: true

--- a/examples/trials/mnist-annotation/config_pai.yml
+++ b/examples/trials/mnist-annotation/config_pai.yml
+authorName: default
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+#choice: true, false
+useAnnotation: true
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
\ No newline at end of file
--- a/examples/trials/mnist-batch-tune-keras/config.yml
+++ b/examples/trials/mnist-batch-tune-keras/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist-keras
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

--- a/examples/trials/mnist-batch-tune-keras/config_pai.yml
+++ b/examples/trials/mnist-batch-tune-keras/config_pai.yml
+authorName: default
+experimentName: example_mnist-keras
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution, BatchTuner
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: BatchTuner
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist-keras.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
\ No newline at end of file
--- a/examples/trials/mnist-keras/config.yml
+++ b/examples/trials/mnist-keras/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist-keras
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

--- a/examples/trials/mnist-keras/config_pai.yml
+++ b/examples/trials/mnist-keras/config_pai.yml
+authorName: default
+experimentName: example_mnist-keras
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist-keras.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
\ No newline at end of file
--- a/examples/trials/mnist-smartparam/config.yml
+++ b/examples/trials/mnist-smartparam/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist-smartparam
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 #choice: true, false
 useAnnotation: true

--- a/examples/trials/mnist-smartparam/config_pai.yml
+++ b/examples/trials/mnist-smartparam/config_pai.yml
+authorName: default
+experimentName: example_mnist-smartparam
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+#choice: true, false
+useAnnotation: true
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
\ No newline at end of file
--- a/examples/trials/mnist/config.yml
+++ b/examples/trials/mnist/config.yml
@@ -3,7 +3,7 @@ experimentName: example_mnist
 trialConcurrency: 1
 maxExecDuration: 1h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

--- a/examples/trials/mnist/config_pai.yml
+++ b/examples/trials/mnist/config_pai.yml
+authorName: default
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 mnist.py
+  codeDir: .
+  gpuNum: 0
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
\ No newline at end of file
--- a/examples/trials/pytorch_cifar10/config.yml
+++ b/examples/trials/pytorch_cifar10/config.yml
@@ -3,7 +3,7 @@ experimentName: example_pytorch_cifar10
 trialConcurrency: 1
 maxExecDuration: 100h
 maxTrialNum: 10
-#choice: local, remote
+#choice: local, remote, pai
 trainingServicePlatform: local
 searchSpacePath: search_space.json
 #choice: true, false

--- a/examples/trials/pytorch_cifar10/config_pai.yml
+++ b/examples/trials/pytorch_cifar10/config_pai.yml
+authorName: default
+experimentName: example_pytorch_cifar10
+trialConcurrency: 1
+maxExecDuration: 100h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: pai
+searchSpacePath: search_space.json
+#choice: true, false
+useAnnotation: false
+tuner:
+  #choice: TPE, Random, Anneal, Evolution,
+  #SMAC (SMAC should be installed through nnictl)
+  builtinTunerName: TPE
+  classArgs:
+    #choice: maximize, minimize
+    optimize_mode: maximize
+trial:
+  command: python3 main.py
+  codeDir: .
+  gpuNum: 1
+  cpuNum: 1
+  memoryMB: 8196
+  #The docker image to run nni job on pai
+  image: openpai/pai.example.tensorflow
+  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
+  hdfsDataDir: hdfs://10.10.10.10:9000/username/nni
+  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  hdfsOutputDir: hdfs://10.10.10.10:9000/username/nni
+paiConfig:
+  #The username to login pai
+  userName: username
+  #The password to login pai
+  passWord: password
+  #The host of restful server of pai
+  host: 10.10.10.10
--- a/tools/nnicmd/common_utils.py
+++ b/tools/nnicmd/common_utils.py
@@ -21,7 +21,7 @@
 import json
 import yaml
 import psutil
-from .constants import ERROR_INFO, NORMAL_INFO
+from .constants import ERROR_INFO, NORMAL_INFO, WARNING_INFO, COLOR_RED_FORMAT, COLOR_YELLOW_FORMAT
 def get_yml_content(file_path):
    '''Load yaml file content'''
@@ -43,12 +43,16 @@ def get_json_content(file_path):
 def print_error(content):
    '''Print error information to screen'''
-    print(ERROR_INFO % content)
+    print(COLOR_RED_FORMAT % (ERROR_INFO % content))
 def print_normal(content):
    '''Print error information to screen'''
    print(NORMAL_INFO % content)
+def print_warning(content):
+    '''Print warning information to screen'''
+    print(COLOR_YELLOW_FORMAT % (WARNING_INFO % content))
 def detect_process(pid):
    '''Detect if a process is alive'''
    try:
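
The hunk above wraps the message template inside a color template, so two `%` substitutions are applied in sequence. The sketch below copies the relevant constants from constants.py to stay self-contained and returns the strings instead of printing them. The names `format_error` and `format_warning` and the sample messages are illustrative assumptions. Wrapping the argument as `(content,)` is a defensive addition not present in the diff: it keeps a tuple argument from being unpacked by the `%` operator.

```python
# Constants copied from tools/nnicmd/constants.py in this commit
ERROR_INFO = 'ERROR: %s'
WARNING_INFO = 'WARNING: %s'
COLOR_RED_FORMAT = '\033[1;31;31m%s\033[0m'
COLOR_YELLOW_FORMAT = '\033[1;33;33m%s\033[0m'

def format_error(content):
    """Build the red 'ERROR: ...' string; inner template first, then the color."""
    return COLOR_RED_FORMAT % (ERROR_INFO % (content,))

def format_warning(content):
    """Build the yellow 'WARNING: ...' string in the same way."""
    return COLOR_YELLOW_FORMAT % (WARNING_INFO % (content,))

if __name__ == "__main__":
    # Sample messages are made up for illustration
    print(format_error('port 51188 is occupied'))
    print(format_warning('Web UI failed to start'))
```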

--- a/tools/nnicmd/constants.py
+++ b/tools/nnicmd/constants.py
@@ -34,22 +34,37 @@ STDOUT_FULL_PATH = os.path.join(LOG_DIR, 'stdout')
 STDERR_FULL_PATH = os.path.join(LOG_DIR, 'stderr')
-ERROR_INFO = 'Error: %s'
+ERROR_INFO = 'ERROR: %s'
-NORMAL_INFO = 'Info: %s'
+NORMAL_INFO = 'INFO: %s'
-WARNING_INFO = 'Waining: %s'
+WARNING_INFO = 'WARNING: %s'
-EXPERIMENT_SUCCESS_INFO = 'Start experiment success! The experiment id is %s, and the restful server post is %s.\n' \
+EXPERIMENT_SUCCESS_INFO = '\033[1;32;32mSuccessfully started experiment!\n\033[0m' \
-                          'You can use these commands to get more information about this experiment:\n' \
+                          '-----------------------------------------------------------------------\n' \
+                          'The experiment id is %s\n'\
+                          'The restful server port is %s\n' \
+                          'The Web UI urls are: %s\n' \
+                          '-----------------------------------------------------------------------\n\n' \
+                          'You can use these commands to get more information about the experiment\n' \
+                          '-----------------------------------------------------------------------\n' \
                          '         commands                       description\n' \
                          '1. nnictl experiment show        show the information of experiments\n' \
                          '2. nnictl trial ls               list all of trial jobs\n' \
-                          '3. nnictl stop                   stop a experiment\n' \
+                          '3. nnictl log stderr             show stderr log content\n' \
-                          '4. nnictl trial kill             kill a trial job by id\n' \
+                          '4. nnictl log stdout             show stdout log content\n' \
-                          '5. nnictl --help                 get help information about nnictl\n' \
+                          '5. nnictl stop                   stop an experiment\n' \
-                          '6. nnictl webui url              get the url of web ui'
+                          '6. nnictl trial kill             kill a trial job by id\n' \
+                          '7. nnictl webui url              get the url of web ui\n' \
+                          '8. nnictl --help                 get help information about nnictl\n' \
+                          '-----------------------------------------------------------------------\n' \
 PACKAGE_REQUIREMENTS = {
    'SMAC': 'smac_tuner'
 }
+COLOR_RED_FORMAT = '\033[1;31;31m%s\033[0m'
+COLOR_GREEN_FORMAT = '\033[1;32;32m%s\033[0m'
+COLOR_YELLOW_FORMAT = '\033[1;33;33m%s\033[0m'
\ No newline at end of file
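
The new `EXPERIMENT_SUCCESS_INFO` template takes three `%s` values (experiment id, RESTful server port, Web UI URLs) and embeds the green escape sequence directly. The sketch below uses a shortened stand-in template and shows one way the header could reuse `COLOR_GREEN_FORMAT` instead. The function name `format_success`, the stand-in text, and joining several URLs with spaces are assumptions for illustration, not the behavior of nnictl.

```python
COLOR_GREEN_FORMAT = '\033[1;32;32m%s\033[0m'  # copied from constants.py

# Shortened stand-in for EXPERIMENT_SUCCESS_INFO with the same three placeholders
HEADER = COLOR_GREEN_FORMAT % 'Successfully started experiment!'
BODY = ('The experiment id is %s\n'
        'The restful server port is %s\n'
        'The Web UI urls are: %s')

def format_success(experiment_id, rest_port, webui_urls):
    """Fill the three placeholders; % binds tighter than +, so BODY is formatted first."""
    urls = '   '.join(webui_urls)
    return HEADER + '\n' + BODY % (experiment_id, rest_port, urls)

if __name__ == "__main__":
    # Example values are placeholders
    print(format_success('example_id', 51188, ['http://127.0.0.1:8080']))
```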