Note: You should set ``platform: aml`` in the NNI config YAML file if you want to start the experiment in aml mode.
Compared with `LocalMode <LocalMode.rst>`__, the training service configuration in aml mode has these additional keys:
* dockerImage

    * Required key. The Docker image name used in the job. NNI provides the image ``msranni/nni`` for running aml jobs.
...
...
* computeTarget

    * Required key. The compute cluster name you want to use in your AML workspace (`reference <https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target>`__). See Step 6.
* maxTrialNumberPerGpu

    * Optional key, default 1. Used to specify the max number of concurrent trials on one GPU device.
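Putting these keys together, a minimal sketch of an aml training service section could look like the following; the subscription, resource group, and workspace values are placeholders taken from your own Azure setup:

.. code-block:: yaml

   trainingService:
     platform: aml
     dockerImage: msranni/nni
     subscriptionId: ${your subscription ID}
     resourceGroup: ${your resource group}
     workspaceName: ${your workspace name}
     computeTarget: ${your compute cluster name}
     maxTrialNumberPerGpu: 1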
Before starting
You have an implementation of an MNIST classifier using convolutional layers; the Python code is similar to ``mnist.py``.
..
...
...
1.3 Report NNI results: Use the API ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to the assessor. Use the API ``nni.report_final_result(accuracy)`` to send ``accuracy`` to the tuner.
We have made the changes and saved them in ``mnist.py``.
**NOTE**\ :
.. code-block:: bash
   accuracy - The `accuracy` could be any Python object, but if you use NNI built-in tuner/assessor, `accuracy` should be a numerical variable (e.g. float, int).
   assessor - The assessor will decide which trial should early stop based on the history performance of the trial (intermediate results of one trial).
   tuner - The tuner will generate next parameters/architecture based on the explore history (final results of all trials).
..
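As a reference, a minimal trial skeleton using these APIs could look like the sketch below; ``train_one_epoch`` is a hypothetical stand-in for your real training step:

.. code-block:: python

   import random

   import nni


   def train_one_epoch(lr):
       # Hypothetical training step; returns a fake accuracy for illustration.
       return random.random()


   def main():
       # 1.2: receive hyper-parameters assigned by the tuner
       # (falls back to an empty dict when run outside NNI).
       params = nni.get_next_parameter() or {}
       lr = params.get('lr', 0.01)

       accuracy = 0.0
       for _ in range(10):
           accuracy = train_one_epoch(lr)
           # 1.3: report intermediate results; these go to the assessor.
           nni.report_intermediate_result(accuracy)

       # 1.3: report the final result; this goes to the tuner.
       nni.report_final_result(accuracy)


   if __name__ == '__main__':
       main()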
...
...
..
3.1 Enable NNI API mode
To enable NNI API mode, you need to set *useAnnotation* to *false* and provide the path of the SearchSpace file (the one you just defined in step 1):
.. code-block:: yaml

   useAnnotation: false
   searchSpacePath: /path/to/your/search_space.json
To run an experiment in NNI, you only need:
...
...
Let's use a simple trial example, e.g. mnist, provided by NNI. After you clone the NNI source, the examples are placed in ``~/nni/examples``; run ``ls ~/nni/examples/trials`` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
This command will be filled in the YAML configuration file below. Please refer to `here <../TrialExample/Trials.rst>`__ for how to write your own trial.
...
...
.. code-block:: yaml

   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
*name* is used to specify a tuner in NNI, *classArgs* are the arguments passed to the tuner (the spec of builtin tuners can be found `here <../Tuner/BuiltinTuner.rst>`__\ ), and *optimize_mode* indicates whether you want to maximize or minimize your trial's result.
**Prepare configuration file**\ : Since you already know which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run ``cat ~/nni/examples/trials/mnist-pytorch/config.yml`` to see it. Its content is basically shown below:
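.. code-block:: yaml

   # Sketch of the demo config; see the actual file for the authoritative content.
   searchSpaceFile: search_space.json
   trialCommand: python3 mnist.py
   trialGpuNumber: 0
   trialConcurrency: 1
   maxTrialNumber: 10
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
   trainingService:
     platform: local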
For the trial, we should provide *trialCommand*, which is the command to run the trial, and *trialCodeDirectory*, where the trial code is located; the command will be executed in this directory. We should also specify how many GPUs a trial requires via *trialGpuNumber*.
With all these steps done, we can run the experiment with the following command:
You can refer to `here <../Tutorial/Nnictl.rst>`__ for more usage of the *nnictl* command line tool.
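For instance, a few commonly used *nnictl* subcommands (the config file name is a placeholder):

.. code-block:: bash

   nnictl create --config config.yml   # start an experiment from a config file
   nnictl experiment list              # list experiments and their status
   nnictl stop                         # stop the running experiment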
...
...
Using multiple local GPUs to speed up search
--------------------------------------------
The following steps assume that you have 4 NVIDIA GPUs installed locally and PyTorch with CUDA support. The demo enables 4 concurrent trial jobs, and each trial job uses 1 GPU.
**Prepare configuration file**\ : NNI provides a demo configuration file for the setting above, ``cat ~/nni/examples/trials/mnist-pytorch/config_detailed.yml`` to see it. The *trialConcurrency* and *trialGpuNumber* values are different from the basic configuration file:
.. code-block:: yaml

   ...
   trialGpuNumber: 1
   # how many trials could be concurrently running
   trialConcurrency: 4
   ...
   trainingService:
     platform: local
     useActiveGpu: false  # set to "true" if you are using a graphical OS like Windows 10 or Ubuntu desktop
We can run the experiment with the following command:
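.. code-block:: bash

   # assuming the demo config shipped with the examples is used as-is
   nnictl create --config ~/nni/examples/trials/mnist-pytorch/config_detailed.yml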
You can use the *nnictl* command line tool or the WebUI to trace the training progress. The *nvidia-smi* command line tool can also help you monitor GPU usage during training.
.. code-block:: yaml
   localStorageMountPoint: /local/mnt
**Step 4. Get OpenPAI's storage config name and localStorageMountPoint**
The ``Team share storage`` field is the storage configuration used to specify storage values in OpenPAI. You can get the ``storageConfigName`` and ``containerStorageMountPoint`` fields from ``Team share storage``\ , for example:
.. code-block:: yaml
   storageConfigName: confignfs-data
   containerStorageMountPoint: /mnt/confignfs-data
Run an experiment
-----------------
Use ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
   searchSpaceFile: search_space.json
   trialCommand: python3 mnist.py
   trialGpuNumber: 0
   trialConcurrency: 1
   maxTrialNumber: 10
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
   trainingService:
     platform: openpai
     host: http://123.123.123.123
     username: ${your user name}
     token: ${your token}
     dockerImage: msranni/nni
     trialCpuNumber: 1
     trialMemorySize: 8GB
     storageConfigName: ${your storage config name}
     localStorageMountPoint: ${NFS mount point on local machine}
     containerStorageMountPoint: ${NFS mount point inside Docker container}
Note: You should set ``platform: openpai`` in the NNI config YAML file if you want to start the experiment in OpenPAI mode. The ``host`` field in the configuration file is OpenPAI's job submission page URI, like ``10.10.5.1``\ ; the default protocol in NNI is HTTPS. If your OpenPAI cluster has disabled HTTPS, please use the URI in ``http://10.10.5.1`` format.
OpenPAI configurations
^^^^^^^^^^^^^^^^^^^^^^
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , the ``trainingService`` configuration in OpenPAI mode has the following additional keys:
*
  username

  Required key. User name of OpenPAI platform.

*
  token

  Required key. Authentication key of OpenPAI platform.

*
  host

  Required key. The host of OpenPAI platform. It's OpenPAI's job submission page URI, like ``10.10.5.1``\ ; the default protocol in NNI is HTTPS. If your OpenPAI cluster has disabled HTTPS, please use the URI in ``http://10.10.5.1`` format.

*
  trialCpuNumber

  Optional key. Should be a positive number based on your trial program's CPU requirement. If it is not set in trial configuration, it should be set in the config specified in the ``openpaiConfig`` or ``openpaiConfigFile`` field.

*
  trialMemorySize

  Optional key. Should be in a format like ``2gb``\ , based on your trial program's memory requirement. If it is not set in trial configuration, it should be set in the config specified in the ``openpaiConfig`` or ``openpaiConfigFile`` field.

*
  dockerImage

  Optional key. In OpenPAI mode, your trial program will be scheduled by OpenPAI to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run.

  We already build a Docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it. If it is not set in trial configuration, it should be set in the config specified in the ``openpaiConfig`` or ``openpaiConfigFile`` field.
*
  virtualCluster

  Optional key. Set the virtualCluster of OpenPAI. If omitted, the job will run on the default virtual cluster.
*
  localStorageMountPoint

  Required key. Set the mount path on the machine where you run nnictl.

*
  containerStorageMountPoint

  Required key. Set the mount path in the container used in OpenPAI.

*
  storageConfigName

  Optional key. Set the storage name used in OpenPAI. If it is not set in trial configuration, it should be set in the config specified in the ``openpaiConfig`` or ``openpaiConfigFile`` field.

*
  openpaiConfigFile

  Optional key. Set the file path of the OpenPAI job configuration; the file is in YAML format.

  If users set ``openpaiConfigFile`` in NNI's configuration file, there is no need to specify the fields ``storageConfigName``\ , ``virtualCluster``\ , ``dockerImage``\ , ``trialCpuNumber``\ , ``trialGpuNumber``\ , and ``trialMemorySize`` in the configuration. These fields will use the values from the config file specified by ``openpaiConfigFile``.

*
  openpaiConfig

  Optional key. Similar to ``openpaiConfigFile``\ , but instead of referencing an external file, using this field you embed the content into NNI's config YAML.
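As an illustration, a training service section that delegates job settings to an external OpenPAI config file might look like the sketch below; the file name ``pai_job_config.yml`` is a placeholder:

.. code-block:: yaml

   trainingService:
     platform: openpai
     host: http://123.123.123.123
     username: ${your user name}
     token: ${your token}
     localStorageMountPoint: /local/mnt
     containerStorageMountPoint: /mnt/confignfs-data
     openpaiConfigFile: pai_job_config.yml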
Note:
...
...
#.
   If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job; users should ensure that only one taskRole reports metrics to NNI, otherwise there might be conflict errors.
Reuse mode is an experimental feature and is disabled by default. If enabled, NNI will reuse OpenPAI jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure each trial can run independently in the same job, for example, avoid loading checkpoints from previous trials.
Once you complete filling in the NNI experiment config file and save it (for example, as ``exp_pai.yml``), run the following command:
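.. code-block:: bash

   # assuming exp_pai.yml is in the current working directory
   nnictl create --config exp_pai.yml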
In remote mode, the training service is configured with a ``machineList``, for example:

.. code-block:: yaml

   trainingService:
     platform: remote
     machineList:
       - host: 192.0.2.1
         user: alice
         ssh_key_file: ~/.ssh/id_rsa
       - host: 192.0.2.2
         port: 10022   # port can be skipped if using the default SSH port 22
         user: bob
         password: bob123
         pythonPath: /usr/bin

Files in ``trialCodeDirectory`` will be uploaded to remote machines automatically. You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:
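.. code-block:: bash

   # run from the NNI source root; config_remote.yml is the example config referenced below
   nnictl create --config examples/trials/mnist-tfv2/config_remote.yml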
By default, commands and scripts are executed in the default environment of the remote machine. If there are multiple Python virtual environments on your remote machine and you want to run experiments in a specific environment, use **pythonPath** to specify a Python environment on your remote machine.
Use ``examples/trials/mnist-tfv2`` as the example. Below is the content of ``examples/trials/mnist-tfv2/config_remote.yml``\ :
The remote machine training service supports running experiments in reuse mode. In this mode, NNI will reuse remote machine jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure each trial can run independently in the same job, for example, avoid loading checkpoints from previous trials.
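A minimal sketch of turning this on, assuming the v2 config field is named ``reuseMode`` (verify against your NNI version's reference):

.. code-block:: yaml

   trainingService:
     platform: remote
     reuseMode: true   # assumed field name; check your NNI version's config reference
     machineList:
       - host: 192.0.2.1
         user: alice
         ssh_key_file: ~/.ssh/id_rsa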
Config file
^^^^^^^^^^^
One could start an NNI experiment with a config file. A config file for NNI is a ``yaml`` file usually including experiment settings (\ ``trialConcurrency``\ , ``trialGpuNumber``\ , etc.), platform settings (\ ``trainingService``\ ), path settings (\ ``searchSpaceFile``\ , ``trialCodeDirectory``\ , etc.) and tuner settings (\ ``tuner``\ , ``tuner optimize_mode``\ , etc.). Please refer to `here <../Tutorial/QuickStart.rst>`__ for more information.
Here is an example of tuning RocksDB with SMAC algorithm:
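A minimal sketch of such a config is below; the trial command, code directory, and search space file names are illustrative placeholders (the actual RocksDB benchmark lives in NNI's examples), and the SMAC tuner requires its extra dependencies to be installed:

.. code-block:: yaml

   searchSpaceFile: search_space.json
   trialCommand: python3 main.py        # placeholder benchmark entry point
   trialCodeDirectory: .
   trialGpuNumber: 0
   trialConcurrency: 1
   maxTrialNumber: 100
   tuner:
     name: SMAC
     classArgs:
       optimize_mode: maximize
   trainingService:
     platform: local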
In the **trial** part, if you want to use GPU to perform the architecture search, change ``trialGpuNumber`` from ``0`` to ``1``. You need to increase ``maxTrialNumber`` and ``maxExperimentDuration``\ , according to how long you want to wait for the search result.
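For instance, the adjusted fields might look like this (values are illustrative):

.. code-block:: yaml

   trialGpuNumber: 1
   maxTrialNumber: 100
   maxExperimentDuration: 24h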
2.3 Submit this job
^^^^^^^^^^^^^^^^^^^
...
...
Due to the memory limitation on uploads, we only upload the source code, and complete the data download and training on OpenPAI. This experiment requires sufficient memory (``memoryMB >= 32G``\ ), and the training may last for several hours.
3.1 Update configuration
^^^^^^^^^^^^^^^^^^^^^^^^
Modify ``nni/examples/trials/ga_squad/config_pai.yml``\ ; here is the default configuration:
Please change the default values to your personal account and machine information, including ``nniManagerIp``\ , ``userName``\ , ``passWord`` and ``host``.
In the "trial" part, if you want to use GPU to perform the architecture search, change ``gpuNum`` from ``0`` to ``1``. You need to increase the ``maxTrialNum`` and ``maxExecDuration``\ , according to how long you want to wait for the search result.
``trialConcurrency`` is the number of trials running concurrently; if you set ``gpuNum`` to 1, this is also the number of GPUs you want to use.
The evolution-algorithm based architecture for question answering has two different parts, just like any other example: the trial and the tuner.
3.2 The trial
^^^^^^^^^^^^^
The trial has a lot of different files, functions and classes. Here we only give a brief introduction to most of those files:
...
...
performs the actual conversion that maps each layer to a part of the TensorFlow computation graph.
3.3 The tuner
^^^^^^^^^^^^^
The tuner is much simpler than the trial. They actually share the same ``graph.py``. Besides, the tuner has a ``customer_tuner.py``\ ; the most important class in it is ``CustomerTuner``\ :
...
...
controls the mutation process. It will always take two random individuals from the population, keeping and mutating only the one with the better result.
3.4 Model configuration format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of the model configuration, which is passed from the tuner to the trial in the architecture search procedure.