* Add intermediate results to the export result (#2706)
* Add `command <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/Tutorial/Nnictl.rst#nnictl-trial>`__ to list trial results with highest/lowest metrics (#2747)
* Improve the user experience of `nnicli <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/nnicli_ref.rst>`__ with `examples <https://github.com/microsoft/nni/blob/v1.8/examples/notebooks/retrieve_nni_info_with_python.ipynb>`__ (#2713)
Neural architecture search
^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Search space zoo: ENAS and DARTS <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/NAS/SearchSpaceZoo.rst>`__ (#2589)
* API to query intermediate results in NAS benchmark (#2728)
Model compression
...
...
* Support the List/Tuple Construct/Unpack operation for TorchModuleGraph (#2609)
* Model speedup improvement: Add support of DenseNet and InceptionV3 (#2719)
* Support multiple successive tuple unpack operations (#2768)
* `Doc of comparing the performance of supported pruners <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/CommunitySharings/ModelCompressionComparison.rst>`__ (#2742)
* New pruners: `Sensitivity pruner <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/Compressor/Pruner.md#sensitivity-pruner>`__ (#2684) and `AMC pruner <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/Compressor/Pruner.rst>`__ (#2573) (#2786)
* TensorFlow v2 support in model compression (#2755)
Backward incompatible changes
...
...
* Support AML(Azure Machine Learning) platform as NNI training service.
* OpenPAI jobs can be reused. When a trial is completed, the OpenPAI job won't stop but will wait for the next trial. Refer to the `reuse flag in OpenPAI config <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrainingService/PaiMode.rst#openpai-configurations>`__.
* `Support ignoring files and folders in code directory with .nniignore when uploading code directory to training service <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrainingService/Overview.rst#how-to-use-training-service>`__.
Neural Architecture Search (NAS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Provide NAS Open Benchmarks (NasBench101, NasBench201, NDS) with friendly APIs <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/NAS/Benchmarks.rst>`__.
* `Support Classic NAS (i.e., non-weight-sharing mode) on TensorFlow 2.X <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/NAS/ClassicNas.rst>`__.
Model Compression
^^^^^^^^^^^^^^^^^
* Improve Model Speedup: track more dependencies among layers and automatically resolve mask conflicts; support the speedup of pruned ResNet.
* Added new pruners, including three auto model pruning algorithms: `NetAdapt Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.md#netadapt-pruner>`__\ , `SimulatedAnnealing Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.md#simulatedannealing-pruner>`__\ , `AutoCompress Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.md#autocompress-pruner>`__\ , and `ADMM Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.rst#admm-pruner>`__.
* Added `model sensitivity analysis tool <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/CompressionUtils.rst>`__ to help users find the sensitivity of each layer to the pruning.
* `Easy flops calculation for model compression and NAS <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/CompressionUtils.rst#model-flops-parameters-counter>`__.
* Update lottery ticket pruner to export winning ticket.
...
...
Examples
^^^^^^^^
* Automatically optimize tensor operators on NNI with a new `customized tuner OpEvo <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrialExample/OpEvoExamples.rst>`__.
Built-in tuners/assessors/advisors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Allow customized tuners/assessor/advisors to be installed as built-in algorithms <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Tutorial/InstallCustomizedAlgos.rst>`__.
WebUI
^^^^^
...
...
Documentation
-------------
* Improve `documentation for model compression <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Overview.rst>`__
and `examples <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/NAS/BenchmarksExample.ipynb>`__ for NAS benchmarks.
* Improve `documentation for AzureML training service <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrainingService/AMLMode.rst>`__
* Homepage migration to readthedocs.
Bug Fixes
...
...
NAS Updates
^^^^^^^^^^^
* NAS support for TensorFlow 2.0 (preview) `TF2.0 NAS examples <https://github.com/microsoft/nni/tree/v1.9/examples/nas/naive-tf>`__
* Use OrderedDict for LayerChoice
* Prettify the format of export
* Replace layer choice with selected module after applied fixed architecture
...
...
* update pai yaml merge logic
* support windows as remote machine in remote mode `Remote Mode <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/TrainingService/RemoteMachineMode.rst#windows>`__
Bug Fix
^^^^^^^
...
...
Hyper-Parameter Optimizing
^^^^^^^^^^^^^^^^^^^^^^^^^^
* New tuner: `Population Based Training (PBT) <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/Tuner/PBTTuner.rst>`__
* Trials can now report infinity and NaN as result
Neural Architecture Search
^^^^^^^^^^^^^^^^^^^^^^^^^^
* New NAS algorithm: `TextNAS <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/NAS/TextNAS.rst>`__
* ENAS and DARTS now support `visualization <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/NAS/Visualization.rst>`__ through web UI.
Model Compression
^^^^^^^^^^^^^^^^^
* New Pruner: `GradientRankFilterPruner <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/Compressor/Pruner.rst#gradientrankfilterpruner>`__
* Compressors will validate configuration by default
* Refactor: add the optimizer as an input argument of the pruner, for easy support of DataParallel and more efficient iterative pruning. This is a breaking change for the usage of iterative pruning algorithms.
* Model compression examples are refactored and improved
* Added documentation for `implementing compressing algorithm <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/Compressor/Framework.rst>`__
* Experimental `DLTS <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/TrainingService/DLTSMode.rst>`__ support
Overall Documentation Improvement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
...
Neural Architecture Search
^^^^^^^^^^^^^^^^^^^^^^^^^^
* Support `C-DARTS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/CDARTS.rst>`__ algorithm and add `the example <https://github.com/microsoft/nni/tree/v1.4/examples/nas/cdarts>`__ using it
* Support a preliminary version of `ProxylessNAS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/Proxylessnas.rst>`__ and the corresponding `example <https://github.com/microsoft/nni/tree/v1.4/examples/nas/proxylessnas>`__
* Add unit tests for the NAS framework
Model Compression
...
...
* Support DataParallel for compressing models, and provide `an example <https://github.com/microsoft/nni/blob/v1.4/examples/model_compress/multi_gpu.py>`__ of using DataParallel
* Support `model speedup <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Compressor/ModelSpeedup.rst>`__ for compressed models, in Alpha version
Training Service
^^^^^^^^^^^^^^^^
...
...
Others
^^^^^^
* Support running `NNI experiment at foreground <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Tutorial/Nnictl.md#manage-an-experiment>`__\ , i.e., ``--foreground`` argument in ``nnictl create/resume/view``
* Support canceling the trials in UNKNOWN state
* Support large search space whose size could be up to 50mb (thanks external contributor @Sundrops)
...
...
* Improve `the index structure <https://nni.readthedocs.io/en/latest/>`__ of NNI readthedocs
* Improve `documentation for NAS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/NasGuide.rst>`__
* Improve documentation for `the new PAI mode <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/TrainingService/PaiMode.rst>`__
* Add QuickStart guidance for `NAS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/QuickStart.md>`__ and `model compression <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Compressor/QuickStart.rst>`__
* Improve documentation for `the supported EfficientNet <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/TrialExample/EfficientNet.rst>`__
Bug Fixes
^^^^^^^^^
...
...
Model Compression Algorithms Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Knowledge Distillation <https://github.com/microsoft/nni/blob/v1.3/docs/en_US/TrialExample/KDExample.rst>`__ algorithm and the example using it
* Tuners can now use dedicated GPU resource (see ``gpuIndices`` in `tutorial <https://github.com/microsoft/nni/blob/v1.1/docs/en_US/Tutorial/ExperimentConfig.rst>`__ for details)
* New tuner: `PPO Tuner <https://github.com/microsoft/nni/blob/v1.1/docs/en_US/Tuner/PPOTuner.md>`__
* Web UI improvements
* Trials detail page can now list hyperparameters of each trial, as well as their start and end time (via "add column")
...
...
* More examples
* `EfficientNet PyTorch example <https://github.com/ultmaster/EfficientNet-PyTorch>`__
* `Cifar10 NAS example <https://github.com/microsoft/nni/blob/v1.1/examples/trials/nas_cifar10/README.rst>`__
* `Model compression toolkit - Alpha release <https://github.com/microsoft/nni/blob/v1.1/docs/en_US/Compressor/Overview.rst>`__\ : We are glad to announce the alpha release of the model compression toolkit on top of NNI. It is still in the experimental phase and might evolve based on usage feedback. We'd like to invite you to use it, give feedback, and even contribute.
Fixed Bugs
^^^^^^^^^^
...
...
* Provide auto feature interface
* Tuner based on beam search
* `Add Pakdd example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/auto-feature-engineering>`__
* Add a parallel algorithm to improve the performance of TPE with large concurrency. -PR #1052
* Support multiphase for hyperband -PR #1257
...
...
* ``nnictl experiment delete``\ : delete one or all experiments, including logs, results, environment information and cache. It is used to delete useless experiment results or to save disk space.
* ``nnictl platform clean``\ : used to clean up the disk on a target platform. The provided YAML file includes the information of the target platform, and it follows the same schema as the NNI configuration file.
Bug fix and other changes
^^^^^^^^^^^^^^^^^^^^^^^^^^
* Tuner Installation Improvements: add `sklearn <https://scikit-learn.org/stable/>`__ to nni dependencies.
* (Bug Fix) Failed to connect to PAI http code - `Issue #1076 <https://github.com/microsoft/nni/issues/1076>`__
...
...
* Support Kubeflow TFJob/PyTorchJob v1beta2
* `General NAS programming interface <https://github.com/microsoft/nni/blob/v0.8/docs/en_US/GeneralNasInterfaces.rst>`__
* Provide NAS programming interface for users to easily express their neural architecture search space through NNI annotation
* Provide a new command ``nnictl trial codegen`` for debugging the NAS code
...
...
Improvements
^^^^^^^^^^^^
* Making `log directory <https://github.com/microsoft/nni/blob/v0.5.1/docs/ExperimentConfig.rst>`__ configurable
* Support `different levels of logs <https://github.com/microsoft/nni/blob/v0.5.1/docs/ExperimentConfig.rst>`__\ , making it easier for debugging
Documentation
^^^^^^^^^^^^^
...
...
* Support `Metis tuner <Tuner/MetisTuner.rst>`__ as a new NNI tuner. The Metis algorithm has been proven to perform well for **online** hyper-parameter tuning.
* Support `ENAS customized tuner <https://github.com/countif/enas_nni>`__\ , a tuner contributed by a GitHub community user. It is an algorithm for neural architecture search that learns network architectures via reinforcement learning and achieves better performance than NAS.
* Support `Curve fitting assessor <Assessor/CurvefittingAssessor.rst>`__ for early stop policy using learning curve extrapolation.
* Advanced Support of `Weight Sharing <https://github.com/microsoft/nni/blob/v0.5/docs/AdvancedNAS.rst>`__\ : Enable weight sharing for NAS tuners, currently through NFS.
Training Service Enhancement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
...
New examples
^^^^^^^^^^^^
* `FashionMnist <https://github.com/microsoft/nni/tree/v1.9/examples/trials/network_morphism>`__\ , work together with network morphism tuner
* `Distributed MNIST example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/mnist-distributed-pytorch>`__ written in PyTorch
Release 0.4 - 12/6/2018
-----------------------
...
...
* `Kubeflow Training service <TrainingService/KubeflowMode.rst>`__
* Support tf-operator
* `Distributed trial example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/mnist-distributed/dist_mnist.py>`__ on Kubeflow
You are encouraged to `contribute more examples <Tutorial/Contributing.rst>`__ for other NNI users.
...
...
NNI also supports all libraries written in Python. Here are some common libraries, including some GBDT-based algorithms: XGBoost, CatBoost, and LightGBM.

This is just a small list of the libraries supported by NNI. If you are interested in NNI, you can refer to the `tutorial <TrialExample/Trials.rst>`__ to complete your own hacks.

In addition to the above examples, we also welcome more users to apply NNI to their own work. If you have any questions, please refer to `Write a Trial Run on NNI <TrialExample/Trials.rst>`__. In particular, if you want to become a contributor to NNI, whether by sharing examples, writing tuners, or anything else, we look forward to your participation. More information is available `here <Tutorial/Contributing.rst>`__.
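As a rough illustration of how any Python library can be tuned through NNI, here is a minimal, hypothetical trial sketch; the XGBoost hyper-parameter names and the Iris dataset are illustrative choices, not taken from the NNI examples.

.. code-block:: python

   import nni
   import xgboost as xgb
   from sklearn.datasets import load_iris
   from sklearn.model_selection import train_test_split

   if __name__ == '__main__':
       # Receive one hyper-parameter configuration from the tuner.
       params = nni.get_next_parameter()

       X, y = load_iris(return_X_y=True)
       X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

       # Train any Python model with the received hyper-parameters.
       model = xgb.XGBClassifier(
           max_depth=int(params.get('max_depth', 3)),
           learning_rate=params.get('learning_rate', 0.1),
       )
       model.fit(X_train, y_train)

       # Report the final metric back to the tuner.
       nni.report_final_result(model.score(X_test, y_test))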
#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Install the **AdaptDL Scheduler** to your Kubernetes cluster with Helm. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to set up the AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. (Optional) Prepare a **NFS server** and export a general purpose mount as external storage.
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
...
...
storageSize: 1Gi
Those configs not mentioned below are following the `default specs defined </Tutorial/ExperimentConfig.rst#configuration-spec>`__ in the NNI doc.
* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with AdaptDL scheduler.
NNI supports running an experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow for a specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run an NNI experiment.
Prerequisite for on-premises Kubernetes Service
...
...
apt-get install nfs-common
7. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
...
...
Note: You should explicitly set ``trainingServicePlatform: frameworkcontroller`` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config. You could refer to the `Tensorflow example of FrameworkController <https://github.com/microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for a deeper understanding.
Trial configuration in frameworkcontroller mode has the following configuration keys:
...
...
* cpuNum: the number of CPU devices used in the container.
* memoryMB: the memory limitation to be specified in the container.
* image: the docker image used to create the pod and run the program.
* frameworkAttemptCompletionPolicy: the policy to run the framework; please refer to the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.rst#frameworkattemptcompletionpolicy>`__ for specific information. Users could use the policy to control the pod; for example, if ps does not stop but only the worker stops, the completion policy can help stop ps.
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs; it communicates with the NNIManager module, and has different instances for different training platforms. For the time being, NNI supports the `local platform <LocalMode.rst>`__\ , `remote platform <RemoteMachineMode.rst>`__\ , `PAI platform <PaiMode.rst>`__\ , `kubeflow platform <KubeflowMode.rst>`__ and `FrameworkController platform <FrameworkControllerMode.rst>`__.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to implement a child class of TrainingService, and do not need to understand the code details of NNIManager, Dispatcher, or other modules.
Run NNI on hybrid mode means that NNI will run trials jobs in multiple kinds of training platforms. For example, NNI could submit trial jobs to remote machine and AML simultaneously.
Setup environment
-----------------
NNI supports `local <./LocalMode.rst>`__\ , `remote <./RemoteMachineMode.rst>`__\ , `PAI <./PaiMode.rst>`__\ , and `AML <./AMLMode.rst>`__ for the hybrid training service. Before starting an experiment using these modes, users should set up the corresponding environment for the platforms. More details about the environment setup can be found in the corresponding docs.
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
...
...
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: hybrid
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
...
...
command: python3 mnist.py
codeDir: .
gpuNum: 1
hybridConfig:
trainingServicePlatforms:
- local
- remote
...
...
username: bob
passwd: bob123
Configurations for hybrid mode:
hybridConfig:
* trainingServicePlatforms. Required key. This field specifies the platforms used in hybrid mode; the values use YAML list format. NNI supports setting ``local``, ``remote``, ``aml``, and ``pai`` in this field.
.. Note:: If a platform is set in ``trainingServicePlatforms``, users should also set the corresponding configuration for that platform. For example, if ``remote`` is set as one of the platforms, the ``machineList`` and ``remoteConfig`` configurations should also be set. The local platform in hybrid mode does not support Windows for now.
Now NNI supports running experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.
Prerequisite for on-premises Kubernetes Service
...
...
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes
#. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to setup Kubeflow.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in the ``root_squash option``\ , otherwise permission issues may arise when NNI copies files to NFS; refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or **Azure File Storage**.
#. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
7. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
...
...
* Report NNI results: Use the API: ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to assessor.
Use the API: ``nni.report_final_result(accuracy)`` to send ``accuracy`` to the tuner.
We have made the changes and saved them to ``mnist.py``.
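Put together, the reporting part of a trial looks roughly like the sketch below; the training and evaluation functions are stand-ins for the model code in ``mnist.py``, not part of the NNI API.

.. code-block:: python

   import random

   import nni

   def train_one_epoch(params):
       """Placeholder for one training pass over the MNIST data."""
       pass

   def evaluate(params):
       """Placeholder for computing validation accuracy."""
       return random.random()

   if __name__ == '__main__':
       params = nni.get_next_parameter()
       accuracy = 0.0
       for epoch in range(10):
           train_one_epoch(params)
           accuracy = evaluate(params)
           # Per-epoch metrics go to the assessor (e.g. for early stopping).
           nni.report_intermediate_result(accuracy)
       # The final metric goes to the tuner.
       nni.report_final_result(accuracy)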
...
...
Refer to `define search space <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search space.
...
...
You can download the NNI source code, and a set of examples can be found in ``nni/examples``; run ``ls nni/examples/trials`` to see all the trial examples.
Let's use a simple trial example, e.g. mnist, provided by NNI. After you install NNI, the NNI examples are put in ``~/nni/examples``; run ``ls ~/nni/examples/trials`` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.rst>`__\ , `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to `how to implement training service <./HowToImplementTrainingService.rst>`__ for details.
How to use Training Service?
----------------------------
Training service needs to be chosen and configured properly in experiment configuration YAML file. Users could refer to the document of each training service for how to write the configuration. Also, `reference <../Tutorial/ExperimentConfig.rst>`__ provides more details on the specification of the experiment configuration file.
Next, users should prepare code directory, which is specified as ``codeDir`` in config file. Please note that in non-local mode, the code directory will be uploaded to remote or cluster before the experiment. Therefore, we limit the number of files to 2000 and total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see :githublink:`this example <examples/trials/mnist-tfv1/.nniignore>` and the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
...
...
* - TrainingService
- Brief Introduction
* - `Local <./LocalMode.rst>`__
- NNI supports running an experiment on the local machine, called local mode. Local mode means that NNI will run the trial jobs and the nniManager process on the same machine, and supports the GPU scheduling function for trial jobs.
* - `Remote <./RemoteMachineMode.rst>`__
- NNI supports running an experiment on multiple machines through the SSH channel, called remote mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code. NNI will submit the trial jobs to the remote machines, and schedule a suitable machine with enough GPU resources if specified.
* - `PAI <./PaiMode.rst>`__
- NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__ (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.
* - `Kubeflow <./KubeflowMode.rst>`__
- NNI supports running experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.
* - `AdaptDL <./AdaptDLMode.rst>`__
- NNI supports running experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__\ , called AdaptDL mode. Before starting to use AdaptDL mode, you should have a Kubernetes cluster.
* - `FrameworkController <./FrameworkControllerMode.rst>`__
- NNI supports running an experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow for a specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run an NNI experiment.
* - `DLTS <./DLTSMode.rst>`__
- NNI supports running experiment using `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.
* - `AML <./AMLMode.rst>`__
- NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__ , called aml mode.
...
...
According to the architecture shown in `Overview <../Overview.rst>`__\ , the training service (platform) is actually responsible for three things: 1) initiating a new trial; 2) collecting metrics and communicating with the NNI core (NNI manager); 3) monitoring trial job status. To demonstrate in detail how training service works, we show the workflow of training service from the very beginning to the moment when the first trial succeeds.
Step 1. **Validate config and prepare the training platform.** Training service will first check whether the training platform the user specifies is valid (e.g., is there anything wrong with authentication). After that, training service will start to prepare for the experiment by making the code directory (\ ``codeDir``\ ) accessible to the training platform.
...
...
Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse mode), NNI copies a few more files (including parameters, the launch script, etc.) onto the training platform. After that, NNI launches the trial through subprocess, SSH, RESTful API, etc.
.. Warning:: The working directory of the trial command has exactly the same content as ``codeDir``, but can have different paths (even on different machines). Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copy a ``codeDir`` from the shared copy prepared in step 1, and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
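One way to stay portable, sketched below under the assumption that the trial ships a ``data/config.json`` file inside ``codeDir`` (a hypothetical file, not part of the NNI examples), is to resolve input paths relative to the trial script instead of to an absolute path that only happens to exist in local mode.

.. code-block:: python

   import json
   import os

   # The trial's working directory contains a copy of codeDir, but its
   # absolute path differs between training services and between trials.
   TRIAL_DIR = os.path.dirname(os.path.abspath(__file__))

   def load_trial_config():
       # Portable: relative to this trial's copy of codeDir.
       with open(os.path.join(TRIAL_DIR, 'data', 'config.json')) as f:
           return json.load(f)

   # Fragile: an absolute path into the shared codeDir only works in local mode.
   # with open('/home/user/experiment/data/config.json') as f: ...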
Step 3. **Collect metrics.** NNI then monitors the status of the trial, updates the recorded status (e.g., from ``WAITING`` to ``RUNNING``\ , ``RUNNING`` to ``SUCCEEDED``\ ), and also collects the metrics. Currently, most training services are implemented in an "active" way, i.e., the training service will call the RESTful API on the NNI manager to update the metrics. Note that this usually requires the machine that runs the NNI manager to be at least accessible to the worker node.
Trial configurations
^^^^^^^^^^^^^^^^^^^^
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , ``trial`` configuration in pai mode has the following additional keys:
*
...
...
We have already built a docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.
#. The job name in OpenPAI's configuration file will be replaced by a new job name created by NNI; the name format is ``nni_exp_{this.experimentId}_trial_{trialJobId}``.
#. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job; users should ensure that only one taskRole reports metrics to NNI, otherwise there might be conflict errors.
...
...
Expand a trial information in trial list view, click the logPath link like:
.. image:: ../../img/nni_webui_joblist.png
:scale: 30%
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
...
...
If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check.
The original ``pai`` mode has been renamed to ``paiYarn`` mode, which is a distributed training platform based on Yarn.
Setup environment
-----------------
Install NNI, follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai, paiYarn
trainingServicePlatform: paiYarn
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: false
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: ~/nni/examples/trials/mnist-tfv1
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
# Configuration to access OpenpaiYarn Cluster
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
Note: You should set ``trainingServicePlatform: paiYarn`` in NNI config YAML file if you want to start experiment in paiYarn mode.
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , trial configuration in paiYarn mode has these additional keys:
* cpuNum
* Required key. Should be a positive number based on your trial program's CPU requirement.
* memoryMB
* Required key. Should be a positive number based on your trial program's memory requirement.
* image
* Required key. In paiYarn mode, your trial program will be scheduled by OpenPAI to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run.
* We have already built a docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it.
* virtualCluster
* Optional key. Set the virtualCluster of OpenPAI. If omitted, the job will run on the default virtual cluster.
* shmMB
* Optional key. Set the shmMB configuration of OpenPAI; it sets the shared memory for one task in the task role.
* authFile
* Optional key. Set the auth file path for a private registry while using paiYarn mode (`refer <https://github.com/microsoft/paiYarn/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpaiYarn-job>`__\ ). You can prepare the authFile and simply provide the local path of this file; NNI will upload this file to HDFS for you.
* portList
* Optional key. Set the portList configuration of OpenPAI; it specifies a list of ports used in the container (`refer <https://github.com/microsoft/paiYarn/blob/b2324866d0280a2d22958717ea6025740f71b9f0/docs/job_tutorial.rst#specification>`__\ ).
The config schema in NNI is shown below:
.. code-block:: yaml
portList:
- label: test
beginAt: 8080
portNumber: 2
Let's say you want to launch TensorBoard in the mnist example using one of these ports. The first step is to write a wrapper script ``launch_paiYarn.sh`` around ``mnist.py``.
The config file of portList should be filled as following:
.. code-block:: yaml
trial:
command: bash launch_paiYarn.sh
portList:
- label: tensorboard
beginAt: 0
portNumber: 1
NNI supports two kinds of authorization methods in paiYarn mode: password and PAI token, see the `reference <https://github.com/microsoft/paiYarn/blob/b6bd2ab1c8890f91b7ac5859743274d2aa923c22/docs/rest-server/API.rst#2-authentication>`__. The authorization is configured in the ``paiYarnConfig`` field.
For password authorization, the ``paiYarnConfig`` schema is:
.. code-block:: yaml
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
For paiYarn token authorization, the ``paiYarnConfig`` schema is:
.. code-block:: yaml
paiYarnConfig:
userName: your_paiYarn_nni_user
token: your_paiYarn_token
host: 10.1.1.1
Once you have completed the NNI experiment config file and saved it (for example, as exp_paiYarn.yml), run the following command
.. code-block:: bash
nnictl create --config exp_paiYarn.yml
to start the experiment in paiYarn mode. NNI will create an OpenPAI job for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see the jobs created by NNI in the OpenPAI cluster's web portal, like:
.. image:: ../../img/nni_pai_joblist.jpg
:target: ../../img/nni_pai_joblist.jpg
:alt:
Notice: In paiYarn mode, NNIManager will start a REST server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the REST server will listen on ``8081`` to receive metrics from trial jobs running on OpenPAI. So you should enable the ``8081`` TCP port in your firewall rule to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial information in trial list view, click the logPath link like:
.. image:: ../../img/nni_webui_joblist.jpg
:target: ../../img/nni_webui_joblist.jpg
:alt:
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
.. image:: ../../img/nni_trial_hdfs_output.jpg
:target: ../../img/nni_trial_hdfs_output.jpg
:alt:
You can see there are three files in the output folder: stderr, stdout, and trial.log.
Data management
---------------
If your training data is not too large, it can be put into codeDir, and NNI will upload the data to HDFS; or you can build your own docker image with the data. If you have a large dataset, it is not appropriate to put the data in codeDir, and you can follow the `guidance <https://github.com/microsoft/paiYarn/blob/master/docs/user/storage.rst>`__ to mount the data folder in the container.
If you also want to save the trial's other output, like model files, into HDFS, you can use the environment variable ``NNI_OUTPUT_DIR`` in your trial code to save your own output files; the NNI SDK will copy all the files in ``NNI_OUTPUT_DIR`` from the trial's container to HDFS. The target path is ``hdfs://host:port/{username}/nni/{experiments}/{experimentId}/trials/{trialId}/nnioutput``.
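For example, a trial could save extra artifacts as sketched below; the file name and contents are placeholders for whatever your framework's save function produces.

.. code-block:: python

   import os

   # NNI sets NNI_OUTPUT_DIR for every trial; fall back to the current
   # directory when the script is run outside of an NNI experiment.
   output_dir = os.environ.get('NNI_OUTPUT_DIR', '.')

   model_path = os.path.join(output_dir, 'model.bytes')
   with open(model_path, 'wb') as f:
       f.write(b'serialized model weights')  # e.g. the output of torch.save or joblib.dump

   # Everything under NNI_OUTPUT_DIR is copied from the container to HDFS
   # after the trial finishes.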
Version check
-------------
NNI has supported the version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with trialKeeper, and to avoid errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check feature only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check.
If you want multiple commands to be executed, you can use ``&&`` to connect them, e.g. ``preCommand: command1 && command2 && command3``.
**Note**\ : Because **preCommand** will be executed before other commands each time, it is strongly recommended not to set a **preCommand** that makes changes to the system, e.g. ``mkdir`` or ``touch``.
Remote machine supports running an experiment in reuse mode. In this mode, NNI will reuse remote machine jobs to run as many trials as possible. This saves the time of creating new jobs. Users need to make sure each trial can run independently in the same job, for example, by avoiding loading checkpoints from previous trials.
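For instance, a trial might isolate its checkpoints by keying paths on its own trial id, roughly as sketched below; ``NNI_TRIAL_JOB_ID`` is the environment variable NNI sets for each trial, while the checkpoint file name is a placeholder.

.. code-block:: python

   import os

   # In reuse mode several trials may run inside the same remote job, so a
   # fixed checkpoint path could pick up files left over from an earlier trial.
   trial_id = os.environ.get('NNI_TRIAL_JOB_ID', 'standalone')
   ckpt_dir = os.path.join('checkpoints', trial_id)
   os.makedirs(ckpt_dir, exist_ok=True)

   ckpt_path = os.path.join(ckpt_dir, 'model.ckpt')  # placeholder file name
   # Save and restore only within this trial's own directory, e.g.:
   # torch.save(model.state_dict(), ckpt_path)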
A CNN MNIST classifier for deep learning is similar to ``hello world`` for programming languages. Thus, we use MNIST as an example to introduce the different features of NNI. The examples are listed below:
* `MNIST with NNI API (PyTorch) <#mnist-pytorch>`__
* `MNIST with NNI API (TensorFlow v2.x) <#mnist-tfv2>`__
* `MNIST with NNI API (TensorFlow v1.x) <#mnist-tfv1>`__
* `MNIST with NNI annotation <#mnist-annotation>`__
* `MNIST in Keras <#mnist-keras>`__
* `MNIST -- tuning with batch tuner <#mnist-batch>`__
...
...
@@ -18,65 +19,70 @@ CNN MNIST classifier for deep learning is similar to ``hello world`` for program
* `distributed MNIST (tensorflow) using kubeflow <#mnist-kubeflow-tf>`__
* `distributed MNIST (pytorch) using kubeflow <#mnist-kubeflow-pytorch>`__
:raw-html:`<a name="mnist-tfv1"></a>`
**MNIST with NNI API (TensorFlow v1.x)**
:raw-html:`<a name="mnist-pytorch"></a>`
**MNIST with NNI API (PyTorch)**
This is a simple network which has two convolutional layers, two pooling layers and a fully connected layer.
We tune hyperparameters, such as dropout rate, convolution size, hidden size, etc.
It can be tuned with most NNI built-in tuners, such as TPE, SMAC, Random.
We also provide an example YAML file which enables an assessor.
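A sketch of what a search space covering such hyperparameters could look like (the parameter names and ranges below are illustrative, not necessarily the ones used by the shipped example):

.. code-block:: json

   {
       "dropout_rate": {"_type": "uniform", "_value": [0.5, 0.9]},
       "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
       "hidden_size": {"_type": "choice", "_value": [128, 512, 1024]},
       "batch_size": {"_type": "choice", "_value": [16, 32]},
       "learning_rate": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]}
   }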
Same network as the example above, but written with the TensorFlow v1.x API.
This example is implemented in Keras. It is also a network for the MNIST dataset, with two convolution layers, one pooling layer, and two fully connected layers.
This example is similar to the example above; the only difference is that it uses NNI annotation to specify the search space and report results, while the example above uses NNI APIs to receive configurations and report results.
This example shows how to use the batch tuner. Users simply list all the configurations they want to try in the search space file, and NNI will try all of them.
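With the batch tuner, the search space file simply enumerates the complete configurations to run; a hedged sketch (the hyperparameter names are illustrative):

.. code-block:: json

   {
       "combine_params": {
           "_type": "choice",
           "_value": [
               {"optimizer": "Adam", "learning_rate": 0.001},
               {"optimizer": "Adam", "learning_rate": 0.0001},
               {"optimizer": "SGD", "learning_rate": 0.01}
           ]
       }
   }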
This example shows how to use Hyperband to tune the model. There is one more key, ``STEPS``\ , in the received configuration, which tells the trial how long it can run (e.g., the number of iterations).
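A minimal sketch of how a trial might consume that key (here ``STEPS`` is treated as a number of epochs; the training loop is a placeholder):

.. code-block:: python

   import random

   import nni

   params = nni.get_next_parameter()
   # 'STEPS' is the budget assigned by Hyperband; fall back to a small default outside an experiment
   budget = int(params.get('STEPS', 10))

   accuracy = 0.0
   for epoch in range(budget):
       # placeholder "training": replace with one real epoch of training and evaluation
       accuracy = min(1.0, accuracy + random.uniform(0.0, 0.1))
       nni.report_intermediate_result(accuracy)

   nni.report_final_result(accuracy)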
This example shows how to run distributed training on Kubeflow through NNI. Users simply provide the distributed training code and a configuration file which specifies the Kubeflow mode, for example the command to run the ps, the command to run the worker, and how many resources they consume. This example is implemented in TensorFlow and therefore uses the Kubeflow TensorFlow operator.
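A heavily abridged sketch of what such a configuration could contain (field names follow the Kubeflow training service schema as commonly documented, and all values are placeholders; consult the Kubeflow mode documentation of your NNI version for the authoritative layout):

.. code-block:: yaml

   trainingServicePlatform: kubeflow
   kubeflowConfig:
     operator: tf-operator
   trial:
     codeDir: .
     ps:
       replicas: 1
       command: python3 dist_mnist.py    # placeholder training script
       gpuNum: 0
       cpuNum: 1
       memoryMB: 8192
       image: msranni/nni:latest
     worker:
       replicas: 2
       command: python3 dist_mnist.py    # placeholder training script
       gpuNum: 1
       cpuNum: 2
       memoryMB: 8192
       image: msranni/nni:latest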
@@ -33,7 +33,7 @@ We prepared a dockerfile for setting up experiment environments. Before starting
Run Experiments:
----------------
Three representative kinds of tensor operators, **matrix multiplication**\ , **batched matrix multiplication** and **2D convolution**\ , are chosen from BERT and AlexNet and tuned with NNI. The ``Trial`` code for all tensor operators is ``/root/compiler_auto_tune_stable.py``\ , and the ``Search Space`` files and ``config`` files for each tuning algorithm are located in ``/root/experiments/``\ , categorized by tensor operator. Here ``/root`` refers to the root of the container.
For tuning the operators of matrix multiplication, please run the commands below from ``/root``\ :
...
...
@@ -111,7 +111,7 @@ For tuning the operators of 2D convolution, please run below commands from ``/ro
Please note that G-BFS and N-A2C are only designed for tuning the tiling schemes of multiplications of matrices whose numbers of rows and columns are powers of 2, so they are not compatible with other types of configuration spaces and thus are not eligible to tune the batched matrix multiplication and 2D convolution operators. Here, AutoTVM is implemented by its authors in the TVM project, so the tuning results are printed on the screen rather than reported to the NNI manager. Port 8080 of the container is bound to the same port on the host, so one can access the NNI Web UI through ``host_ip_addr:8080`` and monitor the tuning process as in the screenshot below.
The performance of RocksDB is highly contingent on its tuning. However, because of the complexity of its underlying technology and the large number of configurable parameters, a good configuration is sometimes hard to obtain. NNI can help to address this issue: it supports many kinds of tuning algorithms to search for the best configuration of RocksDB, and it supports many kinds of environments such as local machines, remote servers, and the cloud.
This example illustrates how to use NNI to search the best configuration of RocksDB for a ``fillrandom`` benchmark supported by a benchmark tool ``db_bench``\ , which is an official benchmark tool provided by RocksDB itself. Therefore, before running this example, please make sure NNI is installed and `\ ``db_bench`` <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__ is in your ``PATH``. Please refer to `here <../Tutorial/QuickStart.md>`__ for detailed information about installation and preparing of NNI environment, and `here <https://github.com/facebook/rocksdb/blob/master/INSTALL.rst>`__ for compiling RocksDB as well as ``db_bench``.
This example illustrates how to use NNI to search the best configuration of RocksDB for a ``fillrandom`` benchmark supported by a benchmark tool ``db_bench``\ , which is an official benchmark tool provided by RocksDB itself. Therefore, before running this example, please make sure NNI is installed and `db_bench <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__ is in your ``PATH``. Please refer to `here <../Tutorial/QuickStart.rst>`__ for detailed information about installation and preparing of NNI environment, and `here <https://github.com/facebook/rocksdb/blob/master/INSTALL.md>`__ for compiling RocksDB as well as ``db_bench``.
We also provide a simple script :githublink:`db_bench_installation.sh <examples/trials/systems/rocksdb-fillrandom/db_bench_installation.sh>` helping to compile and install ``db_bench`` as well as its dependencies on Ubuntu. Installing RocksDB on other systems can follow the same procedure.
We also provide a simple script :githublink:`db_bench_installation.sh <examples/trials/systems_auto_tuning/rocksdb-fillrandom/db_bench_installation.sh>` helping to compile and install ``db_bench`` as well as its dependencies on Ubuntu. Installing RocksDB on other systems can follow the same procedure.
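Once ``db_bench`` is installed, a quick smoke test such as the following (the flag values are arbitrary and only meant to confirm the binary is on your ``PATH``) can be run before starting the NNI experiment:

.. code-block:: bash

   # a tiny fillrandom run just to confirm db_bench works
   db_bench --benchmarks=fillrandom --num=100000 --value_size=100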
Other tuners can be easily adopted in the same way. Please refer to `here <../Tuner/BuiltinTuner.rst>`__ for more information.
...
...
@@ -97,8 +97,8 @@ We ran these two examples on the same machine with following details:
The detailed experiment results are shown in the figure below. The horizontal axis is the sequential order of the trials, and the vertical axis is the metric, write OPS in this example. Blue dots represent trials tuning RocksDB with the SMAC tuner, and orange dots represent trials tuning RocksDB with the TPE tuner.
@@ -87,7 +87,7 @@ Modify ``nni/examples/trials/ga_squad/config.yml``\ , here is the default config
codeDir: ~/nni/examples/trials/ga_squad
gpuNum: 0
In the "trial" part, if you want to use GPU to perform the architecture search, change ``gpuNum`` from ``0`` to ``1``. You need to increase the ``maxTrialNum`` and ``maxExecDuration``\ , according to how long you want to wait for the search result.
In the **trial** part, if you want to use GPU to perform the architecture search, change ``gpuNum`` from ``0`` to ``1``. You need to increase the ``maxTrialNum`` and ``maxExecDuration``\ , according to how long you want to wait for the search result.
2.3 submit this job
^^^^^^^^^^^^^^^^^^^
...
...
@@ -120,7 +120,7 @@ Modify ``nni/examples/trials/ga_squad/config_pai.yml``\ , here is the default co
Refer to `SearchSpaceSpec.md <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search spaces. Tuner will generate configurations from this search space, that is, choosing a value for each hyperparameter from the range.
Refer to `SearchSpaceSpec <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search spaces. Tuner will generate configurations from this search space, that is, choosing a value for each hyperparameter from the range.
Step 2 - Update model code
^^^^^^^^^^^^^^^^^^^^^^^^^^
...
...
@@ -80,7 +80,7 @@ To enable NNI API mode, you need to set useAnnotation to *false* and provide the
You can refer to `here <../Tutorial/ExperimentConfig.rst>`__ for more information about how to set up experiment configurations.
Please refer to `here </sdk_reference.html>`__ for more APIs (e.g., ``nni.get_sequence_id()``\ ) provided by NNI.
Please refer to `here <../sdk_reference.rst>`__ for more APIs (e.g., ``nni.get_sequence_id()``\ ) provided by NNI.
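For reference, a minimal trial skeleton using these APIs might look like the sketch below (the training logic and the reported metric values are placeholders):

.. code-block:: python

   import nni

   params = nni.get_next_parameter()     # dict of hyperparameters chosen by the tuner
   trial_id = nni.get_trial_id()         # unique id of this trial
   sequence_id = nni.get_sequence_id()   # index of this trial within the experiment

   # ... train and evaluate the model with `params` here ...
   accuracy = 0.9                        # placeholder metric

   nni.report_intermediate_result(accuracy)  # per-epoch metric (call once per epoch in real code)
   nni.report_final_result(accuracy)         # final metric consumed by the tuner/assessor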
:raw-html:`<a name="nni-annotation"></a>`
...
...
@@ -167,7 +167,7 @@ NNI supports a standalone mode for trial code to run without starting an NNI exp
nni.get_trial_id()     # returns "STANDALONE"
nni.get_sequence_id()  # returns 0
You can try standalone mode with the :githublink:`mnist example <examples/trials/mnist-tfv1>`. Simply run ``python3 mnist.py`` under the code directory. The trial code should successfully run with the default hyperparameter values.
You can try standalone mode with the :githublink:`mnist example <examples/trials/mnist-pytorch>`. Simply run ``python3 mnist.py`` under the code directory. The trial code should successfully run with the default hyperparameter values.
For more information on debugging, please refer to `How to Debug <../Tutorial/HowToDebug.rst>`__.