Commit b40e3db7 authored by quzha

Merge branch 'master' of github.com:Microsoft/nni into dev-retiarii

parents efa4e31c 95f731e4
@@ -51,6 +51,7 @@ build/Release
# Dependency directories
node_modules/
jspm_packages/
**/package-lock.json
# TypeScript v1 declaration files
typings/
@@ -81,6 +82,8 @@ __pycache__
build
*.egg-info
setup.pye
**/__init__.pye
**/.ipynb_checkpoints
# Environments
.env
# Contributing to NNI
Welcome, and thank you for your interest in contributing to NNI!
There are many ways in which you can contribute, beyond writing code. The goal of this document is to provide a high-level overview of how you can get involved.
# Provide feedback or ask a question
* [File an issue](https://github.com/microsoft/nni/issues/new/choose) on GitHub.
* Ask a question with NNI tags on [Stack Overflow](https://stackoverflow.com/questions/tagged/nni?sort=Newest&edited=true).
* Discuss on the NNI [Gitter](https://gitter.im/Microsoft/nni?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge).
Join IM discussion groups:
|Gitter||WeChat|
|----|----|----|
|![image](https://user-images.githubusercontent.com/39592018/80665738-e0574a80-8acc-11ea-91bc-0836dc4cbf89.png)| OR |![image](https://github.com/scarlett2018/nniutil/raw/master/wechat.png)|
# Look for an existing issue
Before you create a new issue, please do a search in [open issues](https://github.com/microsoft/nni/issues) to see if the issue or feature request has already been filed.
Be sure to scan through the [most popular](https://github.com/microsoft/nni/issues?q=is%3Aopen+is%3Aissue+label%3AFAQ+sort%3Areactions-%2B1-desc) feature requests.
If you find your issue already exists, make relevant comments and add your [reaction](https://github.com/blog/2119-add-reactions-to-pull-requests-issues-and-comments). Use a reaction in place of a "+1" comment:
* 👍 - upvote
* 👎 - downvote
If you cannot find an existing issue that describes your bug or feature, create a new issue using the guidelines below.
# Writing good bug reports or feature requests
File a single issue per problem or feature request. Do not enumerate multiple bugs or feature requests in the same issue.
Provide as much information as you think might be relevant to the context (imagine the issue were assigned to you: what information would you need to debug it?). To give you a general idea of what kinds of information are useful for developers to dig into the issue, we have provided an issue template for you.
Once you have submitted an issue, be sure to follow it for questions and discussions.
Once the bug is fixed or the feature is addressed, be sure to close the issue.
# Contributing fixes or examples
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
# Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
# How to Contribute
After getting familiar with the contribution agreements, you are ready to create your first PR =). Follow the NNI developer tutorials to get started:
* We recommend that new contributors start with simple issues: ['good first issue'](https://github.com/Microsoft/nni/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or ['help-wanted'](https://github.com/microsoft/nni/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22).
* [NNI developer environment installation tutorial](docs/en_US/Tutorial/SetupNniDeveloperEnvironment.md)
* [How to debug](docs/en_US/Tutorial/HowToDebug.md)
* If you have any questions on usage, review the [FAQ](https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/FAQ.md) first. If there are no relevant issues and answers to your question, try contacting the NNI dev team and users in [Gitter](https://gitter.im/Microsoft/nni?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) or [file an issue](https://github.com/microsoft/nni/issues/new/choose) on GitHub.
* [Customize your own Tuner](docs/en_US/Tuner/CustomizeTuner.md)
* [Implement customized TrainingService](docs/en_US/TrainingService/HowToImplementTrainingService.md)
* [Implement a new NAS trainer on NNI](docs/en_US/NAS/Advanced.md)
* [Customize your own Advisor](docs/en_US/Tuner/CustomizeAdvisor.md)
@@ -14,7 +14,7 @@
[简体中文](README_zh_CN.md)
**NNI (Neural Network Intelligence)** is a lightweight but powerful toolkit to help users **automate** <a href="docs/en_US/FeatureEngineering/Overview.md">Feature Engineering</a>, <a href="docs/en_US/NAS/Overview.md">Neural Architecture Search</a>, <a href="docs/en_US/Tuner/BuiltinTuner.md">Hyperparameter Tuning</a> and <a href="docs/en_US/Compression/Overview.md">Model Compression</a>.
The tool manages automated machine learning (AutoML) experiments, **dispatches and runs** experiments' trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in **different training environments** like <a href="docs/en_US/TrainingService/LocalMode.md">Local Machine</a>, <a href="docs/en_US/TrainingService/RemoteMachineMode.md">Remote Servers</a>, <a href="docs/en_US/TrainingService/PaiMode.md">OpenPAI</a>, <a href="docs/en_US/TrainingService/KubeflowMode.md">Kubeflow</a>, <a href="docs/en_US/TrainingService/FrameworkControllerMode.md">FrameworkController on K8S (AKS etc.)</a>, <a href="docs/en_US/TrainingService/DLTSMode.md">DLWorkspace (aka. DLTS)</a>, <a href="docs/en_US/TrainingService/AMLMode.md">AML (Azure Machine Learning)</a> and other cloud options.
@@ -135,24 +135,25 @@ Within the following table, we summarized the current NNI capabilities, we are g
<li><a href="docs/en_US/NAS/Proxylessnas.md">ProxylessNAS</a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#NetworkMorphism">Network Morphism</a></li>
<li><a href="docs/en_US/NAS/TextNAS.md">TextNAS</a></li>
<li><a href="docs/en_US/NAS/Cream.md">Cream</a></li>
</ul>
</ul>
<a href="docs/en_US/Compression/Overview.md">Model Compression</a>
<ul>
<b>Pruning</b>
<ul>
<li><a href="docs/en_US/Compression/Pruner.md#agp-pruner">AGP Pruner</a></li>
<li><a href="docs/en_US/Compression/Pruner.md#slim-pruner">Slim Pruner</a></li>
<li><a href="docs/en_US/Compression/Pruner.md#fpgm-pruner">FPGM Pruner</a></li>
<li><a href="docs/en_US/Compression/Pruner.md#netadapt-pruner">NetAdapt Pruner</a></li>
<li><a href="docs/en_US/Compression/Pruner.md#simulatedannealing-pruner">SimulatedAnnealing Pruner</a></li>
<li><a href="docs/en_US/Compression/Pruner.md#admm-pruner">ADMM Pruner</a></li>
<li><a href="docs/en_US/Compression/Pruner.md#autocompress-pruner">AutoCompress Pruner</a></li>
</ul>
<b>Quantization</b>
<ul>
<li><a href="docs/en_US/Compression/Quantizer.md#qat-quantizer">QAT Quantizer</a></li>
<li><a href="docs/en_US/Compression/Quantizer.md#dorefa-quantizer">DoReFa Quantizer</a></li>
</ul>
</ul>
<a href="docs/en_US/FeatureEngineering/Overview.md">Feature Engineering (Beta)</a>
@@ -331,8 +332,8 @@ With authors' permission, we listed a set of NNI usage examples and relevant art
* [Hyperparameter Tuning for Matrix Factorization](https://github.com/microsoft/recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) with NNI
* [scikit-nni](https://github.com/ksachdeva/scikit-nni) Hyper-parameter search for scikit-learn pipelines using NNI
* ### **Relevant Articles** ###
* [Hyper Parameter Optimization Comparison](docs/en_US/CommunitySharings/HpoComparison.md)
* [Neural Architecture Search Comparison](docs/en_US/CommunitySharings/NasComparison.md)
* [Parallelizing a Sequential Algorithm TPE](docs/en_US/CommunitySharings/ParallelizingTpeSearch.md)
* [Automatically tuning SVD with NNI](docs/en_US/CommunitySharings/RecommendersSvd.md)
* [Automatically tuning SPTAG with NNI](docs/en_US/CommunitySharings/SptagAutoTune.md)
@@ -10,7 +10,7 @@
**NNI (Neural Network Intelligence)** is a lightweight but powerful toolkit that helps users **automate** [Feature Engineering](docs/zh_CN/FeatureEngineering/Overview.md), [Neural Architecture Search](docs/zh_CN/NAS/Overview.md), [Hyperparameter Tuning](docs/zh_CN/Tuner/BuiltinTuner.md), and [Model Compression](docs/zh_CN/Compressor/Overview.md).
NNI manages automated machine learning (AutoML) experiments and **dispatches and runs** the trial jobs generated by tuning algorithms to find the best neural architecture and/or hyperparameters, supporting **various training environments** such as [Local Machine](docs/zh_CN/TrainingService/LocalMode.md), [Remote Servers](docs/zh_CN/TrainingService/RemoteMachineMode.md), [OpenPAI](docs/zh_CN/TrainingService/PaiMode.md), [Kubeflow](docs/zh_CN/TrainingService/KubeflowMode.md), [FrameworkController on K8S (e.g., AKS)](docs/zh_CN/TrainingService/FrameworkControllerMode.md), [DLWorkspace](docs/zh_CN/TrainingService/DLTSMode.md) (aka DLTS), [AML](docs/zh_CN/TrainingService/AMLMode.md) (Azure Machine Learning), and other environments.
## **Use Cases**
@@ -328,8 +328,8 @@ You can use these commands to get more information about the experiment
* [Hyperparameter Tuning for Matrix Factorization](https://github.com/microsoft/recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) with NNI
* [scikit-nni](https://github.com/ksachdeva/scikit-nni) Hyperparameter search for scikit-learn pipelines using NNI
* ### **Relevant Articles** ###
* [Hyper Parameter Optimization Comparison](docs/zh_CN/CommunitySharings/HpoComparison.md)
* [Neural Architecture Search Comparison](docs/zh_CN/CommunitySharings/NasComparison.md)
* [Parallelizing a Sequential Algorithm: TPE](docs/zh_CN/CommunitySharings/ParallelizingTpeSearch.md)
* [Automatically tuning SVD with NNI](docs/zh_CN/CommunitySharings/RecommendersSvd.md)
* [Automatically tuning SPTAG with NNI](docs/zh_CN/CommunitySharings/SptagAutoTune.md)
@@ -121,14 +121,28 @@ fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
```
## Model FLOPs/Parameters Counter
We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating FLOPs/parameters of a normal model without masks; it can also calculate FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression on NNI. Note that, for structured pruning, we only identify the remaining filters according to their masks, without taking the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
We support two modes for collecting information of modules. The first mode is `default`, which only collects the information of convolution and linear layers. The second mode is `full`, which also collects the information of other operations. Users can easily use the collected `results` for further analysis.
### Usage
```
import torch
from nni.compression.pytorch.utils.counter import count_flops_params

# model: the torch.nn.Module to be measured
# Given input size (1, 1, 28, 28)
flops, params, results = count_flops_params(model, (1, 1, 28, 28))
# Given an input tensor with size (1, 1, 28, 28), and switch to full mode
x = torch.randn(1, 1, 28, 28)
flops, params, results = count_flops_params(model, (x,), mode='full')  # tuple of tensors as input

# Format output size to M (i.e., 10^6)
print(f'FLOPs: {flops/1e6:.3f}M, Params: {params/1e6:.3f}M')
print(results)
{
'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
}
```
# CDARTS
## Introduction
[CDARTS](https://arxiv.org/pdf/2006.10724.pdf) builds a cyclic feedback mechanism between the search and evaluation networks. First, the search network generates an initial topology for evaluation, so that the weights of the evaluation network can be optimized. Second, the architecture topology in the search network is further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks, and thus enables the evolution of the topology to fit the final evaluation network.
In the implementation of `CdartsTrainer`, two models and two mutators (one for each) are first instantiated. The first model is the so-called "search network", which is mutated with a `RegularizedDartsMutator` -- a mutator with subtle differences from `DartsMutator`. The second model is the "evaluation network", which is mutated with a discrete mutator that leverages the previous search network's mutator to sample a single path each time. Trainers train the models and mutators alternately. Users can refer to the [paper](https://arxiv.org/pdf/2006.10724.pdf) if they are interested in more details on these trainers and mutators.
## Reproduction Results
This is CDARTS based on the NNI platform, which currently supports CIFAR10 search and retrain. ImageNet search and retrain should also be supported, and we provide corresponding interfaces. Our reproduced results on NNI are slightly lower than the paper's, but much higher than the original DARTS'. Here we show the results of three independent experiments on CIFAR10.
| Runs | Paper | NNI |
| ---- |:-------------:| :-----:|
| 1 | 97.52 | 97.44 |
| 2 | 97.53 | 97.48 |
@@ -19,7 +20,7 @@ This is CDARTS based on the NNI platform, which currently supports CIFAR10 searc
## Examples
[Example code](https://github.com/microsoft/nni/tree/master/examples/nas/cdarts)
```bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
@@ -55,3 +56,4 @@ bash run_retrain_cifar.sh
.. autoclass:: nni.algorithms.nas.pytorch.cdarts.RegularizedMutatorParallel
    :members:
```
# Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search
**[[Paper]](https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf) [[Models-Google Drive]](https://drive.google.com/drive/folders/1NLGAbBF9bA1IUAxKlk2VjgRXhr6RHvRW?usp=sharing)[[Models-Baidu Disk (PWD: wqw6)]](https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g) [[BibTex]](https://scholar.googleusercontent.com/scholar.bib?q=info:ICWVXc_SsKAJ:scholar.google.com/&output=citation&scisdr=CgUmooXfEMfTi0cV5aU:AAGBfm0AAAAAX7sQ_aXoamdKRaBI12tAVN8REq1VKNwM&scisig=AAGBfm0AAAAAX7sQ_RdYtp6BSro3zgbXVJU2MCgsG730&scisf=4&ct=citation&cd=-1&hl=ja)** <br/>
In this work, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. The discovered architectures achieve superior performance compared to the recent [MobileNetV3](https://arxiv.org/abs/1905.02244) and [EfficientNet](https://arxiv.org/abs/1905.11946) families under aligned settings.
<div >
<img src="https://github.com/microsoft/Cream/blob/main/demo/intro.jpg" width="800"/>
</div>
## Reproduced Results
Top-1 accuracy on ImageNet. The top-1 accuracy of the Cream search algorithm surpasses MobileNetV3 and EfficientNet-B0/B1 on ImageNet.
Training with 16 GPUs is slightly better than with 8 GPUs, as shown below.
| Model (M FLOPs) | 8 GPUs | 16 GPUs |
| ---- |:-------------:| :-----:|
| 14M | 53.7 | 53.8 |
| 43M | 65.8 | 66.5 |
| 114M | 72.1 | 72.8 |
| 287M | 76.7 | 77.6 |
| 481M | 78.9 | 79.2 |
| 604M | 79.4 | 80.0 |
<table style="border: none">
<th><img src="./../../img/cream_flops100.jpg" alt="drawing" width="400"/></th>
<th><img src="./../../img/cream_flops600.jpg" alt="drawing" width="400"/></th>
</table>
## Examples
[Example code](https://github.com/microsoft/nni/tree/master/examples/nas/cream)
Please run the following scripts in the example folder.
## Data Preparation
You need to first download [ImageNet-2012](http://www.image-net.org/) to the folder `./data/imagenet` and move the validation set to the subfolder `./data/imagenet/val`. To move the validation set, you could use the following script: <https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh>
Put the ImageNet data in `./data`. It should look like the following:
```
./data/imagenet/train
./data/imagenet/val
...
```
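A minimal preparation flow might look like the following sketch; the tarball name and location are assumptions, since the download step depends on how you obtained ImageNet:
```bash
# Assumes the ImageNet validation tarball has been downloaded already (name/location illustrative).
mkdir -p ./data/imagenet/val
tar -xf ILSVRC2012_img_val.tar -C ./data/imagenet/val
# valprep.sh (linked above) sorts the flat validation images into per-class subfolders.
cd ./data/imagenet/val
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
```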
## Quick Start
### I. Search
First, build environments for searching.
```
pip install -r ./requirements
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cpp_ext --cuda_ext
```
To search for an architecture, you need to configure the parameters `FLOPS_MINIMUM` and `FLOPS_MAXIMUM` to specify the desired model FLOPs range, such as [0, 600] M FLOPs. You can specify the FLOPs interval by changing these two parameters in `./configs/train.yaml`:
```
FLOPS_MINIMUM: 0 # Minimum Flops of Architecture
FLOPS_MAXIMUM: 600 # Maximum Flops of Architecture
```
For example, if you expect to search for an architecture with model FLOPs <= 200M, please set `FLOPS_MINIMUM` and `FLOPS_MAXIMUM` to `0` and `200`.
After you specify the FLOPs range of the architectures you would like to search, you can start the search by running:
```
python -m torch.distributed.launch --nproc_per_node=8 ./train.py --cfg ./configs/train.yaml
```
The searched architectures need to be retrained to obtain the final model. The final model is saved in `.pth.tar` format. Retraining code will be released soon.
### II. Retrain
To train the searched architectures, you need to configure the parameter `MODEL_SELECTION` to specify the model FLOPs. To specify which model to train, you should add `MODEL_SELECTION` in `./configs/retrain.yaml`. You can select one from [14, 43, 114, 287, 481, 604], which stand for different FLOPs (M).
```
MODEL_SELECTION: 43 # Retrain 43m model
MODEL_SELECTION: 481 # Retrain 481m model
......
```
To train random architectures, you need to specify `MODEL_SELECTION` as `-1` and configure the parameter `INPUT_ARCH`:
```
MODEL_SELECTION: -1 # Train random architectures
INPUT_ARCH: [[0], [3], [3, 3], [3, 1, 3], [3, 3, 3, 3], [3, 3, 3], [0]] # Random Architectures
......
```
After adding `MODEL_SELECTION` in `./configs/retrain.yaml`, you need to use the following command to train the model.
```
python -m torch.distributed.launch --nproc_per_node=8 ./retrain.py --cfg ./configs/retrain.yaml
```
### III. Test
To test our trained models, you need to use `MODEL_SELECTION` in `./configs/test.yaml` to specify which model to test.
```
MODEL_SELECTION: 43 # test 43m model
MODEL_SELECTION: 481 # test 481m model
......
```
After specifying the FLOPs of the model, you need to write the path to the resumed model in `./test.sh`.
```
RESUME_PATH: './43.pth.tar'
RESUME_PATH: './481.pth.tar'
......
```
We provide 14M/43M/114M/287M/481M/604M pretrained models on [Google Drive](https://drive.google.com/drive/folders/1CQjyBryZ4F20Rutj7coF8HWFcedApUn2) or [Baidu Disk (password: wqw6)](https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g).
After downloading the pretrained models and adding `MODEL_SELECTION` and `RESUME_PATH` in `./configs/test.yaml`, you need to use the following command to test the model.
```
python -m torch.distributed.launch --nproc_per_node=8 ./test.py --cfg ./configs/test.yaml
```
@@ -15,7 +15,7 @@ The above-mentioned example is meant to reproduce the results in the paper, we d
| | In paper | Reproduction |
| ---------------------- | ------------- | ------------ |
| First order (CIFAR10) | 3.00 +/- 0.14 | 2.78 |
| Second order (CIFAR10) | 2.76 +/- 0.09 | 2.80 |
## Examples
@@ -14,4 +14,5 @@ One-shot NAS algorithms leverage weight sharing among models in neural architect
SPOS <SPOS>
CDARTS <CDARTS>
ProxylessNAS <Proxylessnas>
TextNAS <TextNAS>
Cream <Cream>
# Run an Experiment on AdaptDL
NNI now supports running an experiment on [AdaptDL](https://github.com/petuum/adaptdl). Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. In AdaptDL mode, your trial program will run as an AdaptDL job in the Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
## Prerequisite for Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes [on Azure](https://azure.microsoft.com/en-us/services/kubernetes-service/), or [on-premise](https://kubernetes.io/docs/setup/) with [cephfs](https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd), or [microk8s with storage add-on enabled](https://microk8s.io/docs/addons).
2. Use Helm to install the **AdaptDL Scheduler** in your Kubernetes cluster. Follow this [guideline](https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html) to set up the AdaptDL scheduler.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable (see the example after this list). Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
5. (Optional) Prepare an **NFS server** and export a general purpose mount as external storage.
6. Install **NNI**, following the install guide [here](../Tutorial/QuickStart.md).
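For example, to point NNI at a kubeconfig other than the default, you might export the variable before launching the experiment; the path below is illustrative:
```bash
# KUBECONFIG overrides the default $(HOME)/.kube/config lookup (path is illustrative).
export KUBECONFIG=$HOME/.kube/adaptdl-config
```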
### Verify Prerequisites
```bash
nnictl --version
# Expected: <version_number>
```
```bash
kubectl version
# Expected that the kubectl client version matches the server version.
```
```bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
```
## Run an experiment
We have a CIFAR10 example that fully leverages the AdaptDL scheduler under the `examples/trials/cifar10_pytorch` folder (`main_adl.py` and `config_adl.yaml`).
Here is a template configuration specification to use AdaptDL as a training service.
```yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
adaptive: false # optional.
image: <image_tag>
imagePullSecrets: # optional
- name: stagingsecret
codeDir: .
command: python main.py
gpuNum: 1
cpuNum: 1 # optional
memorySize: 8Gi # optional
nfs: # optional
server: 10.20.41.55
path: /
containerMountPath: /nfs
checkpoint: # optional
storageClass: microk8s-hostpath
storageSize: 1Gi
```
Configs not mentioned below follow the
[default specs defined in the NNI doc](https://nni.readthedocs.io/en/latest/Tutorial/ExperimentConfig.html#configuration-spec).
* **trainingServicePlatform**: Choose `adl` to use the Kubernetes cluster with AdaptDL scheduler.
* **nniManagerIp**: *Required* for the `adl` training service to get the correct info and metrics back from the cluster.
IP address of the machine with NNI manager (NNICTL) that launches the NNI experiment.
* **logCollection**: *Recommended* to set as `http`. It will collect the trial logs on the cluster back to your machine via http.
* **tuner**: It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuners).
* **trial**: It defines the specs of an `adl` trial.
* **adaptive**: (*Optional*) Boolean for the AdaptDL trainer. When `true`, the job is preemptible and adaptive.
* **image**: Docker image for the trial.
* **imagePullSecrets**: (*Optional*) If you are using a private registry,
you need to provide the secret to successfully pull the image.
* **codeDir**: the working directory of the container. `.` means the default working directory defined by the image.
* **command**: the bash command to start the trial.
* **gpuNum**: the number of GPUs requested for this trial. It must be a non-negative integer.
* **cpuNum**: (*Optional*) the number of CPUs requested for this trial. It must be a non-negative integer.
* **memorySize**: (*Optional*) the size of memory requested for this trial. It must follow the Kubernetes
[default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
* **nfs**: (*Optional*) mounting external storage. For more information about using NFS please check the paragraph below.
* **checkpoint**: (*Optional*) [storage settings](https://kubernetes.io/docs/concepts/storage/storage-classes/) for AdaptDL internal checkpoints. You can leave it out unless you are a dev user.
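With a spec like the template above saved to a file, the experiment is launched with `nnictl` as usual; a minimal sketch, assuming the file is named `config_adl.yaml`:
```bash
# Launch the experiment on the AdaptDL-backed cluster (the filename is illustrative).
nnictl create --config config_adl.yaml
```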
### NFS Storage
As you may have noticed in the above configuration spec,
an *optional* section is available to configure NFS external storage. It is optional when no external storage is required, for example when a Docker image is sufficient with the code and data inside.
Note that the `adl` training service does NOT mount the NFS on the local dev machine, so you may manually mount it locally, manage the filesystem, and copy the data or code.
The `adl` training service can then mount it into Kubernetes for every trial, given the proper configurations:
* **server**: NFS server address, e.g., IP address or domain.
* **path**: NFS server export path, i.e., the absolute path on the NFS server that can be mounted to trials.
* **containerMountPath**: the absolute path inside the container at which to mount the NFS **path** above,
so that every trial has access to the NFS.
In the trial containers, you can access the NFS with this path.
Use cases:
* If your training trials depend on a dataset of large size, you may want to download it onto the NFS first,
and mount it so that it can be shared across multiple trials.
* The storage for containers is ephemeral and the trial containers will be deleted after a trial's lifecycle is over.
So if you want to export your trained models,
you may mount the NFS to the trial to persist and export them, as sketched below.
In short, trials are not limited in how they read from or write to the NFS storage, so you may use it flexibly as per your needs.
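As a sketch of the model-export use case above, a PyTorch trial could save its weights under the `containerMountPath` from the template (`/nfs`); the subdirectory and filename are illustrative, and `model` stands for your trained module:
```python
import os
import torch

# /nfs is the containerMountPath from the config template; subdirectory and filename are illustrative.
export_dir = os.path.join("/nfs", "exported_models", os.getenv("NNI_TRIAL_JOB_ID", "local"))
os.makedirs(export_dir, exist_ok=True)
torch.save(model.state_dict(), os.path.join(export_dir, "model.pt"))  # model: your trained torch.nn.Module
```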
## Monitor via Log Stream
Follow the log streaming of a certain trial:
```bash
nnictl log trial --trial_id=<trial_id>
```
```bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
```
Note that *after* a trial is done and its pod has been deleted,
its logs can no longer be retrieved via this command.
However, you may still be able to access past trial logs
using the following approach.
## Monitor via TensorBoard
In the context of NNI, an experiment has multiple trials.
For easy comparison across trials during a model tuning process,
we support TensorBoard integration. Here each experiment has
an independent TensorBoard logging directory and thus dashboard.
You can only use the TensorBoard while the monitored experiment is running.
In other words, monitoring stopped experiments is not supported.
In the trial container you have access to two environment variables:
* `ADAPTDL_TENSORBOARD_LOGDIR`: the TensorBoard logging directory for the current experiment,
* `NNI_TRIAL_JOB_ID`: the `trial` job ID for the current trial.
It is recommended to join them as the logging directory for the trial,
for example in Python:
```python
import os
tensorboard_logdir = os.path.join(
os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
os.getenv("NNI_TRIAL_JOB_ID")
)
```
If an experiment is stopped, the data logged here
(defined by *the above envs* for monitoring with the following commands)
will be lost. To persist the logged data, you can use external storage (e.g., mount an NFS)
to export it and view the TensorBoard locally.
With the above setting, you can monitor the experiment easily
via TensorBoard by
```bash
nnictl tensorboard start
```
If multiple experiments are running at the same time, you may use
```bash
nnictl tensorboard start <experiment_id>
```
It will provide you the web URL to access the TensorBoard.
Note that you have the flexibility to set up the local `--port`
for the TensorBoard.
@@ -4,7 +4,7 @@
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use the training services provided by NNI to run trial jobs on a [local machine](./LocalMode.md), on [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [AdaptDL](./AdaptDLMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*.
If the computing resource you want to use is not listed above, NNI provides an interface that allows users to build their own training service easily. Please refer to "[how to implement training service](./HowToImplementTrainingService)" for details.
@@ -24,6 +24,7 @@ In case users intend to use large files in their experiment (like large-scaled d
|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.|
|[__PAI__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.|
|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
|[__AdaptDL__](./AdaptDLMode.md)|NNI supports running experiments on [AdaptDL](https://github.com/petuum/adaptdl), called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.|
|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
|[__DLTS__](./DLTSMode.md)|NNI supports running experiment using [DLTS](https://github.com/microsoft/DLWorkspace.git), which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.|
|[__AML__](./AMLMode.md)|NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/), called aml mode.|
@@ -2,25 +2,27 @@
===
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.
[toc]
## Setup environment
**Step 1. Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).**
**Step 2. Get token.**
Open the web portal of OpenPAI, and click the `My profile` button at the top right.
<img src="../../img/pai_profile.jpg" style="zoom: 80%;" />
Click the `copy` button on the page to copy a JWT token.
<img src="../../img/pai_token.jpg" style="zoom:67%;" />
**Step 3. Mount NFS storage to local machine.**
Click the `Submit job` button in the web portal.
<img src="../../img/pai_job_submission_page.jpg" style="zoom: 50%;" />
Find the data management region in the job submission page.
<img src="../../img/pai_data_management_page.jpg" style="zoom: 33%;" />
The `Preview container paths` field shows the NFS host and path that OpenPAI provides. You need to mount the corresponding host and path to your local machine first; then NNI can use OpenPAI's NFS storage.
For example, use the following command:
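A typical NFS mount on Linux might look like this sketch; the host and export path are illustrative and should be replaced with the values shown in `Preview container paths`:
```bash
# Mount OpenPAI's NFS export locally (host and path are illustrative); /local/mnt matches
# the nniManagerNFSMountPath used in the configuration below.
sudo mount -t nfs 10.10.10.10:/data/nfs /local/mnt
```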
@@ -33,9 +35,9 @@ You could use the following configuration in your NNI's config file:
```yaml
nniManagerNFSMountPath: /local/mnt
```
**Step 4. Get OpenPAI's storage config name and nniManagerMountPath**
The `Team share storage` field is the storage configuration used to specify the storage value in OpenPAI. You can get the `paiStorageConfigName` and `containerNFSMountPath` fields from `Team share storage`, for example:
@@ -44,7 +46,10 @@ paiStorageConfigName: confignfs-data
containerNFSMountPath: /mnt/confignfs-data
```
## Run an experiment
Use `examples/trials/mnist-annotation` as an example. The NNI config YAML file's content is as follows:
```yaml
@@ -88,9 +93,11 @@ paiConfig:
Note: You should set `trainingServicePlatform: pai` in the NNI config YAML file if you want to start the experiment in pai mode. The `host` field in the configuration file is the URI of PAI's job submission page, like `10.10.5.1`. The default protocol in NNI is `http`; if your PAI cluster has enabled HTTPS, please use the URI in the `https://10.10.5.1` format.
### Trial configurations
Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), `trial` configuration in pai mode has the following additional keys:
* cpuNum
@@ -136,6 +143,8 @@ Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMod
2. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI; otherwise there might be conflict errors.
### OpenPAI configurations
`paiConfig` includes OpenPAI specific configurations,
@@ -171,17 +180,23 @@ Notice: In pai mode, NNIManager will start a rest server and listen on a port wh
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view, and click the logPath link like:
<img src="../../img/nni_webui_joblist.jpg" style="zoom: 30%;" />
And you will be redirected to the HDFS web portal to browse the output files of that trial in HDFS:
<img src="../../img/nni_trial_hdfs_output.jpg" style="zoom: 80%;" />
You can see there are three files in the output folder: stderr, stdout, and trial.log.
## Data management
Before using NNI to start your experiment, you should set the corresponding mounted data path on your nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, ...), and the storage used in OpenPAI will be mounted to the container when it starts a job. You should set the OpenPAI storage type via the `paiStorageConfigName` field to choose a storage in OpenPAI. Then you should mount the storage to your nniManager machine and set the `nniManagerNFSMountPath` field in the configuration file; NNI will generate bash files and copy the data in `codeDir` to the `nniManagerNFSMountPath` folder, and then start a trial job. The data in `nniManagerNFSMountPath` will be synced to the OpenPAI storage and mounted to OpenPAI's container. The data path in the container is set in `containerNFSMountPath`; NNI will enter this folder first, and then run scripts to start a trial job.
## Version check
NNI has supported the version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with trialKeeper, and to avoid errors caused by version incompatibility.
Check policy:
@@ -190,4 +205,5 @@ Check policy:
3. Note that the version check feature only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you could not run your experiment and want to know whether it is caused by the version check, you can check the WebUI; there will be an error message about the version check.
<img src="../../img/version_check.png" style="zoom: 80%;" />
@@ -247,6 +247,7 @@ This is suggested when you have limited computational resources but have a relat
* **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will try to maximize metrics. If 'minimize', the tuner will try to minimize metrics.
* **R** (*int, optional, default = 60*) - the maximum budget given to a trial (could be the number of mini-batches or epochs). Each trial should use TRIAL_BUDGET to control how long it runs.
* **eta** (*int, optional, default = 3*) - `(eta-1)/eta` is the proportion of discarded trials.
* **exec_mode** (*serial or parallelism, optional, default = parallelism*) - If 'parallelism', the tuner will use available resources to start a new bucket immediately. If 'serial', the tuner will only start a new bucket after the current bucket is done.
**Example Configuration:**
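A minimal advisor spec, mirroring the one shown in the Hyperband documentation, might look like the following sketch:
```yaml
advisor:
  builtinAdvisorName: Hyperband
  classArgs:
    optimize_mode: maximize
    R: 60
    eta: 3
    exec_mode: parallelism
```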
@@ -7,7 +7,17 @@ Hyperband on NNI
## 2. Implementation with full parallelism
First, this is an example of how to write an autoML algorithm based on MsgDispatcherBase, rather than Tuner and Assessor. Hyperband is implemented in this way because it integrates the functions of both Tuner and Assessor; thus, we call it Advisor.
Second, this implementation fully leverages Hyperband's internal parallelism. Specifically, the next bucket is not started strictly after the current bucket. Instead, it starts when there are available resources. If you want to use full parallelism mode, set `exec_mode` to `parallelism`.
Alternatively, you can set `exec_mode` to `serial` to follow the original algorithm. In this mode, the next bucket starts strictly after the current bucket is done.
`parallelism` mode may lead to multiple unfinished buckets, while under `serial` mode there is at most one unfinished bucket. The advantage of `parallelism` mode is to make full use of resources, which may reduce the experiment duration several times over. The following two pictures show the results of a quick verification using [nas-bench-201](../NAS/Benchmarks.md); the picture above is `parallelism` mode, the picture below is `serial` mode.
![parallelism mode](../../img/hyperband_parallelism.png "parallelism mode")
![serial mode](../../img/hyperband_serial.png "serial mode")
If you want to reproduce these results, refer to the example under `examples/trials/benchmarking/` for details.
## 3. Usage
To use Hyperband, you should add the following spec in your experiment's YAML config file:
@@ -23,6 +33,8 @@ advisor:
eta: 3
#choice: maximize, minimize
optimize_mode: maximize
#choice: serial, parallelism
exec_mode: parallelism
```
Note that once you use an Advisor, you are not allowed to add a Tuner and Assessor spec in the config file. If you use Hyperband, among the hyperparameters (i.e., key-value pairs) received by a trial, there will be one more key called `TRIAL_BUDGET` besides those defined by the user. **By using this `TRIAL_BUDGET`, the trial can control how long it runs**.
@@ -260,6 +260,8 @@ Specifies the platform to run the experiment, including __local__, __remote__, _
* __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), NNI supports kubeflow based on normal Kubernetes and [Azure Kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md)
* __adl__ submit trial jobs to [AdaptDL](https://github.com/petuum/adaptdl), NNI supports AdaptDL on Kubernetes clusters. For details please refer to [AdaptDL Docs](../TrainingService/AdaptDLMode.md)
* TODO: explain frameworkcontroller.
### searchSpacePath
@@ -118,3 +118,4 @@ Due to potential programming changes, the minimum system requirements of NNI may
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
@@ -281,3 +281,4 @@ Below is the status of all trials. Specifically:
* [How to run an experiment on OpenPAI?](../TrainingService/PaiMode.md)
* [How to run an experiment on Kubernetes through Kubeflow?](../TrainingService/KubeflowMode.md)
* [How to run an experiment on Kubernetes through FrameworkController?](../TrainingService/FrameworkControllerMode.md)
* [How to run an experiment on Kubernetes through AdaptDL?](../TrainingService/AdaptDLMode.md)
@@ -8,6 +8,7 @@ Introduction to NNI Training Services
OpenPAI<./TrainingService/PaiMode>
OpenPAI Yarn Mode<./TrainingService/PaiYarnMode>
Kubeflow<./TrainingService/KubeflowMode>
AdaptDL<./TrainingService/AdaptDLMode>
FrameworkController<./TrainingService/FrameworkControllerMode>
DLTS<./TrainingService/DLTSMode>
AML<./TrainingService/AMLMode>