Unverified Commit 1e439e45 authored by kvartet, committed by GitHub

Fix bug in document conversion (#3203)

parent 9520f251
GradientFeatureSelector
-----------------------
The algorithm in GradientFeatureSelector comes from `"Feature Gradients: Scalable Feature Selection via Discrete Relaxation" <https://arxiv.org/pdf/1908.10382.pdf>`__.
The algorithm in GradientFeatureSelector comes from `Feature Gradients: Scalable Feature Selection via Discrete Relaxation <https://arxiv.org/pdf/1908.10382.pdf>`__.
GradientFeatureSelector, a gradient-based search algorithm
for feature selection.
......@@ -90,7 +90,7 @@ And you could reference the examples in ``/examples/feature_engineering/gradient
*
**device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU
**Requirement of ``fit`` FuncArgs**
**Requirement of fit FuncArgs**
*
......@@ -102,6 +102,6 @@ And you could reference the examples in ``/examples/feature_engineering/gradient
*
**groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit. e.g. [0, 0, 1, 2] specifies that the first two columns are part of a group. Its shape is [n_features].
**Requirement of ``get_selected_features`` FuncArgs**
**Requirement of get_selected_features FuncArgs**
For now, the ``get_selected_features`` function has no parameters.
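For a quick sense of how these pieces fit together, here is a minimal usage sketch. The import path and the class name ``FeatureGradientSelector`` are assumptions based on NNI's feature engineering examples, so double-check them against your installed version.

.. code-block:: python

   import numpy as np
   # Assumed import path; verify against your NNI version.
   from nni.feature_engineering.gradient_selector import FeatureGradientSelector

   X = np.random.rand(1000, 20)            # 1000 samples, 20 candidate features
   y = np.random.randint(0, 2, size=1000)  # binary labels

   # 'device' may be set to 'cuda' for a large speed-up, as noted above.
   selector = FeatureGradientSelector(n_features=5, device='cpu')

   # groups=[0, 0, 1, 2, ...] would force grouped columns to be kept together.
   selector.fit(X, y)

   print(selector.get_selected_features())  # indices of the selected columns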
......@@ -49,7 +49,7 @@ If you want to implement a customized feature selector, you need to:
#. Inherit the base FeatureSelector class
#. Implement *fit* and _get_selected*features* function
#. Implement *fit* and _get_selected *features* function
#. Integrate with sklearn (Optional)
Here is an example:
......@@ -64,7 +64,7 @@ Here is an example:
def __init__(self, ...):
...
**2. Implement *fit* and _get_selected*features* Function**
**2. Implement fit and _get_selected features Function**
.. code-block:: python
......
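To make the skeleton concrete, a customized selector might look roughly like the following. This is an illustrative sketch only (it is not the collapsed example above), and the base-class import path is an assumption based on NNI's feature engineering module layout:

.. code-block:: python

   import numpy as np
   from nni.feature_engineering.feature_selector import FeatureSelector  # assumed path

   class CustomizedSelector(FeatureSelector):
       def __init__(self, k=10):
           super().__init__()
           self.k = k
           self.selected_features_ = None

       def fit(self, X, y, **kwargs):
           # Toy criterion: keep the k columns with the highest variance.
           variances = np.asarray(X).var(axis=0)
           self.selected_features_ = np.argsort(variances)[::-1][:self.k]
           return self

       def get_selected_features(self):
           return self.selected_features_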
......@@ -81,9 +81,9 @@ The requirements of return values of ``sample_search()`` and ``sample_final()``
def sample_final(self):
return self.sample_search() # use the same logic here. you can do something different
The complete example of random mutator can be found :githublink:`here <src/sdk/pynni/nni/nas/pytorch/random/mutator.py>`.
The complete example of random mutator can be found :githublink:`here <nni/nas/pytorch/mutator.py>`.
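To make the snippet above concrete, a random ``sample_search()`` typically draws a one-hot mask for every ``LayerChoice`` and a boolean mask for every ``InputChoice``. The sketch below is paraphrased from that idea rather than copied from the linked file, and the import paths are assumptions that may differ between NNI versions:

.. code-block:: python

   import torch
   import torch.nn.functional as F
   from nni.nas.pytorch.mutables import LayerChoice, InputChoice  # assumed paths
   from nni.nas.pytorch.mutator import Mutator

   class MyRandomMutator(Mutator):
       def sample_search(self):
           result = {}
           for mutable in self.mutables:
               if isinstance(mutable, LayerChoice):
                   # Pick one candidate op uniformly at random (one-hot boolean mask).
                   index = torch.randint(high=len(mutable), size=(1,))
                   result[mutable.key] = F.one_hot(index, num_classes=len(mutable)).view(-1).bool()
               elif isinstance(mutable, InputChoice):
                   # Pick n_chosen of the candidate inputs uniformly at random.
                   n_chosen = mutable.n_chosen or mutable.n_candidates
                   chosen = torch.randperm(mutable.n_candidates)[:n_chosen]
                   mask = torch.zeros(mutable.n_candidates, dtype=torch.bool)
                   mask[chosen] = True
                   result[mutable.key] = mask
           return result

       def sample_final(self):
           return self.sample_search()  # use the same logic here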
For advanced usages, e.g., users want to manipulate the way modules in ``LayerChoice`` are executed, they can inherit ``BaseMutator``\ , and overwrite ``on_forward_layer_choice`` and ``on_forward_input_choice``\ , which are the callback implementation of ``LayerChoice`` and ``InputChoice`` respectively. Users can still use property ``mutables`` to get all ``LayerChoice`` and ``InputChoice`` in the model code. For details, please refer to :githublink:`reference <src/sdk/pynni/nni/nas/pytorch>` here to learn more.
For advanced usages, e.g., users want to manipulate the way modules in ``LayerChoice`` are executed, they can inherit ``BaseMutator``\ , and overwrite ``on_forward_layer_choice`` and ``on_forward_input_choice``\ , which are the callback implementation of ``LayerChoice`` and ``InputChoice`` respectively. Users can still use property ``mutables`` to get all ``LayerChoice`` and ``InputChoice`` in the model code. For details, please refer to :githublink:`reference <nni/nas/pytorch/>` here to learn more.
.. tip::
A useful application of random mutator is for debugging. Use
......
......@@ -44,12 +44,13 @@ Please make sure there is at least 10GB free disk space and note that the conver
Example Usages
--------------
Please refer to `examples usages of Benchmarks API <./BenchmarksExample>`__.
Please refer to `examples usages of Benchmarks API <./BenchmarksExample.rst>`__.
NAS-Bench-101
-------------
`Paper link <https://arxiv.org/abs/1902.09635>`__ &nbsp; &nbsp; `Open-source <https://github.com/google-research/nasbench>`__
* `Paper link <https://arxiv.org/abs/1902.09635>`__
* `Open-source <https://github.com/google-research/nasbench>`__
NAS-Bench-101 contains 423,624 unique neural networks, combined with 4 variations in number of epochs (4, 12, 36, 108), each of which is trained 3 times. It is a cell-wise search space, which constructs and stacks a cell by enumerating DAGs with at most 7 operators, and no more than 9 connections. All operators can be chosen from ``CONV3X3_BN_RELU``\ , ``CONV1X1_BN_RELU`` and ``MAXPOOL3X3``\ , except the first operator (always ``INPUT``\ ) and last operator (always ``OUTPUT``\ ).
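Once the benchmark database has been generated, it can be queried programmatically. The function name ``query_nb101_trial_stats`` and its import path below are assumptions based on the Benchmarks API example linked above, so treat this as a sketch:

.. code-block:: python

   import pprint
   # Assumed import path; see the Benchmarks API example for authoritative usage.
   from nni.nas.benchmarks.nasbench101 import query_nb101_trial_stats

   # Query all recorded runs trained for 108 epochs and print the first few.
   for i, stats in enumerate(query_nb101_trial_stats(None, 108)):
       pprint.pprint(stats)
       if i >= 2:
           break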
......@@ -85,7 +86,9 @@ API Documentation
NAS-Bench-201
-------------
`Paper link <https://arxiv.org/abs/2001.00326>`__ &nbsp; &nbsp; `Open-source API <https://github.com/D-X-Y/NAS-Bench-201>`__ &nbsp; &nbsp;\ `Implementations <https://github.com/D-X-Y/AutoDL-Projects>`__
* `Paper link <https://arxiv.org/abs/2001.00326>`__
* `Open-source API <https://github.com/D-X-Y/NAS-Bench-201>`__
* `Implementations <https://github.com/D-X-Y/AutoDL-Projects>`__
NAS-Bench-201 is a cell-wise search space that views nodes as tensors and edges as operators. The search space contains all possible densely-connected DAGs with 4 nodes, resulting in 15,625 candidates in total. Each operator (i.e., edge) is selected from a pre-defined operator set (\ ``NONE``\ , ``SKIP_CONNECT``\ , ``CONV_1X1``\ , ``CONV_3X3`` and ``AVG_POOL_3X3``\ ). Training approaches vary in the dataset used (CIFAR-10, CIFAR-100, ImageNet) and the number of epochs scheduled (12 and 200). Each combination of architecture and training approach is repeated 1 - 3 times with different random seeds.
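Because every candidate is simply an assignment of one operator to each of the 6 edges of the 4-node DAG, an architecture can be written as a small dictionary keyed by edge. The encoding and the query call below follow the same assumed Benchmarks API as the NAS-Bench-101 sketch above:

.. code-block:: python

   import pprint
   from nni.nas.benchmarks.nasbench201 import query_nb201_trial_stats  # assumed path

   # One candidate cell: key 'i_j' is the operator on the edge from node i to node j.
   arch = {
       '0_1': 'avg_pool_3x3',
       '0_2': 'conv_1x1',
       '1_2': 'skip_connect',
       '0_3': 'conv_1x1',
       '1_3': 'skip_connect',
       '2_3': 'skip_connect',
   }

   # All CIFAR-100 runs of this cell trained with the 200-epoch schedule.
   for stats in query_nb201_trial_stats(arch, 200, 'cifar100'):
       pprint.pprint(stats)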
......@@ -113,7 +116,8 @@ API Documentation
NDS
---
`Paper link <https://arxiv.org/abs/1905.13214>`__ &nbsp; &nbsp; `Open-source <https://github.com/facebookresearch/nds>`__
* `Paper link <https://arxiv.org/abs/1905.13214>`__
* `Open-source <https://github.com/facebookresearch/nds>`__
*On Network Design Spaces for Visual Recognition* released trial statistics of over 100,000 configurations (models + hyper-parameters) sampled from multiple model families, including vanilla (feedforward network loosely inspired by VGG), ResNet and ResNeXt (residual basic block and residual bottleneck block) and NAS cells (following the popular designs from NASNet, Amoeba, PNAS, ENAS and DARTS). Most configurations are trained only once with a fixed seed, except a few that are trained twice or three times.
......
.. role:: raw-html(raw)
:format: html
Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search
=======================================================================================
**`[Paper] <https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf>`__ `[Models-Google Drive] <https://drive.google.com/drive/folders/1NLGAbBF9bA1IUAxKlk2VjgRXhr6RHvRW?usp=sharing>`__\ `[Models-Baidu Disk (PWD: wqw6)] <https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g>`__ `[BibTex] <https://scholar.googleusercontent.com/scholar.bib?q=info:ICWVXc_SsKAJ:scholar.google.com/&output=citation&scisdr=CgUmooXfEMfTi0cV5aU:AAGBfm0AAAAAX7sQ_aXoamdKRaBI12tAVN8REq1VKNwM&scisig=AAGBfm0AAAAAX7sQ_RdYtp6BSro3zgbXVJU2MCgsG730&scisf=4&ct=citation&cd=-1&hl=ja>`__** :raw-html:`<br/>`
* `Paper <https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf>`__
* `Models-Google Drive <https://drive.google.com/drive/folders/1NLGAbBF9bA1IUAxKlk2VjgRXhr6RHvRW?usp=sharing>`__
* `Models-Baidu Disk (PWD: wqw6) <https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g>`__
* `BibTex <https://scholar.googleusercontent.com/scholar.bib?q=info:ICWVXc_SsKAJ:scholar.google.com/&output=citation&scisdr=CgUmooXfEMfTi0cV5aU:AAGBfm0AAAAAX7sQ_aXoamdKRaBI12tAVN8REq1VKNwM&scisig=AAGBfm0AAAAAX7sQ_RdYtp6BSro3zgbXVJU2MCgsG730&scisf=4&ct=citation&cd=-1&hl=ja>`__
In this work, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. The discovered architectures achieve superior performance compared to the recent `MobileNetV3 <https://arxiv.org/abs/1905.02244>`__ and `EfficientNet <https://arxiv.org/abs/1905.11946>`__ families under aligned settings.
:raw-html:`<div ><img src="https://github.com/microsoft/Cream/blob/main/demo/intro.jpg" width="800"/></div>`
.. image:: https://raw.githubusercontent.com/microsoft/Cream/main/demo/intro.jpg
Reproduced Results
------------------
......@@ -44,13 +44,11 @@ The training with 16 Gpus is a little bit superior than 8 Gpus, as below.
.. raw:: html
<table style="border: none">
<th><img src="./../../img/cream_flops100.jpg" alt="drawing" width="400"/></th>
<th><img src="./../../img/cream_flops600.jpg" alt="drawing" width="400"/></th>
</table>
.. image:: ../../img/cream_flops100.jpg
:scale: 50%
.. image:: ../../img/cream_flops600.jpg
:scale: 50%
Examples
--------
......@@ -62,7 +60,7 @@ Please run the following scripts in the example folder.
Data Preparation
----------------
You need to first download the `ImageNet-2012 <http://www.image-net.org/>`__ to the folder ``./data/imagenet`` and move the validation set to the subfolder ``./data/imagenet/val``. To move the validation set, you could use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
You need to first download the `ImageNet-2012 <http://www.image-net.org/>`__ to the folder ``./data/imagenet`` and move the validation set to the subfolder ``./data/imagenet/val``. To move the validation set, you could use `the following script <https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh>`__.
Put the ImageNet data in ``./data``. It should look like the following:
......@@ -75,7 +73,7 @@ Put the imagenet data in ``./data``. It should be like following:
Quick Start
-----------
I. Search
1. Search
^^^^^^^^^
First, build environments for searching.
......@@ -105,7 +103,7 @@ After you specify the flops of the architectures you would like to search, you c
The searched architectures need to be retrained to obtain the final model. The final model is saved in ``.pth.tar`` format. Retraining code will be released soon.
II. Retrain
2. Retrain
^^^^^^^^^^^
To train the searched architectures, you need to configure the parameter ``MODEL_SELECTION`` to specify the model FLOPs. To specify which model to train, you should add ``MODEL_SELECTION`` in ``./configs/retrain.yaml``. You can select one from [14,43,112,287,481,604], which stand for different FLOPs (MB).
......@@ -130,7 +128,7 @@ After adding ``MODEL_SELECTION`` in ``./configs/retrain.yaml``\ , you need to us
python -m torch.distributed.launch --nproc_per_node=8 ./retrain.py --cfg ./configs/retrain.yaml
III. Test
3. Test
^^^^^^^^^
To test our trained models, you need to use ``MODEL_SELECTION`` in ``./configs/test.yaml`` to specify which model to test.
......
......@@ -39,8 +39,8 @@ Reference
PyTorch
^^^^^^^
.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasTrainer
.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasTrainer
:members:
.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasMutator
.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasMutator
:members:
......@@ -29,7 +29,7 @@ The procedure of classic NAS algorithms is similar to hyper-parameter tuning, us
- Brief Introduction of Algorithm
* - :githublink:`Random Search <examples/tuners/random_nas_tuner>`
- Randomly pick a model from search space
* - `PPO Tuner </Tuner/BuiltinTuner.html#PPOTuner>`__
* - `PPO Tuner <../Tuner/BuiltinTuner.rst#PPO-Tuner>`__
- PPO Tuner is a Reinforcement Learning tuner based on PPO algorithm. `Reference Paper <https://arxiv.org/abs/1707.06347>`__
......@@ -46,19 +46,19 @@ NNI currently supports the one-shot NAS algorithms listed below and is adding mo
* - Name
- Brief Introduction of Algorithm
* - `ENAS </NAS/ENAS.html>`__
* - `ENAS <ENAS.rst>`__
- `Efficient Neural Architecture Search via Parameter Sharing <https://arxiv.org/abs/1802.03268>`__. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. It uses parameter sharing between child models to achieve fast speed and excellent performance.
* - `DARTS </NAS/DARTS.html>`__
* - `DARTS <DARTS.rst>`__
- `DARTS: Differentiable Architecture Search <https://arxiv.org/abs/1806.09055>`__ introduces a novel algorithm for differentiable network architecture search on bilevel optimization.
* - `P-DARTS </NAS/PDARTS.html>`__
* - `P-DARTS <PDARTS.rst>`__
- `Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation <https://arxiv.org/abs/1904.12760>`__ is based on DARTS. It introduces an efficient algorithm which allows the depth of searched architectures to grow gradually during the training procedure.
* - `SPOS </NAS/SPOS.html>`__
* - `SPOS <SPOS.rst>`__
- `Single Path One-Shot Neural Architecture Search with Uniform Sampling <https://arxiv.org/abs/1904.00420>`__ constructs a simplified supernet trained with a uniform path sampling method and applies an evolutionary algorithm to efficiently search for the best-performing architectures.
* - `CDARTS </NAS/CDARTS.html>`__
* - `CDARTS <CDARTS.rst>`__
- `Cyclic Differentiable Architecture Search <https://arxiv.org/abs/****>`__ builds a cyclic feedback mechanism between the search and evaluation networks. It introduces a cyclic differentiable architecture search framework which integrates the two networks into a unified architecture.
* - `ProxylessNAS </NAS/Proxylessnas.html>`__
* - `ProxylessNAS <Proxylessnas.rst>`__
- `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/abs/1812.00332>`__. It removes proxy, directly learns the architectures for large-scale target tasks and target hardware platforms.
* - `TextNAS </NAS/TextNAS.html>`__
* - `TextNAS <TextNAS.rst>`__
- `TextNAS: A Neural Architecture Search Space tailored for Text Representation <https://arxiv.org/pdf/1912.10729.pdf>`__. It is a neural architecture search algorithm tailored for text representation.
......
......@@ -56,8 +56,7 @@ Implementation
The implementation on NNI is based on the `official implementation <https://github.com/mit-han-lab/ProxylessNAS>`__. The official implementation supports two training approaches, gradient descent and RL based, and supports different target hardware, including 'mobile', 'cpu', 'gpu8', 'flops'. Our current implementation on NNI supports the gradient descent training approach, but does not yet support different hardware. The complete support is ongoing.
Below we will describe implementation details. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. For users to flexibly define their own search space and use built-in ProxylessNAS training approach, we put the specified search space in :githublink:`example code <examples/nas/proxylessnas>` using :githublink:`NNI NAS interface <src/sdk/pynni/nni/nas/pytorch/proxylessnas>`.
Below we will describe implementation details. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. For users to flexibly define their own search space and use built-in ProxylessNAS training approach, we put the specified search space in :githublink:`example code <examples/nas/proxylessnas>` using :githublink:`NNI NAS interface <nni/algorithms/nas/pytorch/proxylessnas>`.
.. image:: ../../img/proxylessnas.png
:target: ../../img/proxylessnas.png
......
......@@ -4,7 +4,7 @@ NAS Visualization (Experimental)
Built-in Trainers Support
-------------------------
Currently, only ENAS and DARTS support visualization. Examples of `ENAS <./ENAS.md>`__ and `DARTS <./DARTS.rst>`__ has demonstrated how to enable visualization in your code, namely, adding this before ``trainer.train()``\ :
Currently, only ENAS and DARTS support visualization. Examples of `ENAS <./ENAS.rst>`__ and `DARTS <./DARTS.rst>`__ have demonstrated how to enable visualization in your code, namely, adding this before ``trainer.train()``\ :
.. code-block:: python
......
Write A .. role:: raw-html(raw)
:format: html
Search Space
Write A Search Space
====================
Generally, a search space describes candidate architectures from which users want to find the best one. Different search algorithms, whether classic NAS or one-shot NAS, can be applied to the search space. NNI provides APIs to unify the expression of neural architecture search spaces.
......@@ -61,10 +58,10 @@ So how about the possibilities of connections? This can be done using ``InputCho
# ... same ...
return output
Input choice can be thought of as a callable module that receives a list of tensors and outputs the concatenation/sum/mean of some of them (sum by default), or ``None`` if none is selected. Like layer choices, input choices should be **initialized in ``__init__`` and called in ``forward``**. This is to allow search algorithms to identify these choices and do necessary preparations.
Input choice can be thought of as a callable module that receives a list of tensors and outputs the concatenation/sum/mean of some of them (sum by default), or ``None`` if none is selected. Like layer choices, input choices should be initialized in ``__init__`` and called in ``forward``. This is to allow search algorithms to identify these choices and do necessary preparations.
``LayerChoice`` and ``InputChoice`` are both **mutables**. Mutable means "changeable". As opposed to traditional deep learning layers/modules which have fixed operation types once defined, models with mutable are essentially a series of possible models.
Users can specify a **key** for each mutable. By default, NNI will assign one for you that is globally unique, but in case users want to share choices (for example, there are two ``LayerChoice``\ s with the same candidate operations and you want them to have the same choice, i.e., if first one chooses the i-th op, the second one also chooses the i-th op), they can give them the same key. The key marks the identity for this choice and will be used in the dumped checkpoint. So if you want to increase the readability of your exported architecture, manually assigning keys to each mutable would be a good idea. For advanced usage on mutables (e.g., ``LayerChoice`` and ``InputChoice``\ ), see `Mutables <./NasReference.rst>`__.
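For example, a sketch of key sharing with the mutable API described above (the module internals are arbitrary; only the ``LayerChoice``/``InputChoice`` usage matters, and the import path is the one used elsewhere in these docs):

.. code-block:: python

   import torch.nn as nn
   from nni.nas.pytorch import mutables

   class Cell(nn.Module):
       def __init__(self, channels, shared_key=None):
           super().__init__()
           # Two Cell instances constructed with the same shared_key are forced
           # to pick the same candidate operator.
           self.op = mutables.LayerChoice([
               nn.Conv2d(channels, channels, 3, padding=1),
               nn.MaxPool2d(3, stride=1, padding=1),
           ], key=shared_key)
           # Choose exactly one of two inputs, e.g. the previous cell or a skip input.
           self.input_switch = mutables.InputChoice(n_candidates=2, n_chosen=1)

       def forward(self, prev, skip):
           x = self.input_switch([prev, skip])
           return self.op(x)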
With search space defined, the next step is searching for the best model from it. Please refer to `classic NAS algorithms <./ClassicNas.md>`__ and `one-shot NAS algorithms <./NasGuide.rst>`__ for how to search from your defined search space.
With search space defined, the next step is searching for the best model from it. Please refer to `classic NAS algorithms <./ClassicNas.rst>`__ and `one-shot NAS algorithms <./NasGuide.rst>`__ for how to search from your defined search space.
......@@ -77,7 +77,7 @@ NNI also provides algorithm toolkits for machine learning and deep learning, esp
Hyperparameter Tuning
^^^^^^^^^^^^^^^^^^^^^
This is a core and basic feature of NNI, we provide many popular `automatic tuning algorithms <Tuner/BuiltinTuner.md>`__ (i.e., tuner) and `early stop algorithms <Assessor/BuiltinAssessor.md>`__ (i.e., assessor). You can follow `Quick Start <Tutorial/QuickStart.rst>`__ to tune your model (or system). Basically, there are the above three steps and then starting an NNI experiment.
This is a core and basic feature of NNI; we provide many popular `automatic tuning algorithms <Tuner/BuiltinTuner.rst>`__ (i.e., tuners) and `early stop algorithms <Assessor/BuiltinAssessor.rst>`__ (i.e., assessors). You can follow the `Quick Start <Tutorial/QuickStart.rst>`__ to tune your model (or system). Basically, you follow the above three steps and then start an NNI experiment.
General NAS Framework
^^^^^^^^^^^^^^^^^^^^^
......
This diff is collapsed.
......@@ -11,44 +11,30 @@ Supported AI Frameworks
-----------------------
* :raw-html:`<b>[PyTorch]</b>` https://github.com/pytorch/pytorch
* `PyTorch <https://github.com/pytorch/pytorch>`__
.. raw:: html
* :githublink:`MNIST-pytorch <examples/trials/mnist-distributed-pytorch>`
* `CIFAR-10 <./TrialExample/Cifar10Examples.rst>`__
* :githublink:`TGS salt identification challenge <examples/trials/kaggle-tgs-salt/README.md>`
* :githublink:`Network_morphism <examples/trials/network_morphism/README.md>`
<ul>
<li><a href="../../examples/trials/mnist-distributed-pytorch">MNIST-pytorch</a><br/></li>
<li><a href="TrialExample/Cifar10Examples.md">CIFAR-10</a><br/></li>
<li><a href="../../examples/trials/kaggle-tgs-salt/README.md">TGS salt identification chanllenge</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* `TensorFlow <https://github.com/tensorflow/tensorflow>`__
* :githublink:`MNIST-tensorflow <examples/trials/mnist-distributed>`
* :githublink:`Squad <examples/trials/ga_squad/README.md>`
* :raw-html:`<b>[TensorFlow]</b>` https://github.com/tensorflow/tensorflow
* `Keras <https://github.com/keras-team/keras>`__
.. raw:: html
* :githublink:`MNIST-keras <examples/trials/mnist-keras>`
* :githublink:`Network_morphism <examples/trials/network_morphism/README.md>`
<ul>
<li><a href="../../examples/trials/mnist-distributed">MNIST-tensorflow</a><br/></li>
<li><a href="../../examples/trials/ga_squad/README.md">Squad</a><br/></li>
</ul>
* :raw-html:`<b>[Keras]</b>` https://github.com/keras-team/keras
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-keras">MNIST-keras</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* :raw-html:`<b>[MXNet]</b>` https://github.com/apache/incubator-mxnet
* :raw-html:`<b>[Caffe2]</b>` https://github.com/BVLC/caffe
* :raw-html:`<b>[CNTK (Python language)]</b>` https://github.com/microsoft/CNTK
* :raw-html:`<b>[Spark MLlib]</b>` http://spark.apache.org/mllib/
* :raw-html:`<b>[Chainer]</b>` https://chainer.org/
* :raw-html:`<b>[Theano]</b>` https://pypi.org/project/Theano/ :raw-html:`<br/>`
* `MXNet <https://github.com/apache/incubator-mxnet>`__
* `Caffe2 <https://github.com/BVLC/caffe>`__
* `CNTK (Python language) <https://github.com/microsoft/CNTK>`__
* `Spark MLlib <http://spark.apache.org/mllib/>`__
* `Chainer <https://chainer.org/>`__
* `Theano <https://pypi.org/project/Theano/>`__
You are encouraged to `contribute more examples <Tutorial/Contributing.rst>`__ for other NNI users.
......@@ -58,22 +44,16 @@ Supported Library
NNI also supports all libraries written in Python. Here are some common libraries, including some GBDT-based algorithms: XGBoost, CatBoost and LightGBM.
* :raw-html:`<b>[Scikit-learn]</b>` https://scikit-learn.org/stable/
.. raw:: html
* `Scikit-learn <https://scikit-learn.org/stable/>`__
<ul>
<li><a href="TrialExample/SklearnExamples.md">Scikit-learn</a><br/></li>
</ul>
* `Scikit-learn <TrialExample/SklearnExamples.rst>`__
* `XGBoost <https://xgboost.readthedocs.io/en/latest/>`__
* `CatBoost <https://catboost.ai/>`__
* `LightGBM <https://lightgbm.readthedocs.io/en/latest/>`__
* :raw-html:`<b>[XGBoost]</b>` https://xgboost.readthedocs.io/en/latest/
* :raw-html:`<b>[CatBoost]</b>` https://catboost.ai/
* :raw-html:`<b>[LightGBM]</b>` https://lightgbm.readthedocs.io/en/latest/
:raw-html:`<ul>
<li><a href="TrialExample/GbdtExample.md">Auto-gbdt</a><br/></li>
</ul>`
* `Auto-gbdt <TrialExample/GbdtExample.rst>`__
Here is just a small list of libraries supported by NNI. If you are interested in NNI, you can refer to the `tutorial <TrialExample/Trials.rst>`__ to complete your own hacks.
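Because a trial is just a Python script, plugging any of these libraries into NNI follows the same pattern: read the hyper-parameters chosen by the tuner and report back a metric. The snippet below is a generic sketch (the search-space keys ``num_leaves`` and ``learning_rate`` are illustrative, not taken from a shipped example):

.. code-block:: python

   import nni
   import lightgbm as lgb
   from sklearn.datasets import load_breast_cancer
   from sklearn.metrics import accuracy_score
   from sklearn.model_selection import train_test_split

   # Hyper-parameters chosen by the tuner for this trial.
   params = nni.get_next_parameter()

   X, y = load_breast_cancer(return_X_y=True)
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

   model = lgb.LGBMClassifier(num_leaves=params.get('num_leaves', 31),
                              learning_rate=params.get('learning_rate', 0.1))
   model.fit(X_train, y_train)

   # Metric that the tuner will try to optimize.
   nni.report_final_result(accuracy_score(y_test, model.predict(X_test)))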
In addition to the above examples, we also welcome more and more users to apply NNI to your own work, if you have any doubts, please refer `Write a Trial Run on NNI <TrialExample/Trials.md>`__. In particular, if you want to be a contributor of NNI, whether it is the sharing of examples , writing of Tuner or otherwise, we are all looking forward to your participation.More information please refer to `here <Tutorial/Contributing.rst>`__.
In addition to the above examples, we also welcome more users to apply NNI to their own work; if you have any doubts, please refer to `Write a Trial Run on NNI <TrialExample/Trials.rst>`__. In particular, if you want to become a contributor to NNI, whether by sharing examples, writing tuners or otherwise, we look forward to your participation. For more information, please refer to `here <Tutorial/Contributing.rst>`__.
......@@ -81,9 +81,8 @@ Compared with `LocalMode <LocalMode.rst>`__ trial configuration in aml mode have
* image
* Required key. The docker image name used in the job. NNI supports the image ``msranni/nni`` for running AML jobs.
.. code-block:: bash
Note: This image is build based on cuda environment, may not be suitable for CPU clusters in AML.
.. Note:: This image is built on a CUDA environment and may not be suitable for CPU clusters in AML.
amlConfig:
......
......@@ -11,7 +11,7 @@ Prerequisite for Kubernetes Service
#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Helm install **AdaptDL Scheduler** to your Kubernetes cluster. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to setup AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the** KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. (Optional) Prepare a **NFS server** and export a general purpose mount as external storage.
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
......@@ -76,7 +76,7 @@ Here is a template configuration specification to use AdaptDL as a training serv
storageSize: 1Gi
Those configs not mentioned below are following the
`default specs defined in the NNI doc </Tutorial/ExperimentConfig.html#configuration-spec>`__.
`default specs defined </Tutorial/ExperimentConfig.rst#configuration-spec>`__ in the NNI doc.
* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with AdaptDL scheduler.
......
......@@ -18,7 +18,7 @@ Prerequisite for on-premises Kubernetes Service
apt-get install nfs-common
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart>`__.
7. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
......@@ -101,7 +101,7 @@ If you use Azure Kubernetes Service, you should set ``frameworkcontrollerConfig
Note: You should explicitly set ``trainingServicePlatform: frameworkcontroller`` in NNI config YAML file if you want to start experiment in frameworkcontrollerConfig mode.
The trial's config format for NNI frameworkcontroller mode is a simple version of FrameworkController's official config, you could refer the `Tensorflow example of FrameworkController <https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for deep understanding.
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config; you can refer to the `Tensorflow example of FrameworkController <https://github.com/microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for a deeper understanding.
Trial configuration in frameworkcontroller mode have the following configuration keys:
......@@ -115,7 +115,7 @@ Trial configuration in frameworkcontroller mode have the following configuration
* cpuNum: the number of cpu devices used in the container.
* memoryMB: the memory limitation to be specified in the container.
* image: the docker image used to create pod and run the program.
* frameworkAttemptCompletionPolicy: the policy to run framework, please refer the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.rst#frameworkattemptcompletionpolicy>`__ to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, The completion policy could helps stop ps.
* frameworkAttemptCompletionPolicy: the policy to run the framework; please refer to the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy>`__ for the specific information. Users can use the policy to control the pods; for example, if the ps does not stop but the workers do, the completion policy can help stop the ps.
How to run example
------------------
......
......@@ -3,13 +3,15 @@
Running NNI in heterogeneous mode means that NNI will run trial jobs on multiple kinds of training platforms. For example, NNI could submit trial jobs to a remote machine and AML simultaneously.
## Setup environment
NNI has supported [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md) and [AML](./AMLMode.md) for heterogeneous training service. Before starting an experiment using these mode, users should setup the corresponding environment for the platforms. More details about the environment setup could be found in the corresponding docs.
Setup environment
-----------------
NNI supports `local <./LocalMode.rst>`__\ , `remote <./RemoteMachineMode.rst>`__\ , `PAI <./PaiMode.rst>`__\ , and `AML <./AMLMode.rst>`__ for the heterogeneous training service. Before starting an experiment using these modes, users should set up the corresponding environment for the platforms. More details about the environment setup can be found in the corresponding docs.
Run an experiment
-----------------
## Run an experiment
Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's content is like:
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
......@@ -45,8 +47,8 @@ Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's conte
Configurations for heterogeneous mode:
heterogeneousConfig:
* trainingServicePlatforms. required key. This field specify the platforms used in heterogeneous mode, the values using yaml list format. NNI support setting `local`, `remote`, `aml`, `pai` in this field.
* trainingServicePlatforms. Required key. This field specifies the platforms used in heterogeneous mode, with the values given in YAML list format. NNI supports setting ``local``, ``remote``, ``aml``, ``pai`` in this field.
Note:
If setting a platform in trainingServicePlatforms mode, users should also set the corresponding configuration for the platform. For example, if set `remote` as one of the platform, should also set `machineList` and `remoteConfig` configuration.
.. Note:: If a platform is set in trainingServicePlatforms, users should also set the corresponding configuration for that platform. For example, if ``remote`` is set as one of the platforms, users should also set the ``machineList`` and ``remoteConfig`` configurations.
......@@ -15,7 +15,7 @@ System architecture
:alt:
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports `local platfrom <LocalMode.md>`__\ , `remote platfrom <RemoteMachineMode.md>`__\ , `PAI platfrom <PaiMode.md>`__\ , `kubeflow platform <KubeflowMode.md>`__ and `FrameworkController platfrom <FrameworkControllerMode.rst>`__.
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs; it communicates with the NNIManager module, and has a different instance for each training platform. For the time being, NNI supports the `local platform <LocalMode.rst>`__\ , `remote platform <RemoteMachineMode.rst>`__\ , `PAI platform <PaiMode.rst>`__\ , `kubeflow platform <KubeflowMode.rst>`__ and `FrameworkController platform <FrameworkControllerMode.rst>`__.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class that implements TrainingService; they don't need to understand the code details of NNIManager, Dispatcher or other modules.
......
......@@ -11,16 +11,16 @@ Prerequisite for on-premises Kubernetes Service
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes
#. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to setup Kubeflow.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the**KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in ``root_squash option``\ , otherwise permission issue may raise when NNI copy files to NFS. Refer this `page <https://linux.die.net/man/5/exports>`__ to learn what root_squash option is), or**Azure File Storage**.
#. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the ``root_squash`` option, otherwise permission issues may arise when NNI copies files to NFS. Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or **Azure File Storage**.
#. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart>`__.
7. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
......@@ -231,6 +231,8 @@ Trial configuration in kubeflow mode have the following configuration keys:
* Required key. The API version of your Kubeflow.
.. cannot find :githublink:`msranni/nni <deployment/docker/Dockerfile>`
* ps (optional). This config section is used to configure Tensorflow parameter server role.
* master (optional). This config section is used to configure the PyTorch parameter server role.
......
......@@ -6,14 +6,14 @@ What is Training Service?
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.md>`__\ , `remote machines <./RemoteMachineMode.md>`__\ , and on clusters like `PAI <./PaiMode.md>`__\ , `Kubeflow <./KubeflowMode.md>`__\ , `AdaptDL <./AdaptDLMode.md>`__\ , `FrameworkController <./FrameworkControllerMode.md>`__\ , `DLTS <./DLTSMode.md>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.rst>`__\ , `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to "\ `how to implement training service <./HowToImplementTrainingService>`__\ " for details.
If the computing resource you want to use is not listed above, NNI provides an interface that allows users to build their own training service easily. Please refer to `how to implement training service <./HowToImplementTrainingService.rst>`__ for details.
How to use Training Service?
----------------------------
Training service needs to be chosen and configured properly in experiment configuration YAML file. Users could refer to the document of each training service for how to write the configuration. Also, `reference <../Tutorial/ExperimentConfig>`__ provides more details on the specification of the experiment configuration file.
A training service needs to be chosen and configured properly in the experiment configuration YAML file. Users can refer to the document of each training service for how to write the configuration. Also, the `reference <../Tutorial/ExperimentConfig.rst>`__ provides more details on the specification of the experiment configuration file.
Next, users should prepare code directory, which is specified as ``codeDir`` in config file. Please note that in non-local mode, the code directory will be uploaded to remote or cluster before the experiment. Therefore, we limit the number of files to 2000 and total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see :githublink:`this example <examples/trials/mnist-tfv1/.nniignore>` and the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
......@@ -28,21 +28,21 @@ Built-in Training Services
* - TrainingService
- Brief Introduction
* - `**Local** <./LocalMode.rst>`__
* - `Local <./LocalMode.rst>`__
- NNI supports running an experiment on the local machine, called local mode. Local mode means that NNI will run the trial jobs and the nniManager process on the same machine, and supports GPU scheduling for trial jobs.
* - `**Remote** <./RemoteMachineMode.rst>`__
* - `Remote <./RemoteMachineMode.rst>`__
- NNI supports running an experiment on multiple machines through an SSH channel, called remote mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code. NNI will submit the trial jobs to the remote machines, and schedule a suitable machine with enough GPU resources if specified.
* - `**PAI** <./PaiMode.rst>`__
* - `PAI <./PaiMode.rst>`__
- NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__ (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.
* - `**Kubeflow** <./KubeflowMode.rst>`__
* - `Kubeflow <./KubeflowMode.rst>`__
- NNI supports running experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.
* - `**AdaptDL** <./AdaptDLMode.rst>`__
* - `AdaptDL <./AdaptDLMode.rst>`__
- NNI supports running experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__\ , called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.
* - `**FrameworkController** <./FrameworkControllerMode.rst>`__
* - `FrameworkController <./FrameworkControllerMode.rst>`__
- NNI supports running experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.
* - `**DLTS** <./DLTSMode.rst>`__
* - `DLTS <./DLTSMode.rst>`__
- NNI supports running experiment using `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.
* - `**AML** <./AMLMode.rst>`__
* - `AML <./AMLMode.rst>`__
- NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__ , called aml mode.
......@@ -57,7 +57,7 @@ What does Training Service do?
</p>
According to the architecture shown in `Overview <../Overview>`__\ , training service (platform) is actually responsible for two events: 1) initiating a new trial; 2) collecting metrics and communicating with NNI core (NNI manager); 3) monitoring trial job status. To demonstrated in detail how training service works, we show the workflow of training service from the very beginning to the moment when first trial succeeds.
According to the architecture shown in `Overview <../Overview.rst>`__\ , the training service (platform) is actually responsible for three things: 1) initiating a new trial; 2) collecting metrics and communicating with the NNI core (NNI manager); 3) monitoring trial job status. To demonstrate in detail how a training service works, we show the workflow of the training service from the very beginning to the moment when the first trial succeeds.
Step 1. **Validate config and prepare the training platform.** The training service will first check whether the training platform the user specifies is valid (e.g., is there anything wrong with authentication). After that, the training service will start to prepare for the experiment by making the code directory (\ ``codeDir``\ ) accessible to the training platform.
......