-f https://download.pytorch.org/whl/torch_stable.html
tensorflow == 1.15.4
torch == 1.7.1+cpu
torchvision == 0.8.2+cpu
# It will install pytorch-lightning 0.8.x and unit tests won't work.
# Latest version has conflict with tensorboard and tensorflow 1.x.
pytorch-lightning
torchmetrics
keras == 2.1.6
onnx
peewee
graphviz
gym
tianshou >= 0.4.1
matplotlib < 3.4
astor
hyperopt == 0.1.2
json_tricks >= 3.15.5
psutil
pyyaml >= 5.4
requests
responses ; python_version >= "3.7"
responses < 0.18 ; python_version < "3.7"
schema
typeguard
PythonWebHDFS
colorama
scikit-learn >= 0.24.1 ; python_version >= "3.7"
scikit-learn < 1.0 ; python_version < "3.7"
websockets >= 10.1 ; python_version >= "3.7"
websockets <= 10.0 ; python_version < "3.7"
filelock ; python_version >= "3.7"
filelock < 3.4 ; python_version < "3.7"
prettytable
cloudpickle
dataclasses ; python_version < "3.7"
typing_extensions ; python_version < "3.8"
numpy < 1.19.4 ; sys_platform == "win32"
numpy < 1.20 ; sys_platform != "win32" and python_version < "3.7"
numpy ; sys_platform != "win32" and python_version >= "3.7"
scipy < 1.6 ; python_version < "3.7"
scipy ; python_version >= "3.7"
pandas < 1.2 ; python_version < "3.7"
pandas ; python_version >= "3.7"
# the following content will be read by setup.py.
# please follow the logic in setup.py.
# SMAC
ConfigSpaceNNI
smac4nni
# BOHB
ConfigSpace>=0.4.11
statsmodels>=0.12.0
# PPOTuner
gym
# DNGO
pybnn
pip
wheel
setuptools
.. role:: raw-html(raw)
:format: html
Built-in Assessors
==================
NNI provides state-of-the-art early stopping algorithms as built-in assessors and makes them easy to use. Below is a brief overview of NNI's current built-in assessors.
Note: Click the **Assessor's name** to get each Assessor's installation requirements, suggested usage scenario, and a config example. A link to a detailed description of each algorithm is provided at the end of the suggested scenario for each Assessor.
Currently, we support the following Assessors:
.. list-table::
:header-rows: 1
:widths: auto
* - Assessor
- Brief Introduction of Algorithm
* - `Medianstop <#MedianStop>`__
- Medianstop is a simple early stopping rule. It stops a pending trial X at step S if the trial’s best objective value by step S is strictly worse than the median value of the running averages of all completed trials’ objectives reported up to step S. `Reference Paper <https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf>`__
* - `Curvefitting <#Curvefitting>`__
- Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the predicted final-epoch performance is worse than the best final performance in the trial history. In this algorithm, we use 12 curves to fit the accuracy curve. `Reference Paper <http://aad.informatik.uni-freiburg.de/papers/15-IJCAI-Extrapolation_of_Learning_Curves.pdf>`__
Usage of Builtin Assessors
--------------------------
Usage of builtin assessors provided by the NNI SDK requires one to declare the **builtinAssessorName** and **classArgs** in the ``config.yml`` file. In this part, we will introduce the details of usage and the suggested scenarios, classArg requirements, and an example for each assessor.
Note: Please follow the provided format when writing your ``config.yml`` file.
:raw-html:`<a name="MedianStop"></a>`
Median Stop Assessor
^^^^^^^^^^^^^^^^^^^^
..
Builtin Assessor Name: **Medianstop**
**Suggested scenario**
It is applicable to a wide range of performance curves; thus, it can be used in various scenarios to speed up the tuning progress. `Detailed Description <./MedianstopAssessor.rst>`__
**classArgs requirements:**
* **optimize_mode** (*maximize or minimize, optional, default = maximize*\ ) - If 'maximize', the assessor will **stop** trials with a smaller expected final result. If 'minimize', the assessor will **stop** trials with a larger expected final result.
* **start_step** (*int, optional, default = 0*\ ) - A trial is only judged for early stopping after it has reported at least start_step intermediate results.
**Usage example:**
.. code-block:: yaml
# config.yml
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
start_step: 5
:raw-html:`<br>`
:raw-html:`<a name="Curvefitting"></a>`
Curve Fitting Assessor
^^^^^^^^^^^^^^^^^^^^^^
..
Builtin Assessor Name: **Curvefitting**
**Suggested scenario**
It is applicable to a wide range of performance curves; thus, it can be used in various scenarios to speed up the tuning progress. Even better, it is able to handle and assess curves with similar performance. `Detailed Description <./CurvefittingAssessor.rst>`__
**Note**\ , according to the original paper, only incremental functions are supported. Therefore this assessor can only be used to maximize optimization metrics. For example, it can be used for accuracy, but not for loss.
**classArgs requirements:**
* **epoch_num** (*int, required*\ ) - The total number of epochs. We need to know the number of epochs to determine which points we need to predict.
* **start_step** (*int, optional, default = 6*\ ) - A trial is only judged for early stopping after it has reported at least start_step intermediate results.
* **threshold** (*float, optional, default = 0.95*\ ) - The threshold used to decide whether to early stop the worst performing curves. For example: if threshold = 0.95 and the best performance in the history is 0.9, then we will stop any trial whose predicted value is lower than 0.95 * 0.9 = 0.855.
* **gap** (*int, optional, default = 1*\ ) - The interval between assessor judgements. For example: if gap = 2 and start_step = 6, then we will assess the result when we receive the 6th, 8th, 10th, 12th... intermediate results.
**Usage example:**
.. code-block:: yaml
# config.yml
assessor:
builtinAssessorName: Curvefitting
classArgs:
epoch_num: 20
start_step: 6
threshold: 0.95
gap: 1
Curve Fitting Assessor on NNI
=============================
Introduction
------------
The Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the prediction of the final epoch's performance is worse than the best final performance in the trial history.
In this algorithm, we use 12 curves to fit the learning curve. The set of parametric curve models are chosen from this `reference paper <http://aad.informatik.uni-freiburg.de/papers/15-IJCAI-Extrapolation_of_Learning_Curves.pdf>`__. The learning curves' shape coincides with our prior knowledge about the form of learning curves: They are typically increasing, saturating functions.
.. image:: ../../img/curvefitting_learning_curve.PNG
:target: ../../img/curvefitting_learning_curve.PNG
:alt: learning_curve
We combine all learning curve models into a single, more powerful model. This combined model is given by a weighted linear combination:
.. image:: ../../img/curvefitting_f_comb.gif
:target: ../../img/curvefitting_f_comb.gif
:alt: f_comb
with the new combined parameter vector
.. image:: ../../img/curvefitting_expression_xi.gif
:target: ../../img/curvefitting_expression_xi.gif
:alt: expression_xi
We assume additive Gaussian noise, and the noise parameter is initialized to its maximum likelihood estimate.
We determine the most probable value of the new combined parameter vector by learning from the historical data, and use this value to predict future trial performance and stop unpromising trials early to save computing resources.
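Restated in text (a reconstruction following the reference paper's notation, not a copy of the rendered images above), the combined model with :math:`K = 12` basis curves is

.. math::

   f_{\mathrm{comb}}(x \mid \xi) = \sum_{k=1}^{K} w_k f_k(x \mid \theta_k), \qquad
   \xi = (w_1, \ldots, w_K, \theta_1, \ldots, \theta_K, \sigma^2),

and an observed intermediate result at step :math:`t` is modeled as :math:`y_t = f_{\mathrm{comb}}(t \mid \xi) + \varepsilon` with :math:`\varepsilon \sim \mathcal{N}(0, \sigma^2)`.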
Concretely, this algorithm goes through three stages of learning, predicting, and assessing.
*
Step1: Learning. We learn from the trial history of the current trial and determine \xi from a Bayesian perspective. First, we fit each curve using the least-squares method, implemented by ``fit_theta``. After obtaining the parameters, we filter the curves and remove outliers, implemented by ``filter_curve``. Finally, we use MCMC sampling, implemented by ``mcmc_sampling``\ , to adjust the weight of each curve. At this point, we have determined all the parameters in \xi.
*
Step2: Predicting. We calculate the expected final accuracy, implemented by ``f_comb``\ , at the target position (i.e., the total number of epochs) using \xi and the formula of the combined model.
*
Step3: Assessing. If the fitting does not converge, the predicted value is ``None``\ ; in this case we return ``AssessResult.Good`` to ask for more intermediate results and predict again later. Otherwise, we get a positive value from the ``predict()`` function. If this value is strictly greater than the best final performance in history * ``THRESHOLD``\ (default value = 0.95), we return ``AssessResult.Good``\ ; otherwise, we return ``AssessResult.Bad``.
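The assessing logic of Step 3 can be sketched in a few lines of Python. This is an illustrative simplification rather than the actual implementation; ``predict_final_performance`` is a hypothetical stand-in for the learning and predicting stages (Steps 1 and 2).

.. code-block:: python

   from nni.assessor import AssessResult

   THRESHOLD = 0.95  # same default as the `threshold` classArg

   def assess_sketch(trial_history, epoch_num, best_history_performance):
       # Steps 1-2: fit the combined model and predict the final-epoch performance.
       predicted = predict_final_performance(trial_history, epoch_num)  # hypothetical helper
       if predicted is None:
           # Fitting did not converge yet: ask for more intermediate results.
           return AssessResult.Good
       if predicted > best_history_performance * THRESHOLD:
           return AssessResult.Good
       return AssessResult.Bad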
The figure below shows the result of our algorithm on MNIST trial history data, where the green points represent the data already observed by the assessor, the blue points represent future but unknown data, and the red line is the curve predicted by the Curve Fitting Assessor.
.. image:: ../../img/curvefitting_example.PNG
:target: ../../img/curvefitting_example.PNG
:alt: examples
Usage
-----
To use Curve Fitting Assessor, you should add the following spec in your experiment's YAML config file:
.. code-block:: yaml
assessor:
builtinAssessorName: Curvefitting
classArgs:
# (required) The total number of epochs.
# We need to know the number of epochs to determine which points to predict.
epoch_num: 20
# (optional) To save computing resources, we only start to predict after receiving start_step intermediate results.
# The default value of start_step is 6.
start_step: 6
# (optional) The threshold used to decide whether to early stop a poorly performing curve.
# For example: if threshold = 0.95 and the best performance in the history is 0.9, we stop any trial whose predicted value is lower than 0.95 * 0.9 = 0.855.
# The default value of threshold is 0.95.
threshold: 0.95
# (optional) The gap interval between assessor judgements.
# For example: if gap = 2 and start_step = 6, we will assess the result when we receive the 6th, 8th, 10th, 12th... intermediate results.
# The default value of gap is 1.
gap: 1
Limitation
----------
According to the original paper, only incremental functions are supported. Therefore this assessor can only be used to maximize optimization metrics. For example, it can be used for accuracy, but not for loss.
File Structure
--------------
The assessor has a lot of different files, functions, and classes. Here we briefly describe a few of them.
* ``curvefunctions.py`` includes all the function expressions and default parameters.
* ``modelfactory.py`` includes learning and predicting; the corresponding calculation part is also implemented here.
* ``curvefitting_assessor.py`` is the assessor that receives the trial history and assesses whether to stop the trial early.
TODO
----
* Further improve the accuracy of the prediction and test it on more models.
Customize Assessor
==================
NNI supports building your own assessor to meet custom tuning needs.
If you want to implement a customized Assessor, there are three things to do:
#. Inherit the base Assessor class
#. Implement the assess_trial function
#. Configure your customized Assessor in experiment YAML config file
**1. Inherit the base Assessor class**
.. code-block:: python
from nni.assessor import Assessor
class CustomizedAssessor(Assessor):
def __init__(self, ...):
...
**2. Implement the assess_trial function**
.. code-block:: python
from nni.assessor import Assessor, AssessResult
class CustomizedAssessor(Assessor):
def __init__(self, ...):
...
def assess_trial(self, trial_history):
"""
Determines whether a trial should be killed. Must override.
trial_history: a list of intermediate result objects.
Returns AssessResult.Good or AssessResult.Bad.
"""
# your implementation goes here
...
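For concreteness, here is a minimal, self-contained sketch of a customized assessor. The stopping rule itself (stop a trial whose latest result falls below 80% of its own best so far) is arbitrary and only illustrates the required structure; it assumes scalar intermediate results.

.. code-block:: python

   from nni.assessor import Assessor, AssessResult

   class SimpleDropAssessor(Assessor):
       """Stop a trial whose latest result drops below `ratio` of its own best so far."""

       def __init__(self, min_steps=5, ratio=0.8):
           self.min_steps = min_steps
           self.ratio = ratio

       def assess_trial(self, trial_history):
           if len(trial_history) < self.min_steps:
               return AssessResult.Good          # too early to judge
           best_so_far = max(trial_history)      # assuming larger is better
           latest = trial_history[-1]
           return AssessResult.Good if latest >= best_so_far * self.ratio else AssessResult.Bad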
**3. Configure your customized Assessor in experiment YAML config file**
NNI needs to locate your customized Assessor class and instantiate it, so you need to specify the location of the customized Assessor class and pass literal values as parameters to its ``__init__`` constructor.
.. code-block:: yaml
assessor:
codeDir: /home/abc/myassessor
classFileName: my_customized_assessor.py
className: CustomizedAssessor
# Any parameter that needs to be passed to your Assessor's __init__ constructor
# can be specified in this optional classArgs field, for example
classArgs:
arg1: value1
Please note that in **2**, the object ``trial_history`` is exactly the object that the trial sends to the assessor via the SDK function ``report_intermediate_result``.
The working directory of your assessor is ``<home>/nni-experiments/<experiment_id>/log``\ , which can be retrieved with the environment variable ``NNI_LOG_DIRECTORY``.
For more detailed examples, see:
* :githublink:`medianstop-assessor <nni/algorithms/hpo/medianstop_assessor.py>`
* :githublink:`curvefitting-assessor <nni/algorithms/hpo/curvefitting_assessor/>`
Medianstop Assessor on NNI
==========================
Median Stop
-----------
Medianstop is a simple early stopping rule mentioned in this `paper <https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf>`__. It stops a pending trial X after step S if the trial’s best objective value by step S is strictly worse than the median value of the running averages of all completed trials’ objectives reported up to step S.
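As a rough sketch (assuming scalar intermediate results and that larger is better; ``completed_histories`` is a hypothetical name for the recorded histories of all completed trials), the rule can be written as:

.. code-block:: python

   import statistics

   def should_stop(trial_history, completed_histories):
       """Return True if the running trial should be stopped at the current step S."""
       s = len(trial_history)
       best_so_far = max(trial_history)
       # Running average of each completed trial over its first s intermediate results.
       running_avgs = [sum(h[:s]) / s for h in completed_histories if len(h) >= s]
       if not running_avgs:
           return False
       return best_so_far < statistics.median(running_avgs)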
Hyper Parameter Optimization Comparison
=======================================
*Posted by Anonymous Author*
Comparison of Hyperparameter Optimization (HPO) algorithms on several problems.
The Hyperparameter Optimization algorithms compared are listed below:
* `Random Search <../Tuner/BuiltinTuner.rst>`__
* `Grid Search <../Tuner/BuiltinTuner.rst>`__
* `Evolution <../Tuner/BuiltinTuner.rst>`__
* `Anneal <../Tuner/BuiltinTuner.rst>`__
* `Metis <../Tuner/BuiltinTuner.rst>`__
* `TPE <../Tuner/BuiltinTuner.rst>`__
* `SMAC <../Tuner/BuiltinTuner.rst>`__
* `HyperBand <../Tuner/BuiltinTuner.rst>`__
* `BOHB <../Tuner/BuiltinTuner.rst>`__
All algorithms were run in NNI's local environment.
Machine Environment:
.. code-block:: bash
OS: Linux Ubuntu 16.04 LTS
CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz 2600 MHz
Memory: 112 GB
NNI Version: v0.7
NNI Mode(local|pai|remote): local
Python version: 3.6
Is conda or virtualenv used?: Conda
is running in docker?: no
AutoGBDT Example
----------------
Problem Description
^^^^^^^^^^^^^^^^^^^
Nonconvex problem on the hyper-parameter search of `AutoGBDT <../TrialExample/GbdtExample.rst>`__ example.
Search Space
^^^^^^^^^^^^
.. code-block:: json
{
"num_leaves": {
"_type": "choice",
"_value": [10, 12, 14, 16, 18, 20, 22, 24, 28, 32, 48, 64, 96, 128]
},
"learning_rate": {
"_type": "choice",
"_value": [0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.5]
},
"max_depth": {
"_type": "choice",
"_value": [-1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, 32, 48, 64, 96, 128]
},
"feature_fraction": {
"_type": "choice",
"_value": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
},
"bagging_fraction": {
"_type": "choice",
"_value": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
},
"bagging_freq": {
"_type": "choice",
"_value": [1, 2, 4, 8, 10, 12, 14, 16]
}
}
The total search space contains 1,204,224 combinations (14 × 8 × 21 × 8 × 8 × 8). We set the maximum number of trials to 1,000 and the time limit to 48 hours.
Results
^^^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* - Algorithm
- Best loss
- Average of Best 5 Losses
- Average of Best 10 Losses
* - Random Search
- 0.418854
- 0.420352
- 0.421553
* - Random Search
- 0.417364
- 0.420024
- 0.420997
* - Random Search
- 0.417861
- 0.419744
- 0.420642
* - Grid Search
- 0.498166
- 0.498166
- 0.498166
* - Evolution
- 0.409887
- 0.409887
- 0.409887
* - Evolution
- 0.413620
- 0.413875
- 0.414067
* - Evolution
- 0.409887
- 0.409887
- 0.409887
* - Anneal
- 0.414877
- 0.417289
- 0.418281
* - Anneal
- 0.409887
- 0.409887
- 0.410118
* - Anneal
- 0.413683
- 0.416949
- 0.417537
* - Metis
- 0.416273
- 0.420411
- 0.422380
* - Metis
- 0.420262
- 0.423175
- 0.424816
* - Metis
- 0.421027
- 0.424172
- 0.425714
* - TPE
- 0.414478
- 0.414478
- 0.414478
* - TPE
- 0.415077
- 0.417986
- 0.418797
* - TPE
- 0.415077
- 0.417009
- 0.418053
* - SMAC
- **0.408386**
- **0.408386**
- **0.408386**
* - SMAC
- 0.414012
- 0.414012
- 0.414012
* - SMAC
- **0.408386**
- **0.408386**
- **0.408386**
* - BOHB
- 0.410464
- 0.415319
- 0.417755
* - BOHB
- 0.418995
- 0.420268
- 0.422604
* - BOHB
- 0.415149
- 0.418072
- 0.418932
* - HyperBand
- 0.414065
- 0.415222
- 0.417628
* - HyperBand
- 0.416807
- 0.417549
- 0.418828
* - HyperBand
- 0.415550
- 0.415977
- 0.417186
* - GP
- 0.414353
- 0.418563
- 0.420263
* - GP
- 0.414395
- 0.418006
- 0.420431
* - GP
- 0.412943
- 0.416566
- 0.418443
In this example, all the algorithms are used with their default parameters. For Metis, only about 300 trials were run because it is slow, due to the O(n^3) time complexity of its Gaussian Process.
RocksDB Benchmark 'fillrandom' and 'readrandom'
-----------------------------------------------
Problem Description
^^^^^^^^^^^^^^^^^^^
`DB_Bench <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__ is the main tool used to benchmark `RocksDB <https://rocksdb.org/>`__\ 's performance. It has many hyperparameters to tune.
The performance of ``DB_Bench`` depends on the machine configuration and installation method. We run ``DB_Bench`` on a Linux machine with RocksDB installed as a shared library.
Machine configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
RocksDB: version 6.1
CPU: 6 * Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
CPUCache: 35840 KB
Keys: 16 bytes each
Values: 100 bytes each (50 bytes after compression)
Entries: 1000000
Storage performance
^^^^^^^^^^^^^^^^^^^
**Latency**\ : each IO request takes some time to complete; the mean of these times is called the average latency. Several factors affect this time, including network connection quality and hard disk IO performance.
**IOPS**\ : **IO operations per second**\ , the number of *read or write operations* that can be performed in one second.
**IO size**\ : **the size of each IO request**. Depending on the operating system and the application/service that needs disk access, a request is issued to read or write a certain amount of data at a time.
**Throughput (in MB/s) = Average IO size x IOPS**
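For example, with an average IO size of 4 KB and 25,600 IOPS, throughput ≈ 4 KB × 25,600/s ≈ 100 MB/s (an illustrative calculation, not a measured figure).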
IOPS reflects online processing ability, so we use IOPS as the metric in our experiments.
Search Space
^^^^^^^^^^^^
.. code-block:: json
{
"max_background_compactions": {
"_type": "quniform",
"_value": [1, 256, 1]
},
"block_size": {
"_type": "quniform",
"_value": [1, 500000, 1]
},
"write_buffer_size": {
"_type": "quniform",
"_value": [1, 130000000, 1]
},
"max_write_buffer_number": {
"_type": "quniform",
"_value": [1, 128, 1]
},
"min_write_buffer_number_to_merge": {
"_type": "quniform",
"_value": [1, 32, 1]
},
"level0_file_num_compaction_trigger": {
"_type": "quniform",
"_value": [1, 256, 1]
},
"level0_slowdown_writes_trigger": {
"_type": "quniform",
"_value": [1, 1024, 1]
},
"level0_stop_writes_trigger": {
"_type": "quniform",
"_value": [1, 1024, 1]
},
"cache_size": {
"_type": "quniform",
"_value": [1, 30000000, 1]
},
"compaction_readahead_size": {
"_type": "quniform",
"_value": [1, 30000000, 1]
},
"new_table_reader_for_compaction_inputs": {
"_type": "randint",
"_value": [1]
}
}
The search space is enormous (about 10^40 configurations), so we set the maximum number of trials to 100 to limit the computing resources used.
Results
^^^^^^^
'fillrandom' Benchmark
^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Best IOPS (Repeat 1)
- Best IOPS (Repeat 2)
- Best IOPS (Repeat 3)
* - Random
- 449901
- 427620
- 477174
* - Anneal
- 461896
- 467150
- 437528
* - Evolution
- 436755
- 389956
- 389790
* - TPE
- 378346
- 482316
- 468989
* - SMAC
- 491067
- 490472
- **491136**
* - Metis
- 444920
- 457060
- 454438
Figure:
.. image:: ../../img/hpo_rocksdb_fillrandom.png
:target: ../../img/hpo_rocksdb_fillrandom.png
:alt:
'readrandom' Benchmark
^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Best IOPS (Repeat 1)
- Best IOPS (Repeat 2)
- Best IOPS (Repeat 3)
* - Random
- 2276157
- 2285301
- 2275142
* - Anneal
- 2286330
- 2282229
- 2284012
* - Evolution
- 2286524
- 2283673
- 2283558
* - TPE
- 2287366
- 2282865
- 2281891
* - SMAC
- 2270874
- 2284904
- 2282266
* - Metis
- **2287696**
- 2283496
- 2277701
Figure:
.. image:: ../../img/hpo_rocksdb_readrandom.png
:target: ../../img/hpo_rocksdb_readrandom.png
:alt:
Comparison of Filter Pruning Algorithms
=======================================
To provide an initial insight into the performance of various filter pruning algorithms,
we conduct extensive experiments with various pruning algorithms on some benchmark models and datasets.
We present the experiment result in this document.
In addition, we provide friendly instructions on the re-implementation of these experiments to facilitate further contributions to this effort.
Experiment Setting
------------------
The experiments are performed with the following pruners/datasets/models:
*
Models: :githublink:`VGG16, ResNet18, ResNet50 <examples/model_compress/pruning/models/cifar10>`
*
Datasets: CIFAR-10
*
Pruners:
* These pruners are included:
* Pruners with scheduling: ``SimulatedAnnealing Pruner``\ , ``NetAdapt Pruner``\ , ``AutoCompress Pruner``.
Given the overall sparsity requirement, these pruners can automatically generate a sparsity distribution among the different layers.
* One-shot pruners: ``L1Filter Pruner``\ , ``L2Filter Pruner``\ , ``FPGM Pruner``.
The sparsity of each layer is set the same as the overall sparsity in this experiment.
*
Only **filter pruning** performances are compared here.
For the pruners with scheduling, ``L1Filter Pruner`` is used as the base algorithm. That is to say, after the sparsity distribution is decided by the scheduling algorithm, ``L1Filter Pruner`` is used to perform the actual pruning (a minimal usage sketch is given after this list).
*
All the pruners listed above are implemented in :githublink:`nni <docs/en_US/Compression/Overview.rst>`.
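As referenced in the list above, a minimal sketch of how one of the one-shot pruners is invoked might look like the following. The import path varies across NNI versions and the toy model only stands in for the CIFAR-10 models used in the experiments, so treat this as an assumption-laden illustration rather than the exact experiment code (which is linked in the "Implementation Details" section below).

.. code-block:: python

   import torch.nn as nn
   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner  # path may differ by NNI version

   # Toy stand-in for the CIFAR-10 models (VGG16 / ResNet18 / ResNet50) used in the experiments.
   model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))

   # One-shot setting: every Conv2d layer gets the same sparsity as the overall target.
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
   pruner = L1FilterPruner(model, config_list)
   model = pruner.compress()  # applies pruning masks; fine-tuning and model speed-up are separate steps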
Experiment Result
-----------------
For each dataset/model/pruner combination, we prune the model to different levels by setting a series of target sparsities for the pruner.
Here we plot both **Number of Weights - Performances** curve and **FLOPs - Performance** curve.
As a reference, we also plot the result declared in the paper `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <http://arxiv.org/abs/1907.03141>`__ for models VGG16 and ResNet18 on CIFAR-10.
The experiment results are shown in the following figures:
CIFAR-10, VGG16:
.. image:: ../../../examples/model_compress/pruning/comparison_of_pruners/img/performance_comparison_vgg16.png
:target: ../../../examples/model_compress/pruning/comparison_of_pruners/img/performance_comparison_vgg16.png
:alt:
CIFAR-10, ResNet18:
.. image:: ../../../examples/model_compress/pruning/comparison_of_pruners/img/performance_comparison_resnet18.png
:target: ../../../examples/model_compress/pruning/comparison_of_pruners/img/performance_comparison_resnet18.png
:alt:
CIFAR-10, ResNet50:
.. image:: ../../../examples/model_compress/pruning/comparison_of_pruners/img/performance_comparison_resnet50.png
:target: ../../../examples/model_compress/pruning/comparison_of_pruners/img/performance_comparison_resnet50.png
:alt:
Analysis
--------
From the experiment result, we get the following conclusions:
* Given the constraint on the number of parameters, the pruners with scheduling (``AutoCompress Pruner``\ , ``SimulatedAnnealing Pruner``\ ) perform better than the others when the constraint is strict. However, they show no such advantage in the FLOPs/performance comparison, since only the number-of-parameters constraint is considered in the optimization process;
* The basic algorithms ``L1Filter Pruner`` , ``L2Filter Pruner`` , ``FPGM Pruner`` perform very similarly in these experiments;
* ``NetAdapt Pruner`` cannot achieve very high compression rates. This is caused by its mechanism of pruning only one layer in each pruning iteration, which leads to unacceptable complexity if the sparsity per iteration is much lower than the overall sparsity constraint.
Experiments Reproduction
------------------------
Implementation Details
^^^^^^^^^^^^^^^^^^^^^^
*
The experiment results are all collected with the default configuration of the pruners in nni, which means that when we call a pruner class in nni, we don't change any default class arguments.
*
Both FLOPs and the number of parameters are counted with :githublink:`Model FLOPs/Parameters Counter <docs/en_US/Compression/CompressionUtils.md#model-flopsparameters-counter>` after :githublink:`model speed up <docs/en_US/Compression/ModelSpeedup.rst>`.
This avoids the potential issues of counting them on masked models.
*
The experiment code can be found :githublink:`here <examples/model_compress/pruning/auto_pruners_torch.py>`.
Experiment Result Rendering
^^^^^^^^^^^^^^^^^^^^^^^^^^^
*
If you follow the practice in the :githublink:`example <examples/model_compress/pruning/auto_pruners_torch.py>`\ , for every single pruning experiment, the experiment result will be saved in JSON format as follows:
.. code-block:: json
{
"performance": {"original": 0.9298, "pruned": 0.1, "speedup": 0.1, "finetuned": 0.7746},
"params": {"original": 14987722.0, "speedup": 167089.0},
"flops": {"original": 314018314.0, "speedup": 38589922.0}
}
*
The experiment results are saved :githublink:`here <examples/model_compress/pruning/comparison_of_pruners>`.
You can refer to :githublink:`analyze <examples/model_compress/pruning/comparison_of_pruners/analyze.py>` to plot new performance comparison figures.
Contribution
------------
TODO Items
^^^^^^^^^^
* Pruners constrained by FLOPS/latency
* More pruning algorithms/datasets/models
Issues
^^^^^^
For algorithm implementation & experiment issues, please `create an issue <https://github.com/microsoft/nni/issues/new/>`__.
.. role:: raw-html(raw)
:format: html
NNI review article from Zhihu: "an open source project with highly reasonable design" - by Garvin Li
========================================================================================================================
The article is by an NNI user on the Zhihu forum. In it, Garvin shares his experience using NNI for automatic feature engineering. We think this article is very useful for users interested in using NNI for feature engineering. With the author's permission, we translated the original article into English.
**source**\ : `What do you think of Microsoft's newly released AutoML platform NNI? By Garvin Li (in Chinese) <https://www.zhihu.com/question/297982959/answer/964961829?utm_source=wechat_session&utm_medium=social&utm_oi=28812108627968&from=singlemessage&isappinstalled=0>`__
01 Overview of AutoML
---------------------
In the author's opinion, AutoML is not only about hyperparameter optimization but a process
that can target various stages of the machine learning pipeline,
including feature engineering, NAS, and HPO.
02 Overview of NNI
------------------
NNI (Neural Network Intelligence) is an open source AutoML toolkit from
Microsoft, to help users design and tune machine learning models, neural network
architectures, or a complex system’s parameters in an efficient and automatic
way.
Link: `https://github.com/Microsoft/nni <https://github.com/Microsoft/nni>`__
In general, most Microsoft tools share one prominent characteristic: the
design is highly reasonable (regardless of the degree of technological innovation).
NNI's AutoFeatureENG basically meets all user requirements for automatic feature
engineering, with a very reasonable underlying framework design.
03 Details of NNI-AutoFeatureENG
--------------------------------
..
The article is following the github project: `https://github.com/SpongebBob/tabular_automl_NNI <https://github.com/SpongebBob/tabular_automl_NNI>`__.
New users can do AutoFeatureENG with NNI easily and efficiently. To explore the AutoFeatureENG capability, download the required files listed below, then install NNI through pip.
.. image:: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
:target: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
:alt:
NNI treats AutoFeatureENG as a two-step task: feature generation exploration and feature selection. Feature generation exploration is mainly about feature derivation and high-order feature combination.
04 Feature Exploration
----------------------
For feature derivation, NNI offers many operations that can automatically generate new features; they are listed `here <https://github.com/SpongebBob/tabular_automl_NNI/blob/master/AutoFEOp.md>`__\ :
**count**\ : Count encoding is based on replacing categories with their counts computed on the train set, also named frequency encoding.
**target**\ : Target encoding is based on encoding categorical variable values with the mean of target variable per value.
**embedding**\ : Regard features as sentences, generate vectors using *Word2Vec.*
**crosscout**\ : Count encoding on more than one dimension, similar to CTR (Click Through Rate).
**aggregete**\ : Decide the aggregation functions of the features, including min/max/mean/var.
**nunique**\ : Statistics of the number of unique features.
**histsta**\ : Statistics of feature buckets, like histogram statistics.
The search space can be defined in a **JSON file**\ , specifying how features intersect: which two columns are crossed and how new features are generated from the corresponding columns.
.. image:: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
:target: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
:alt:
The picture shows the procedure of defining a search space. NNI provides count encoding as a 1-order operation, as well as cross-count encoding and aggregate statistics (min, max, var, mean, median, nunique) as 2-order operations.
For example, to search for frequency-encoding (value count) features on columns {"C1", ..., "C26"}, we can define the search space in the following way:
.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
:target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
:alt:
We can define a cross frequency encoding (value count on crossed dimensions) method on columns {"C1", ..., "C26"} x {"C1", ..., "C26"} in the following way:
.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
:target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
:alt:
The purpose of exploration is to generate new features. You can use the **get_next_parameter** function to receive the feature candidates for one trial:
.. code-block:: python

   RECEIVED_PARAMS = nni.get_next_parameter()
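A hedged sketch of what a trial script in this setup might look like is shown below. ``make_features``, ``raw_train``, ``raw_valid``, ``y_train`` and ``y_valid`` are hypothetical placeholders for the feature-generation step and the dataset; only ``nni.get_next_parameter`` / ``nni.report_final_result`` and the basic LightGBM calls are standard APIs.

.. code-block:: python

   import nni
   import lightgbm as lgb
   from sklearn.metrics import roc_auc_score

   params = nni.get_next_parameter()  # the feature candidates chosen by the tuner for this trial

   # Hypothetical helper: derive new feature columns according to `params`.
   train_x, valid_x = make_features(raw_train, raw_valid, params)

   train_set = lgb.Dataset(train_x, label=y_train)
   model = lgb.train({'objective': 'binary'}, train_set, num_boost_round=100)

   auc = roc_auc_score(y_valid, model.predict(valid_x))
   nni.report_final_result(auc)  # the metric used to rank this set of feature candidates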
05 Feature selection
--------------------
To avoid feature explosion and overfitting, feature selection is necessary. In the feature selection of NNI-AutoFeatureENG, LightGBM (Light Gradient Boosting Machine), a gradient boosting framework developed by Microsoft, is mainly promoted.
.. image:: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
:target: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
:alt:
If you have used **XGBoost** or **GBDT**\ , you will know that tree-based algorithms can easily calculate the importance of each feature for the results, so LightGBM naturally lends itself to feature selection.
The issue is that the selected features might be applicable to *GBDT* (Gradient Boosting Decision Tree), but not to linear algorithms like *LR* (Logistic Regression).
.. image:: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
:target: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
:alt:
06 Summary
----------
NNI's AutoFeatureENG sets a well-established standard, with a clear operation procedure and available modules that are highly convenient to use. However, a simple model is probably not enough for good results.
Suggestions to NNI
------------------
About exploration: it would be better to consider using a DNN (such as xDeepFM) to extract high-order features.
About selection: there could be more intelligent options, such as an automatic selection system based on downstream models.
Conclusion: NNI is a good open source project that can offer users design inspiration. I suggest researchers leverage it to accelerate their AI research.
Tip: because the scripts of the open source project are compiled with gcc 7, macOS users may encounter gcc (GNU Compiler Collection) problems. The solution is as follows:
.. code-block:: bash
brew install libomp
Use NNI on Google Colab
=======================
NNI can easily run on Google Colab platform. However, Colab doesn't expose its public IP and ports, so by default you can not access NNI's Web UI on Colab. To solve this, you need a reverse proxy software like ``ngrok`` or ``frp``. This tutorial will show you how to use ngrok to access NNI's Web UI on Colab.
How to Open NNI's Web UI on Google Colab
----------------------------------------
#. Install required packages and software.
.. code-block:: bash
! pip install nni # install nni
! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip # download ngrok and unzip it
! unzip ngrok-stable-linux-amd64.zip
! mkdir -p nni_repo
! git clone https://github.com/microsoft/nni.git nni_repo/nni # clone NNI's official repo to get examples
#. Register an ngrok account `here <https://ngrok.com/>`__\ , then connect to your account using your authtoken.
.. code-block:: bash
! ./ngrok authtoken <your-authtoken>
#. Start an NNI example on a port greater than 1024, then start ngrok on the same port. If you want to use a GPU, make sure gpuNum >= 1 in config.yml. Use ``get_ipython()`` to start ngrok, since the notebook will hang if you use ``! ngrok http 5000 &``.
.. code-block:: bash
! nnictl create --config nni_repo/nni/examples/trials/mnist-pytorch/config.yml --port 5000 &
get_ipython().system_raw('./ngrok http 5000 &')
#. Check the public URL.
.. code-block:: bash
! curl -s http://localhost:4040/api/tunnels # don't change the port number 4040
After step 4 you will see a URL like http://xxxx.ngrok.io; open it and you will find NNI's Web UI. Have fun :)
Access Web UI with frp
----------------------
frp is another reverse proxy with similar functionality. However, frp doesn't provide free public URLs, so you may need a server with a public IP to act as the frp server. See `here <https://github.com/fatedier/frp>`__ to learn more about how to deploy frp.
Neural Architecture Search Comparison
=====================================
*Posted by Anonymous Author*
Train and Compare NAS (Neural Architecture Search) models including Autokeras, DARTS, ENAS and NAO.
Their source code links are listed below:
*
Autokeras: `https://github.com/jhfjhfj1/autokeras <https://github.com/jhfjhfj1/autokeras>`__
*
DARTS: `https://github.com/quark0/darts <https://github.com/quark0/darts>`__
*
ENAS: `https://github.com/melodyguan/enas <https://github.com/melodyguan/enas>`__
*
NAO: `https://github.com/renqianluo/NAO <https://github.com/renqianluo/NAO>`__
Experiment Description
----------------------
To avoid over-fitting to **CIFAR-10**\ , we also compare the models on five other datasets: Fashion-MNIST, CIFAR-100, OUI-Adience-Age, ImageNet-10-1 (a subset of ImageNet) and ImageNet-10-2 (another subset of ImageNet). Each ImageNet subset is created by sampling 10 different labels from ImageNet.
.. list-table::
:header-rows: 1
:widths: auto
* - Dataset
- Training Size
- Number of Classes
- Descriptions
* - `Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`__
- 60,000
- 10
- T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot.
* - `CIFAR-10 <https://www.cs.toronto.edu/~kriz/cifar.html>`__
- 50,000
- 10
- Airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks.
* - `CIFAR-100 <https://www.cs.toronto.edu/~kriz/cifar.html>`__
- 50,000
- 100
- Similar to CIFAR-10 but with 100 classes and 600 images each.
* - `OUI-Adience-Age <https://talhassner.github.io/home/projects/Adience/Adience-data.html>`__
- 26,580
- 8
- 8 age groups/labels (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-).
* - `ImageNet-10-1 <http://www.image-net.org/>`__
- 9,750
- 10
- Coffee mug, computer keyboard, dining table, wardrobe, lawn mower, microphone, swing, sewing machine, odometer and gas pump.
* - `ImageNet-10-2 <http://www.image-net.org/>`__
- 9,750
- 10
- Drum, banjo, whistle, grand piano, violin, organ, acoustic guitar, trombone, flute and sax.
We do not change the default fine-tuning technique in their source code. To match each task, only the input image shape and the number of outputs are changed in the code.
The search phase for all NAS methods is **two days**, and so is the retraining time. Average results are reported over **three repetitions**. Our evaluation machines have one Nvidia Tesla P100 GPU, 112 GB of RAM and one 2.60 GHz CPU (Intel E5-2690).
NAO requires too many computing resources, so we only use NAO-WS, which provides the pipeline script.
For AutoKeras, we used version 0.2.18 because it was the latest version when we started the experiments.
NAS Performance
---------------
.. list-table::
:header-rows: 1
:widths: auto
* - NAS
- AutoKeras (%)
- ENAS (macro) (%)
- ENAS (micro) (%)
- DARTS (%)
- NAO-WS (%)
* - Fashion-MNIST
- 91.84
- 95.44
- 95.53
- **95.74**
- 95.20
* - CIFAR-10
- 75.78
- 95.68
- **96.16**
- 94.23
- 95.64
* - CIFAR-100
- 43.61
- 78.13
- 78.84
- **79.74**
- 75.75
* - OUI-Adience-Age
- 63.20
- **80.34**
- 78.55
- 76.83
- 72.96
* - ImageNet-10-1
- 61.80
- 77.07
- 79.80
- **80.48**
- 77.20
* - ImageNet-10-2
- 37.20
- 58.13
- 56.47
- 60.53
- **61.20**
Unfortunately, we could not reproduce all the results in the papers.
The best or average results reported in the papers are:
.. list-table::
:header-rows: 1
:widths: auto
* - NAS
- AutoKeras(%)
- ENAS (macro) (%)
- ENAS (micro) (%)
- DARTS (%)
- NAO-WS (%)
* - CIFAR-10
- 88.56(best)
- 96.13(best)
- 97.11(best)
- 97.17(average)
- 96.47(best)
AutoKeras has relatively poor performance across all datasets due to the random factor in its network morphism.
For ENAS, ENAS (macro) shows good results in OUI-Adience-Age and ENAS (micro) shows good results in CIFAR-10.
DARTS performs well on some datasets but shows high variance on others; the difference among three runs can be up to 5.37% on OUI-Adience-Age and 4.36% on ImageNet-10-1.
NAO-WS shows good results on ImageNet-10-2 but can perform very poorly on OUI-Adience-Age.
Reference
---------
#.
Jin, Haifeng, Qingquan Song, and Xia Hu. "Efficient neural architecture search with network morphism." *arXiv preprint arXiv:1806.10282* (2018).
#.
Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "Darts: Differentiable architecture search." arXiv preprint arXiv:1806.09055 (2018).
#.
Pham, Hieu, et al. "Efficient Neural Architecture Search via Parameters Sharing." international conference on machine learning (2018): 4092-4101.
#.
Luo, Renqian, et al. "Neural Architecture Optimization." neural information processing systems (2018): 7827-7838.
.. role:: raw-html(raw)
:format: html
Parallelizing a Sequential Algorithm TPE
========================================
TPE approaches are actually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete. For the TPE approach, the so-called constant liar heuristic is used: each time a candidate point x∗ is proposed, a fake fitness value y is assigned to it temporarily, until the evaluation completes and reports the actual loss f(x∗).
Introduction and Problems
-------------------------
Sequential Model-based Global Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sequential Model-Based Global Optimization (SMBO) algorithms have been used in many applications where evaluation of the fitness function is expensive. In an application where the true fitness function f: X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or some transformation of the surrogate. The point x∗ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. This active-learning-like algorithm template is summarized in the figure below. SMBO algorithms differ in the criterion they optimize to obtain x∗ given a model (or surrogate) of f, and in how they model f via the observation history H.
.. image:: ../../img/parallel_tpe_search4.PNG
:target: ../../img/parallel_tpe_search4.PNG
:alt:
The algorithms in this work optimize the criterion of Expected Improvement (EI). Other criteria have been suggested, such as Probability of Improvement and Expected Improvement, minimizing the Conditional Entropy of the Minimizer, and the bandit-based criterion. We chose to use the EI criterion in TPE because it is intuitive, and has been shown to work well in a variety of settings. Expected improvement is the expectation under some model M of f : X → R^N that f(x) will exceed (negatively) some threshold y∗:
.. image:: ../../img/parallel_tpe_search_ei.PNG
:target: ../../img/parallel_tpe_search_ei.PNG
:alt:
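Restated in text (following the definition in the reference paper; the image above shows the same formula), the expected improvement below a threshold :math:`y^*` is

.. math::

   EI_{y^*}(x) = \int_{-\infty}^{\infty} \max(y^* - y, 0)\, p(y \mid x)\, dy .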
Since the direct calculation of p(y|x) is expensive, the TPE approach models p(y|x) via p(x|y) and p(y). TPE defines p(x|y) using two densities:
.. image:: ../../img/parallel_tpe_search_tpe.PNG
:target: ../../img/parallel_tpe_search_tpe.PNG
:alt:
where l(x) is the density formed from the observations {x(i)} whose loss
f(x(i)) was less than y∗, and g(x) is the density formed from the remaining observations. The TPE algorithm depends on a y∗ that is larger than the best observed f(x), so that some points can be used to form l(x). TPE chooses y∗ to be some quantile γ of the observed y values, so that p(y<\ ``y∗``\ ) = γ, but no specific model for p(y) is necessary. The tree-structured form of l and g makes it easy to draw many candidates according to l and evaluate them according to g(x)/l(x). On each iteration, the algorithm returns the candidate x∗ with the greatest EI.
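Under this factorization, the reference paper shows that maximizing EI only requires the ratio of the two densities:

.. math::

   EI_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{l(x)}\,(1 - \gamma) \right)^{-1},

so on each iteration candidates with high probability under :math:`l(x)` and low probability under :math:`g(x)` are preferred. (This is a restatement from the cited paper, not a new derivation.)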
Here is a simulation of the TPE algorithm in a two-dimensional search space. The background colors represent different objective values. It can be seen that TPE balances exploration and exploitation very well. (Black indicates the points sampled in the current round, and yellow indicates the points already taken in the history.)
.. image:: ../../img/parallel_tpe_search1.gif
:target: ../../img/parallel_tpe_search1.gif
:alt:
**Since EI is a deterministic function of the current state, the x that maximizes EI is fixed for a given state.** As shown in the figure below, the blue triangle is the point most likely to be sampled in this state.
.. image:: ../../img/parallel_tpe_search_ei2.PNG
:target: ../../img/parallel_tpe_search_ei2.PNG
:alt:
TPE performs well when used sequentially, but with higher concurrency **a large number of points are produced from the same EI state**\ ; such overly concentrated points reduce the exploration ability of the tuner and waste resources.
Here is the simulation when we set ``concurrency=60``\ ; this phenomenon is clearly visible.
.. image:: ../../img/parallel_tpe_search2.gif
:target: ../../img/parallel_tpe_search2.gif
:alt:
Research solution
-----------------
Approximated q-EI Maximization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The multi-point criterion presented below can be used to deliver an additional design of experiments in one step through the resolution of the following optimization problem:
.. image:: ../../img/parallel_tpe_search_qEI.PNG
:target: ../../img/parallel_tpe_search_qEI.PNG
:alt:
However, the computation of q-EI becomes intensive as q increases. Based on our research, there are four popular greedy strategies that approximate the solution of this problem while avoiding its numerical cost.
Solution 1: Believing the OK Predictor: The KB (Kriging Believer) Heuristic Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Kriging Believer strategy replaces the conditional knowledge about the responses at the sites chosen within the last iterations by deterministic values equal to the expectation of the Kriging predictor. Keeping the same notations as previously, the strategy can be summed up as follows:
.. image:: ../../img/parallel_tpe_search_kb.PNG
:target: ../../img/parallel_tpe_search_kb.PNG
:alt:
This sequential strategy delivers a q-points design and is computationally affordable since it relies on the analytically known EI, optimized in d dimensions. However, there is a risk of failure, since believing an OK predictor that overshoots the observed data may lead to a sequence that gets trapped in a non-optimal region for many iterations. We now propose a second strategy that reduces this risk.
Solution 2: The CL(Constant Liar) Heuristic Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let us now consider a sequential strategy in which the metamodel is updated (still without hyperparameter re-estimation) at each iteration with a value L exogenously fixed by the user, here called a "lie". The strategy referred to as the Constant Liar consists of lying with the same value L at every iteration: maximize EI (i.e. find x(n+1)), update the model as if y(x(n+1)) = L, and so on, always with the same L ∈ R:
.. image:: ../../img/parallel_tpe_search_cl.PNG
:target: ../../img/parallel_tpe_search_cl.PNG
:alt:
L should logically be determined on the basis of the values taken by y at X. Three values, min{Y}, mean{Y}, and max{Y} are considered here. **The larger L is, the more explorative the algorithm will be, and vice versa.**
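A minimal sketch of the Constant Liar loop is shown below. ``suggest()`` (maximize EI under the current model) and ``observe(x, y)`` (update the model with an observation) are hypothetical method names standing in for whatever the underlying optimizer provides.

.. code-block:: python

   def constant_liar_batch(optimizer, q, lie):
       """Propose q points for parallel evaluation using a fixed 'lie' value."""
       batch = []
       for _ in range(q):
           x_next = optimizer.suggest()    # maximize EI under the current (partly faked) model
           optimizer.observe(x_next, lie)  # pretend the outcome is `lie` (e.g. min/mean/max of observed y)
           batch.append(x_next)
       # When the real evaluations finish, the fake observations are replaced by the true results.
       return batch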
We simulated the method above. The following figure shows the result of using the mean-value lie to maximize q-EI; the sampled points are now much more scattered.
.. image:: ../../img/parallel_tpe_search3.gif
:target: ../../img/parallel_tpe_search3.gif
:alt:
Experiment
----------
Branin-Hoo
^^^^^^^^^^
The four optimization strategies presented in the last section are now compared on the Branin-Hoo function, which is a classical test case in global optimization.
.. image:: ../../img/parallel_tpe_search_branin.PNG
:target: ../../img/parallel_tpe_search_branin.PNG
:alt:
The recommended values of a, b, c, r, s and t are: a = 1, b = 5.1 ⁄ (4π²), c = 5 ⁄ π, r = 6, s = 10 and t = 1 ⁄ (8π). This function has three global minimizers: (-3.14, 12.27), (3.14, 2.27), (9.42, 2.47).
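For reference, the Branin-Hoo function shown in the image above has the standard form

.. math::

   f(x_1, x_2) = a\,\bigl(x_2 - b x_1^2 + c x_1 - r\bigr)^2 + s\,(1 - t)\cos(x_1) + s ,

with the constants listed above.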
Next is the comparison of the q-EI associated with the q first points (q ∈ [1,10]) given by the constant liar strategies (min and max), 2000 q-points designs uniformly drawn for every q, and 2000 q-points LHS designs taken at random for every q.
.. image:: ../../img/parallel_tpe_search_result.PNG
:target: ../../img/parallel_tpe_search_result.PNG
:alt:
As can be seen in the figure, CL[max] and CL[min] offer very good q-EI results compared to random designs, especially for small values of q.
Gaussian Mixed Model function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We also compared using and not using parallel optimization. A two-dimensional multimodal Gaussian mixture distribution is used for the simulation; the results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* -
- concurrency=80
- concurrency=60
- concurrency=40
- concurrency=20
- concurrency=10
* - Without parallel optimization
- avg = 0.4841 :raw-html:`<br>` var = 0.1953
- avg = 0.5155 :raw-html:`<br>` var = 0.2219
- avg = 0.5773 :raw-html:`<br>` var = 0.2570
- avg = 0.4680 :raw-html:`<br>` var = 0.1994
- avg = 0.2774 :raw-html:`<br>` var = 0.1217
* - With parallel optimization
- avg = 0.2132 :raw-html:`<br>` var = 0.0700
- avg = 0.2177 :raw-html:`<br>` var = 0.0796
- avg = 0.1835 :raw-html:`<br>` var = 0.0533
- avg = 0.1671 :raw-html:`<br>` var = 0.0413
- avg = 0.1918 :raw-html:`<br>` var = 0.0697
Note: the total number of samples per test is 240 (to ensure equal budgets). Each configuration was repeated 1000 times; each cell reports the average and variance of the best results over the 1000 repetitions.
References
----------
[1] James Bergstra, Remi Bardenet, Yoshua Bengio, Balazs Kegl. `Algorithms for Hyper-Parameter Optimization. <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__
[2] Meng-Hiot Lim, Yew-Soon Ong. `Computational Intelligence in Expensive Optimization Problems. <https://link.springer.com/content/pdf/10.1007%2F978-3-642-10701-6.pdf>`__
[3] M. Jordan, J. Kleinberg, B. Schölkopf. `Pattern Recognition and Machine Learning. <http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf>`__
Automatically tuning SVD (NNI in Recommenders)
==============================================
In this tutorial, we first introduce a github repo `Recommenders <https://github.com/Microsoft/Recommenders>`__. It is a repository that provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. It has various models that are popular and widely deployed in recommendation systems. To provide a complete end-to-end experience, they present each example in five key tasks, as shown below:
* `Prepare Data <https://github.com/microsoft/recommenders/tree/master/examples/01_prepare_data>`__\ : Preparing and loading data for each recommender algorithm.
* Model(`collaborative filtering algorithms <https://github.com/microsoft/recommenders/tree/master/examples/02_model_collaborative_filtering>`__\ , `content-based filtering algorithms <https://github.com/microsoft/recommenders/tree/master/examples/02_model_content_based_filtering>`__\ , `hybrid algorithms <https://github.com/microsoft/recommenders/tree/master/examples/02_model_hybrid>`__\ ): Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (\ `ALS <https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS>`__\ ) or eXtreme Deep Factorization Machines (\ `xDeepFM <https://arxiv.org/abs/1803.05170>`__\ ).
* `Evaluate <https://github.com/microsoft/recommenders/tree/master/examples/03_evaluate>`__\ : Evaluating algorithms with offline metrics.
* `Model Select and Optimize <https://github.com/microsoft/recommenders/tree/master/examples/04_model_select_and_optimize>`__\ : Tuning and optimizing hyperparameters for recommender models.
* `Operationalize <https://github.com/microsoft/recommenders/tree/master/examples/05_operationalize>`__\ : Operationalizing models in a production environment on Azure.
The fourth task, tuning and optimizing the model's hyperparameters, is where NNI can help. To give a concrete example of NNI tuning a model in Recommenders, let's demonstrate with the `SVD <https://github.com/microsoft/recommenders/blob/master/examples/02_model_collaborative_filtering/surprise_svd_deep_dive.ipynb>`__ model and the Movielens100k dataset. There are more than 10 hyperparameters to be tuned in this model.
This `Jupyter notebook <https://github.com/microsoft/recommenders/blob/master/examples/04_model_select_and_optimize/nni_surprise_svd.ipynb>`__ provided by Recommenders is a very detailed step-by-step tutorial for this example. It uses different built-in tuning algorithms in NNI, including ``Annealing``\ , ``SMAC``\ , ``Random Search``\ , ``TPE``\ , ``Hyperband``\ , ``Metis`` and ``Evolution``. Finally, the results of the different tuning algorithms are compared. Please go through this notebook to learn how to use NNI to tune the SVD model; you can then use NNI to tune other models in Recommenders.
Automatically tuning SPTAG with NNI
===================================
`SPTAG <https://github.com/microsoft/SPTAG>`__ (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenario released by `Microsoft Research (MSR) <https://www.msra.cn/>`__ and `Microsoft Bing <https://www.bing.com/>`__.
This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 distance or cosine distance. The vectors returned for a query vector are those with the smallest L2 distance or cosine distance to the query vector.
SPTAG provides two methods: kd-tree and relative neighborhood graph (SPTAG-KDT) and balanced k-means tree and relative neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index building cost, and SPTAG-BKT is advantageous in search accuracy in very high-dimensional data.
In SPTAG, there are tens of parameters that can be tuned for specific scenarios or datasets. NNI is a great tool for automatically tuning these parameters. The authors of SPTAG tried NNI for auto tuning and easily found well-performing parameters, so they shared their practice of tuning SPTAG with NNI in their document `here <https://github.com/microsoft/SPTAG/blob/master/docs/Parameters.md>`__. Please refer to it for a detailed tutorial.
######################
Automatic Model Tuning
######################
NNI can be applied on various model tuning tasks. Some state-of-the-art model search algorithms, such as EfficientNet, can be easily built on NNI. Popular models, e.g., recommendation models, can be tuned with NNI. The following are some use cases to illustrate how to leverage NNI in your model tuning tasks and how to build your own pipeline with NNI.
.. toctree::
:maxdepth: 1
Tuning SVD automatically <RecommendersSvd>
EfficientNet on NNI <../TrialExample/EfficientNet>
Automatic Model Architecture Search for Reading Comprehension <../TrialExample/SquadEvolutionExamples>
Parallelizing Optimization for TPE <ParallelizingTpeSearch>
#######################
Automatic System Tuning
#######################
The performance of systems, such as databases and tensor operator implementations, often needs to be tuned to adapt to specific hardware configurations, targeted workloads, etc. Manually tuning a system is complicated and often requires a detailed understanding of hardware and workload. NNI can make such tasks much easier and help system owners find the best configuration automatically. The detailed design philosophy of automatic system tuning can be found in this `paper <https://dl.acm.org/doi/10.1145/3352020.3352031>`__\ . The following are some typical cases where NNI can help.
.. toctree::
:maxdepth: 1
Tuning SPTAG (Space Partition Tree And Graph) automatically <SptagAutoTune>
Tuning the performance of RocksDB <../TrialExample/RocksdbExamples>
Tuning Tensor Operators automatically <../TrialExample/OpEvoExamples>
#######################
Use Cases and Solutions
#######################
Different from the tutorials and examples in the rest of the documentation, which show the usage of individual features, this part introduces end-to-end scenarios and use cases to help users further understand how NNI can help them. NNI can be widely adopted in various scenarios. We also encourage community contributors to share their AutoML practices, especially their NNI usage practices.
Use Cases and Solutions
=======================
.. toctree::
:maxdepth: 2
Automatic Model Tuning (HPO/NAS) <automodel>
Automatic System Tuning (AutoSys) <autosys>
Model Compression <model_compression>
Feature Engineering <feature_engineering>
Performance measurement, comparison and analysis <perf_compare>
Use NNI on Google Colab <NNI_colab_support>
External Repositories and References
====================================
With the authors' permission, we list a set of NNI usage examples and relevant articles.
External Repositories
=====================
* `Hyperparameter Tuning for Matrix Factorization <https://github.com/microsoft/recommenders/blob/master/examples/04_model_select_and_optimize/nni_surprise_svd.ipynb>`__ with NNI
* `scikit-nni <https://github.com/ksachdeva/scikit-nni>`__ Hyper-parameter search for scikit-learn pipelines using NNI
Relevant Articles
=================
* `Cost-effective Hyper-parameter Tuning using AdaptDL with NNI - Feb 23, 2021 <https://medium.com/casl-project/cost-effective-hyper-parameter-tuning-using-adaptdl-with-nni-e55642888761>`__
* `(in Chinese) A summary of NNI new capabilities in NNI 2.0 - Jan 21, 2021 <https://www.msra.cn/zh-cn/news/features/nni-2>`__
* `(in Chinese) A summary of NNI new capabilities in 2019 - Dec 26, 2019 <https://mp.weixin.qq.com/s/7_KRT-rRojQbNuJzkjFMuA>`__
* `Find thy hyper-parameters for scikit-learn pipelines using Microsoft NNI - Nov 6, 2019 <https://towardsdatascience.com/find-thy-hyper-parameters-for-scikit-learn-pipelines-using-microsoft-nni-f1015b1224c1>`__
* `(in Chinese) AutoML tools (Advisor, NNI and Google Vizier) comparison - Aug 05, 2019 <http://gaocegege.com/Blog/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/katib-new#%E6%80%BB%E7%BB%93%E4%B8%8E%E5%88%86%E6%9E%90>`__
* `Hyper Parameter Optimization Comparison <./HpoComparison.rst>`__
* `Neural Architecture Search Comparison <./NasComparison.rst>`__
* `Parallelizing a Sequential Algorithm TPE <./ParallelizingTpeSearch.rst>`__
* `Automatically tuning SVD with NNI <./RecommendersSvd.rst>`__
* `Automatically tuning SPTAG with NNI <./SptagAutoTune.rst>`__