NNI provides state-of-the-art tuning algorithms within its builtin Assessors and makes them easy to use. Below is a brief overview of NNI's current builtin Assessors.
Note: Click the **Assessor's name** to get each Assessor's installation requirements, suggested usage scenario, and a config example. A link to a detailed description of each algorithm is provided at the end of the suggested scenario for each Assessor.
Currently, we support the following Assessors:
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Assessor
     - Brief Introduction of Algorithm
   * - `Medianstop <#MedianStop>`__
     - Medianstop is a simple early stopping rule. It stops a pending trial X at step S if the trial's best objective value by step S is strictly worse than the median value of the running averages of all completed trials' objectives reported up to step S. `Reference Paper <https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf>`__
   * - `Curvefitting <#Curvefitting>`__
     - The Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the prediction of the final epoch's performance is worse than the best final performance in the trial history. In this algorithm, we use 12 curves to fit the accuracy curve. `Reference Paper <http://aad.informatik.uni-freiburg.de/papers/15-IJCAI-Extrapolation_of_Learning_Curves.pdf>`__
Usage of Builtin Assessors
--------------------------
Usage of builtin assessors provided by the NNI SDK requires one to declare the **builtinAssessorName** and **classArgs** in the ``config.yml`` file. In this part, we will introduce the details of usage and the suggested scenarios, classArg requirements, and an example for each assessor.
Note: Please follow the provided format when writing your ``config.yml`` file.
:raw-html:`<a name="MedianStop"></a>`
Median Stop Assessor
^^^^^^^^^^^^^^^^^^^^
..

   Builtin Assessor Name: **Medianstop**
**Suggested scenario**
It's applicable in a wide range of performance curves, thus, it can be used in various scenarios to speed up the tuning progress. `Detailed Description <./MedianstopAssessor.rst>`__
**classArgs requirements:**
* **optimize_mode** (*maximize or minimize, optional, default = maximize*\ ) - If 'maximize', assessor will **stop** the trial with smaller expectation. If 'minimize', assessor will **stop** the trial with larger expectation.
* **start_step** (*int, optional, default = 0*\ ) - A trial is determined to be stopped or not only after receiving start_step number of reported intermediate results.
**Usage example:**
.. code-block:: yaml

   # config.yml
   assessor:
     builtinAssessorName: Medianstop
     classArgs:
       optimize_mode: maximize
       start_step: 5
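For intuition, the rule described above can be sketched in a few lines of Python. This is an illustrative simplification, not NNI's actual implementation; the function name ``median_stop`` and the string return values are ours:

```python
# Illustrative sketch of the median stopping rule (NOT NNI's implementation).
# Stop a running trial if its best objective so far is strictly worse than the
# median of the completed trials' running averages at the same step.
from statistics import median

def median_stop(trial_history, completed_histories, optimize_mode="maximize"):
    """trial_history: intermediate results of the running trial.
    completed_histories: lists of intermediate results of completed trials."""
    s = len(trial_history)
    best = max(trial_history) if optimize_mode == "maximize" else min(trial_history)
    # Running average of each completed trial over its first s results.
    running_avgs = [sum(h[:s]) / s for h in completed_histories if len(h) >= s]
    if not running_avgs:
        return "Good"  # nothing to compare against yet
    m = median(running_avgs)
    worse = best < m if optimize_mode == "maximize" else best > m
    return "Bad" if worse else "Good"
```

With ``optimize_mode: maximize``, a trial whose best accuracy so far trails the median running average of completed trials is stopped.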
:raw-html:`<br>`
:raw-html:`<a name="Curvefitting"></a>`
Curve Fitting Assessor
^^^^^^^^^^^^^^^^^^^^^^
..

   Builtin Assessor Name: **Curvefitting**
**Suggested scenario**
It's applicable in a wide range of performance curves, thus, it can be used in various scenarios to speed up the tuning progress. Even better, it's able to handle and assess curves with similar performance. `Detailed Description <./CurvefittingAssessor.rst>`__
**Note**\ , according to the original paper, only incremental functions are supported. Therefore this assessor can only be used to maximize optimization metrics. For example, it can be used for accuracy, but not for loss.
**classArgs requirements:**
* **epoch_num** (*int, required*\ ) - The total number of epochs. We need to know the number of epochs to determine which points we need to predict.
* **start_step** (*int, optional, default = 6*\ ) - A trial is determined to be stopped or not only after receiving start_step number of reported intermediate results.
* **threshold** (*float, optional, default = 0.95*\ ) - The threshold used to decide whether to early stop the worst performance curves. For example: if threshold = 0.95 and the best performance in the history is 0.9, then we will stop any trial whose predicted value is lower than 0.95 * 0.9 = 0.855.
* **gap** (*int, optional, default = 1*\ ) - The gap interval between Assessor judgements. For example: if gap = 2 and start_step = 6, then we will assess the result when we get 6, 8, 10, 12... intermediate results.
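The interplay of ``start_step`` and ``gap`` described above can be made concrete with a small helper (a hypothetical function for illustration, not part of NNI's API):

```python
# Which intermediate-result steps does the assessor judge, given start_step and gap?
def assessed_steps(start_step, gap, total_steps):
    """Return the steps at which an assessment happens, per our reading of the docs."""
    return [s for s in range(1, total_steps + 1)
            if s >= start_step and (s - start_step) % gap == 0]
```

For example, ``assessed_steps(6, 2, 12)`` yields the steps 6, 8, 10, 12 from the docstring example above.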
The Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the prediction of the final epoch's performance is worse than the best final performance in the trial history.
In this algorithm, we use 12 curves to fit the learning curve. The set of parametric curve models are chosen from this `reference paper <http://aad.informatik.uni-freiburg.de/papers/15-IJCAI-Extrapolation_of_Learning_Curves.pdf>`__. The learning curves' shape coincides with our prior knowledge about the form of learning curves: They are typically increasing, saturating functions.
Assuming additive Gaussian noise, with the noise parameter initialized to its maximum likelihood estimate, we determine the maximum-probability value of the combined parameter vector by learning from the historical data. We use this value to predict future trial performance and stop inadequate trials to save computing resources.
Concretely, this algorithm goes through three stages of learning, predicting, and assessing.
*
  Step 1: Learning. We learn from the trial history of the current trial and determine ξ from a Bayesian perspective. First, we fit each curve using the least-squares method, implemented in ``fit_theta``. After obtaining the parameters, we filter the curves and remove outliers, implemented in ``filter_curve``. Finally, we use the MCMC sampling method, implemented in ``mcmc_sampling``\ , to adjust the weight of each curve. At this point, all the parameters in ξ are determined.

*
  Step 2: Predicting. We calculate the expected final accuracy, implemented in ``f_comb``\ , at the target position (i.e., the total number of epochs) using ξ and the formula of the combined model.

*
  Step 3: Assessing. If the fitting result doesn't converge, the predicted value is ``None``. In this case, we return ``AssessResult.Good`` to ask for more accuracy information and predict again later. Otherwise, we get a positive value from the ``predict()`` function. If this value is strictly greater than the best final performance in history * ``THRESHOLD`` (default value = 0.95), we return ``AssessResult.Good``\ ; otherwise, we return ``AssessResult.Bad``.
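The assessing decision alone (Step 3) can be sketched as follows; ``predicted_final`` stands in for the output of the real curve-fitting prediction, and the string return values are a simplification of ``AssessResult``:

```python
# Sketch of the assessing stage (Step 3) only; not NNI's actual code.
def assess(predicted_final, best_history_final, threshold=0.95):
    if predicted_final is None:       # fitting did not converge
        return "Good"                 # ask for more intermediate results
    # Keep the trial only if the prediction strictly beats best * threshold.
    if predicted_final > threshold * best_history_final:
        return "Good"
    return "Bad"
```

With the numbers from the ``threshold`` example above, a prediction of 0.86 against a history best of 0.9 survives (0.86 > 0.855), while 0.85 does not.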
The figure below shows the result of our algorithm on MNIST trial history data, where the green points represent the data obtained by the Assessor, the blue points represent the future but unknown data, and the red line is the curve predicted by the Curve Fitting Assessor.
.. image:: ../../img/curvefitting_example.PNG
   :target: ../../img/curvefitting_example.PNG
   :alt: examples
Usage
-----
To use Curve Fitting Assessor, you should add the following spec in your experiment's YAML config file:
.. code-block:: yaml

   assessor:
     builtinAssessorName: Curvefitting
     classArgs:
       # (required) The total number of epochs.
       # We need to know the number of epochs to determine which points we need to predict.
       epoch_num: 20
       # (optional) The assessor starts to predict only after receiving start_step intermediate results.
       # The default value of start_step is 6.
       start_step: 6
       # (optional) The threshold used to decide whether to early stop the worst performance curves.
       # For example: if threshold = 0.95 and the best performance in the history is 0.9,
       # then we will stop any trial whose predicted value is lower than 0.95 * 0.9 = 0.855.
       # The default value of threshold is 0.95.
       threshold: 0.95
       # (optional) The gap interval between Assessor judgements.
       # For example: if gap = 2 and start_step = 6, then we will assess the result
       # when we get 6, 8, 10, 12... intermediate results.
       # The default value of gap is 1.
       gap: 1
Limitation
----------
According to the original paper, only incremental functions are supported. Therefore this assessor can only be used to maximize optimization metrics. For example, it can be used for accuracy, but not for loss.
File Structure
--------------
The assessor has a lot of different files, functions, and classes. Here we briefly describe a few of them.
* ``curvefunctions.py`` includes all the function expressions and default parameters.
* ``modelfactory.py`` includes learning and predicting; the corresponding calculation part is also implemented here.
* ``curvefitting_assessor.py`` is the assessor which receives the trial history and assesses whether to early stop the trial.
TODO
----
* Further improve the accuracy of the prediction and test it on more models.
NNI supports building your own assessor to meet specific tuning demands.
If you want to implement a customized Assessor, there are three things to do:
#. Inherit the base Assessor class
#. Implement assess_trial function
#. Configure your customized Assessor in experiment YAML config file
**1. Inherit the base Assessor class**
.. code-block:: python

   from nni.assessor import Assessor

   class CustomizedAssessor(Assessor):
       def __init__(self, ...):
           ...
**2. Implement assess trial function**
.. code-block:: python

   from nni.assessor import Assessor, AssessResult

   class CustomizedAssessor(Assessor):
       def __init__(self, ...):
           ...

       def assess_trial(self, trial_history):
           """
           Determines whether a trial should be killed. Must override.
           trial_history: a list of intermediate result objects.
           Returns AssessResult.Good or AssessResult.Bad.
           """
           # implement your code here
           ...
**3. Configure your customized Assessor in experiment YAML config file**
NNI needs to locate your customized Assessor class and instantiate the class, so you need to specify the location of the customized Assessor class and pass literal values as parameters to the __init__ constructor.
.. code-block:: yaml

   assessor:
     codeDir: /home/abc/myassessor
     classFileName: my_customized_assessor.py
     className: CustomizedAssessor
     # Any parameter that needs to be passed to your Assessor class's __init__ constructor
     # can be specified in this optional classArgs field, for example
     classArgs:
       arg1: value1
Please note in **2** that the object ``trial_history`` is exactly the object that the Trial sends to the Assessor via the SDK's ``report_intermediate_result`` function.
The working directory of your assessor is ``<home>/nni-experiments/<experiment_id>/log``\ , which can be retrieved with the environment variable ``NNI_LOG_DIRECTORY``.
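Putting the pieces together, here is a complete but hypothetical customized assessor body. The class name and stopping rule are ours; in a real assessor this logic would live inside a subclass of ``nni.assessor.Assessor`` and return ``AssessResult.Good``/``AssessResult.Bad`` instead of strings:

```python
# A hypothetical assess_trial implementation: stop a trial whose latest
# intermediate result falls below a fraction of the best result seen so far.
class SimpleGapAssessor:
    def __init__(self, ratio=0.8, start_step=3):
        self.ratio = ratio
        self.start_step = start_step
        self.best_seen = float("-inf")

    def assess_trial(self, trial_history):
        # Track the best intermediate result reported by any trial so far.
        self.best_seen = max(self.best_seen, max(trial_history))
        if len(trial_history) < self.start_step:
            return "Good"  # too early to judge
        latest = trial_history[-1]
        return "Good" if latest >= self.ratio * self.best_seen else "Bad"
```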
Medianstop is a simple early stopping rule mentioned in this `paper <https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf>`__. It stops a pending trial X after step S if the trial’s best objective value by step S is strictly worse than the median value of the running averages of all completed trials’ objectives reported up to step S.
NNI's command line tool **nnictl** supports auto-completion, i.e., you can complete an nnictl command by pressing the ``tab`` key.
For example, if the current command is
.. code-block:: bash

   nnictl cre

By pressing the ``tab`` key, it will be completed to

.. code-block:: bash

   nnictl create
For now, auto-completion will not be enabled by default if you install NNI through ``pip``\ , and it only works on Linux with bash shell. If you want to enable this feature on your computer, please refer to the following steps:
Here, ``{nni-version}`` should be replaced by the version of NNI, e.g., ``master``\ , ``v1.9``. You can also check the latest ``bash-completion`` script :githublink:`here <tools/bash-completion>`.
Step 2. Install the script
^^^^^^^^^^^^^^^^^^^^^^^^^^
If you are running as root and want to install this script for all users
The total search space size is 1,204,224; we set the maximum number of trials to 1,000. The time limit is 48 hours.
Results
^^^^^^^
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Algorithm
     - Best loss
     - Average of Best 5 Losses
     - Average of Best 10 Losses
   * - Random Search
     - 0.418854
     - 0.420352
     - 0.421553
   * - Random Search
     - 0.417364
     - 0.420024
     - 0.420997
   * - Random Search
     - 0.417861
     - 0.419744
     - 0.420642
   * - Grid Search
     - 0.498166
     - 0.498166
     - 0.498166
   * - Evolution
     - 0.409887
     - 0.409887
     - 0.409887
   * - Evolution
     - 0.413620
     - 0.413875
     - 0.414067
   * - Evolution
     - 0.409887
     - 0.409887
     - 0.409887
   * - Anneal
     - 0.414877
     - 0.417289
     - 0.418281
   * - Anneal
     - 0.409887
     - 0.409887
     - 0.410118
   * - Anneal
     - 0.413683
     - 0.416949
     - 0.417537
   * - Metis
     - 0.416273
     - 0.420411
     - 0.422380
   * - Metis
     - 0.420262
     - 0.423175
     - 0.424816
   * - Metis
     - 0.421027
     - 0.424172
     - 0.425714
   * - TPE
     - 0.414478
     - 0.414478
     - 0.414478
   * - TPE
     - 0.415077
     - 0.417986
     - 0.418797
   * - TPE
     - 0.415077
     - 0.417009
     - 0.418053
   * - SMAC
     - **0.408386**
     - **0.408386**
     - **0.408386**
   * - SMAC
     - 0.414012
     - 0.414012
     - 0.414012
   * - SMAC
     - **0.408386**
     - **0.408386**
     - **0.408386**
   * - BOHB
     - 0.410464
     - 0.415319
     - 0.417755
   * - BOHB
     - 0.418995
     - 0.420268
     - 0.422604
   * - BOHB
     - 0.415149
     - 0.418072
     - 0.418932
   * - HyperBand
     - 0.414065
     - 0.415222
     - 0.417628
   * - HyperBand
     - 0.416807
     - 0.417549
     - 0.418828
   * - HyperBand
     - 0.415550
     - 0.415977
     - 0.417186
   * - GP
     - 0.414353
     - 0.418563
     - 0.420263
   * - GP
     - 0.414395
     - 0.418006
     - 0.420431
   * - GP
     - 0.412943
     - 0.416566
     - 0.418443
In this example, all the algorithms are used with their default parameters. Metis ran only about 300 trials because it is slow, due to the O(n^3) time complexity of the Gaussian Process.
RocksDB Benchmark 'fillrandom' and 'readrandom'
-----------------------------------------------
Problem Description
^^^^^^^^^^^^^^^^^^^
`DB_Bench <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__ is the main tool used to benchmark `RocksDB <https://rocksdb.org/>`__\ 's performance. It has many hyperparameters to tune.
The performance of ``DB_Bench`` depends on the machine configuration and installation method. We ran ``DB_Bench`` on a Linux machine with RocksDB installed as a shared library.
Machine configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash

   RocksDB: version 6.1
   CPU: 6 * Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
   CPUCache: 35840 KB
   Keys: 16 bytes each
   Values: 100 bytes each (50 bytes after compression)
   Entries: 1000000
Storage performance
^^^^^^^^^^^^^^^^^^^
**Latency**\ : each IO request takes some time to complete; this is called the average latency. Several factors can affect this time, including network connection quality and hard disk IO performance.

**IOPS**\ : **IO operations per second**\ , which means the number of *read or write operations* that can be done in one second.

**IO size**\ : **the size of each IO request**. Depending on the operating system and the application/service that needs disk access, it will issue a request to read or write a certain amount of data at a time.
**Throughput (in MB/s) = Average IO size x IOPS**
IOPS is related to online processing ability, so we use IOPS as the metric in our experiment.
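As a quick sanity check of the throughput relation above (the numbers here are illustrative, not from our benchmark):

```python
# Throughput (MB/s) = average IO size * IOPS; a quick numeric check.
def throughput_mb_s(avg_io_size_kb, iops):
    return avg_io_size_kb * iops / 1024  # KB/s -> MB/s

# 4 KB requests at 25600 IOPS give 100.0 MB/s.
print(throughput_mb_s(4, 25600))
```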
Search Space
^^^^^^^^^^^^
.. code-block:: json

   {
       "max_background_compactions": {
           "_type": "quniform",
           "_value": [1, 256, 1]
       },
       "block_size": {
           "_type": "quniform",
           "_value": [1, 500000, 1]
       },
       "write_buffer_size": {
           "_type": "quniform",
           "_value": [1, 130000000, 1]
       },
       "max_write_buffer_number": {
           "_type": "quniform",
           "_value": [1, 128, 1]
       },
       "min_write_buffer_number_to_merge": {
           "_type": "quniform",
           "_value": [1, 32, 1]
       },
       "level0_file_num_compaction_trigger": {
           "_type": "quniform",
           "_value": [1, 256, 1]
       },
       "level0_slowdown_writes_trigger": {
           "_type": "quniform",
           "_value": [1, 1024, 1]
       },
       "level0_stop_writes_trigger": {
           "_type": "quniform",
           "_value": [1, 1024, 1]
       },
       "cache_size": {
           "_type": "quniform",
           "_value": [1, 30000000, 1]
       },
       "compaction_readahead_size": {
           "_type": "quniform",
           "_value": [1, 30000000, 1]
       },
       "new_table_reader_for_compaction_inputs": {
           "_type": "randint",
           "_value": [1]
       }
   }
The search space is enormous (about 10^40), so we set the maximum number of trials to 100 to limit the computation resources.
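For reference, here is how a tuner might draw one configuration from the ``quniform`` entries above. This is a sketch of quniform semantics (uniform draw, rounded to a multiple of ``q``, clipped to the range), not NNI's exact sampling code:

```python
import random

def sample_quniform(low, high, q):
    """Sketch of quniform: round(uniform(low, high) / q) * q, clipped to [low, high]."""
    value = round(random.uniform(low, high) / q) * q
    return min(high, max(low, value))

random.seed(0)
# Draw a few of the parameters from the search space above.
params = {name: sample_quniform(low, high, q)
          for name, (low, high, q) in {
              "max_background_compactions": (1, 256, 1),
              "block_size": (1, 500000, 1),
              "max_write_buffer_number": (1, 128, 1),
          }.items()}
```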
The sparsity of each layer is set the same as the overall sparsity in this experiment.
*
  Only **filter pruning** performances are compared here.

  For the pruners with scheduling, ``L1Filter Pruner`` is used as the base algorithm. That is to say, after the sparsity distribution is decided by the scheduling algorithm, ``L1Filter Pruner`` is used to perform the actual pruning.

*
  All the pruners listed above are implemented in :githublink:`nni <docs/en_US/Compression/Overview.rst>`.
Experiment Result
-----------------
For each dataset/model/pruner combination, we prune the model to different levels by setting a series of target sparsities for the pruner.
Here we plot both the **Number of Weights - Performance** curve and the **FLOPs - Performance** curve.
As a reference, we also plot the result declared in the paper `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <http://arxiv.org/abs/1907.03141>`__ for models VGG16 and ResNet18 on CIFAR-10.
The experiment results are shown in the following figures:
From the experiment result, we get the following conclusions:
* Given a constraint on the number of parameters, the pruners with scheduling (``AutoCompress Pruner``\ , ``SimulatedAnnealing Pruner``\ ) perform better than the others when the constraint is strict. However, they have no such advantage in the FLOPs/Performance comparison, since only the number-of-parameters constraint is considered in the optimization process;
* The basic algorithms ``L1Filter Pruner``\ , ``L2Filter Pruner``\ , and ``FPGM Pruner`` perform very similarly in these experiments;
* ``NetAdapt Pruner`` cannot achieve a very high compression rate. This is caused by its mechanism of pruning only one layer in each pruning iteration, which leads to unacceptable complexity if the sparsity per iteration is much lower than the overall sparsity constraint.
Experiments Reproduction
------------------------
Implementation Details
^^^^^^^^^^^^^^^^^^^^^^
*
  The experiment results are all collected with the default configuration of the pruners in nni, which means that when we call a pruner class in nni, we don't change any default class arguments.

*
  Both FLOPs and the number of parameters are counted with :githublink:`Model FLOPs/Parameters Counter <docs/en_US/Compression/CompressionUtils.md#model-flopsparameters-counter>` after :githublink:`model speed up <docs/en_US/Compression/ModelSpeedup.rst>`. This avoids potential issues of counting them on masked models.

*
  The experiment code can be found :githublink:`here <examples/model_compress/auto_pruners_torch.py>`.
Experiment Result Rendering
^^^^^^^^^^^^^^^^^^^^^^^^^^^
*
  If you follow the practice in the :githublink:`example <examples/model_compress/auto_pruners_torch.py>`\ , for every single pruning experiment, the experiment result will be saved in JSON format as follows:
This article was written by an NNI user on the Zhihu forum. In it, Garvin shares his experience of using NNI for automatic feature engineering. We think this article is very useful for users who are interested in using NNI for feature engineering. With the author's permission, we translated the original article into English.

**Source (in Chinese)**\ : `如何看待微软最新发布的AutoML平台NNI?By Garvin Li <https://www.zhihu.com/question/297982959/answer/964961829?utm_source=wechat_session&utm_medium=social&utm_oi=28812108627968&from=singlemessage&isappinstalled=0>`__
01 Overview of AutoML
---------------------
In the author's opinion, AutoML is not only about hyperparameter optimization, but
also a process that can target various stages of the machine learning pipeline,
including feature engineering, NAS, HPO, etc.
02 Overview of NNI
------------------
NNI (Neural Network Intelligence) is an open source AutoML toolkit from
Microsoft that helps users design and tune machine learning models, neural network
architectures, or a complex system's parameters in an efficient and automatic way.
In general, most Microsoft tools share one prominent characteristic: the
design is highly reasonable (regardless of the degree of technological innovation).
NNI's AutoFeatureENG basically meets all user requirements for automatic feature
engineering with a very reasonably designed underlying framework.
03 Details of NNI-AutoFeatureENG
--------------------------------
..

   This article follows the GitHub project: `https://github.com/SpongebBob/tabular_automl_NNI <https://github.com/SpongebBob/tabular_automl_NNI>`__.
Any new user can do AutoFeatureENG with NNI easily and efficiently. To explore the AutoFeatureENG capability, download the required files, and then install NNI through pip.

NNI treats AutoFeatureENG as a two-step task: feature generation exploration and feature selection. Feature generation exploration is mainly about feature derivation and high-order feature combination.
04 Feature Exploration
----------------------
For feature derivation, NNI offers many operations which can automatically generate new features; they are listed `as follows <https://github.com/SpongebBob/tabular_automl_NNI/blob/master/AutoFEOp.rst>`__\ :
**count**\ : Count encoding is based on replacing categories with their counts computed on the train set, also named frequency encoding.
**target**\ : Target encoding is based on encoding categorical variable values with the mean of target variable per value.
**embedding**\ : Regard features as sentences, generate vectors using *Word2Vec.*
**crosscout**\ : Count encoding on more than one dimension, similar to CTR (Click-Through Rate).
**aggregete**\ : Decide the aggregation functions of the features, including min/max/mean/var.
**nunique**\ : Statistics of the number of unique features.
**histsta**\ : Statistics of feature buckets, like histogram statistics.
The search space can be defined in a **JSON file**\ : it specifies how specific features intersect, which columns intersect, and how features are generated from the corresponding columns.

The picture shows the procedure of defining a search space. NNI provides count encoding for 1-order ops, as well as cross-count encoding and aggregate statistics (min, max, var, mean, median, nunique) for 2-order ops.

For example, we can search for frequency-encoding (value count) features on the columns named {"C1", ..., "C26"} in the following way:

The purpose of Exploration is to generate new features. You can use the **get_next_parameter** function to get the feature candidates for one trial.
.. code-block:: python

   RECEIVED_PARAMS = nni.get_next_parameter()
05 Feature selection
--------------------
To avoid feature explosion and overfitting, feature selection is necessary. In the feature selection of NNI-AutoFeatureENG, LightGBM (Light Gradient Boosting Machine), a gradient boosting framework developed by Microsoft, is mainly promoted.
If you have used **XGBoost** or **GBDT**\ , you would know that tree-based algorithms can easily calculate the importance of each feature for the results, so LightGBM is able to do feature selection naturally.
The issue is that the selected features might be applicable to *GBDT* (Gradient Boosting Decision Tree), but not to linear algorithms like *LR* (Logistic Regression).
NNI's AutoFeatureEng sets a well-established standard, showing us the operation procedure and the available modules, and it is highly convenient to use. However, a simple model is probably not enough for good results.
Suggestions to NNI
------------------
About Exploration: it might be better to consider using a DNN (like xDeepFM) to extract high-order features.

About Selection: there could be more intelligent options, such as an automatic selection system based on downstream models.

Conclusion: NNI can offer users some design inspiration and it is a good open source project. I suggest researchers leverage it to accelerate AI research.
Tips: Because the scripts of open source projects are compiled with gcc7, macOS may encounter gcc (GNU Compiler Collection) problems. The solution is as follows:
NNI can easily run on Google Colab platform. However, Colab doesn't expose its public IP and ports, so by default you can not access NNI's Web UI on Colab. To solve this, you need a reverse proxy software like ``ngrok`` or ``frp``. This tutorial will show you how to use ngrok to access NNI's Web UI on Colab.
How to Open NNI's Web UI on Google Colab
----------------------------------------
#. Install the required packages and software.

   .. code-block:: bash

      ! pip install nni # install nni
      ! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip # download ngrok and unzip it
      ! unzip ngrok-stable-linux-amd64.zip
      ! mkdir -p nni_repo
      ! git clone https://github.com/microsoft/nni.git nni_repo/nni # clone NNI's official repo to get examples
#. Register an ngrok account `here <https://ngrok.com/>`__\ , then connect to your account using your authtoken.

   .. code-block:: bash

      ! ./ngrok authtoken <your-authtoken>
#. Start an NNI example on a port bigger than 1024, then start ngrok with the same port. If you want to use a GPU, make sure gpuNum >= 1 in config.yml. Use ``get_ipython()`` to start ngrok, since it will get stuck if you use ``! ngrok http 5000 &``.
.. code-block:: bash

   ! curl -s http://localhost:4040/api/tunnels # don't change the port number 4040
You will see a URL like http://xxxx.ngrok.io after step 4; open this URL and you will find NNI's Web UI. Have fun :)
Access Web UI with frp
----------------------
frp is another reverse proxy tool with similar functionality. However, frp doesn't provide free public URLs, so you may need a server with a public IP to act as the frp server. See `here <https://github.com/fatedier/frp>`__ for more about how to deploy frp.
To avoid over-fitting in **CIFAR-10**\ , we also compare the models in the other five datasets including Fashion-MNIST, CIFAR-100, OUI-Adience-Age, ImageNet-10-1 (subset of ImageNet), ImageNet-10-2 (another subset of ImageNet). We just sample a subset with 10 different labels from ImageNet to make ImageNet-10-1 or ImageNet-10-2.
- Coffee mug, computer keyboard, dining table, wardrobe, lawn mower, microphone, swing, sewing machine, odometer and gas pump.
* - `ImageNet-10-2 <http://www.image-net.org/>`__
- 9,750
- 10
- Drum, banjo, whistle, grand piano, violin, organ, acoustic guitar, trombone, flute and sax.
We do not change the default fine-tuning technique in their source code. To match each task, the code for the input image shape and the number of outputs was changed.
The search phase time for all NAS methods is **two days**\ , as is the retrain time. Average results are reported based on **three repeated runs**. Our evaluation machines each have one NVIDIA Tesla P100 GPU, 112GB of RAM and one 2.60GHz CPU (Intel E5-2690).
NAO requires too many computing resources, so we only used NAO-WS, which provides the pipeline script.
For AutoKeras, we used version 0.2.18 because it was the latest version when we started the experiment.
NAS Performance
---------------
.. list-table::
   :header-rows: 1
   :widths: auto

   * - NAS
     - AutoKeras (%)
     - ENAS (macro) (%)
     - ENAS (micro) (%)
     - DARTS (%)
     - NAO-WS (%)
   * - Fashion-MNIST
     - 91.84
     - 95.44
     - 95.53
     - **95.74**
     - 95.20
   * - CIFAR-10
     - 75.78
     - 95.68
     - **96.16**
     - 94.23
     - 95.64
   * - CIFAR-100
     - 43.61
     - 78.13
     - 78.84
     - **79.74**
     - 75.75
   * - OUI-Adience-Age
     - 63.20
     - **80.34**
     - 78.55
     - 76.83
     - 72.96
   * - ImageNet-10-1
     - 61.80
     - 77.07
     - 79.80
     - **80.48**
     - 77.20
   * - ImageNet-10-2
     - 37.20
     - 58.13
     - 56.47
     - 60.53
     - **61.20**
Unfortunately, we cannot reproduce all the results in the paper.
The best or average results reported in the paper:
.. list-table::
   :header-rows: 1
   :widths: auto

   * - NAS
     - AutoKeras (%)
     - ENAS (macro) (%)
     - ENAS (micro) (%)
     - DARTS (%)
     - NAO-WS (%)
   * - CIFAR-10
     - 88.56 (best)
     - 96.13 (best)
     - 97.11 (best)
     - 97.17 (average)
     - 96.47 (best)
AutoKeras has relatively worse performance across all datasets due to the randomness of its network morphism.

For ENAS, ENAS (macro) shows good results on OUI-Adience-Age and ENAS (micro) shows good results on CIFAR-10.

DARTS has good performance on some datasets but high variance on others. The difference among three benchmark runs can be up to 5.37% on OUI-Adience-Age and 4.36% on ImageNet-10-1.

NAO-WS shows good results on ImageNet-10-2 but can perform very poorly on OUI-Adience-Age.
Reference
---------
#. Jin, Haifeng, Qingquan Song, and Xia Hu. "Efficient neural architecture search with network morphism." *arXiv preprint arXiv:1806.10282* (2018).
The TPE approach was actually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete. For TPE, the so-called constant-liar approach was used: each time a candidate point x∗ was proposed, a fake fitness value y was assigned temporarily, until the evaluation completed and reported the actual loss f(x∗).
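The constant-liar loop can be sketched generically as follows (the function names are ours; ``suggest_fn`` stands in for the tuner's proposal step):

```python
# Sketch of the constant-liar trick for asynchronous suggestion: each pending
# point gets a fake ("lie") objective value until its real evaluation returns,
# so later proposals in the same batch avoid clustering on the same point.
def suggest_batch(suggest_fn, update_fn, history, q, lie):
    """suggest_fn(history) -> next point; update_fn(history, x, y) -> new history."""
    batch = []
    fake = list(history)
    for _ in range(q):
        x = suggest_fn(fake)
        fake = update_fn(fake, x, lie)  # temporarily believe the lie
        batch.append(x)
    return batch
```

Once a real evaluation of a batch point completes, the lie is replaced by the observed loss in the true history.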
Introduction and Problems
-------------------------
Sequential Model-based Global Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sequential Model-Based Global Optimization (SMBO) algorithms have been used in many applications where evaluation of the fitness function is expensive. In an application where the true fitness function f: X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or some transformation of the surrogate. The point x∗ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. This active-learning-like algorithm template is summarized in the figure below. SMBO algorithms differ in what criterion they optimize to obtain x∗ given a model (or surrogate) of f, and in how they model f via the observation history H.
.. image:: ../../img/parallel_tpe_search4.PNG
   :target: ../../img/parallel_tpe_search4.PNG
   :alt:
The algorithms in this work optimize the criterion of Expected Improvement (EI). Other criteria have been suggested, such as Probability of Improvement and Expected Improvement, minimizing the Conditional Entropy of the Minimizer, and the bandit-based criterion. We chose to use the EI criterion in TPE because it is intuitive, and has been shown to work well in a variety of settings. Expected improvement is the expectation under some model M of f : X → RN that f(x) will exceed (negatively) some threshold y∗:
.. image:: ../../img/parallel_tpe_search_ei.PNG
   :target: ../../img/parallel_tpe_search_ei.PNG
   :alt:
Since the calculation of p(y|x) is expensive, the TPE approach models p(y|x) by p(x|y) and p(y). TPE defines p(x|y) using two such densities:
.. image:: ../../img/parallel_tpe_search_tpe.PNG
   :target: ../../img/parallel_tpe_search_tpe.PNG
   :alt:
where l(x) is the density formed by using the observations {x(i)} such that the corresponding loss f(x(i)) was less than y∗, and g(x) is the density formed by using the remaining observations. The TPE algorithm depends on a y∗ that is larger than the best observed f(x) so that some points can be used to form l(x). The TPE algorithm chooses y∗ to be some quantile γ of the observed y values, so that p(y<\ ``y∗``\ ) = γ, but no specific model for p(y) is necessary. The tree-structured form of l and g makes it easy to draw many candidates according to l and evaluate them according to g(x)/l(x). On each iteration, the algorithm returns the candidate x∗ with the greatest EI.
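A one-dimensional sketch of this selection step, assuming normal densities for l and g (the real TPE uses adaptive Parzen estimators; this toy version is only meant to show the l(x)/g(x) mechanics):

```python
# Toy 1-D TPE selection step: split history at quantile gamma, model the "good"
# and "bad" x densities with normals, draw candidates from l, and return the
# candidate maximizing l(x)/g(x) (equivalent to maximizing EI under TPE).
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    sigma = max(sigma, 1e-6)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def tpe_suggest(xs, ys, gamma=0.25, n_candidates=24, rng=None):
    rng = rng or random.Random(0)
    order = sorted(range(len(xs)), key=lambda i: ys[i])   # lower loss = better
    n_good = max(1, int(gamma * len(xs)))
    good = [xs[i] for i in order[:n_good]]                # observations below y*
    bad = [xs[i] for i in order[n_good:]] or good         # the rest
    mu_l, sd_l = statistics.mean(good), statistics.pstdev(good) or 1.0
    mu_g, sd_g = statistics.mean(bad), statistics.pstdev(bad) or 1.0
    cands = [rng.gauss(mu_l, sd_l) for _ in range(n_candidates)]
    return max(cands, key=lambda c: normal_pdf(c, mu_l, sd_l) / normal_pdf(c, mu_g, sd_g))
```

With low-loss observations clustered near 0 and high-loss ones near 10, the suggestion lands near the good cluster.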
Here is a simulation of the TPE algorithm in a two-dimensional search space. The difference in background color represents different values. It can be seen that TPE combines exploration and exploitation very well. (Black indicates the points sampled in this round, and yellow indicates the points taken in the history.)
.. image:: ../../img/parallel_tpe_search1.gif
   :target: ../../img/parallel_tpe_search1.gif
   :alt:
**Since EI is a continuous function, the x with the highest EI is determined at a given state.** As shown in the figure below, the blue triangle is the point that is most likely to be sampled in this state.
.. image:: ../../img/parallel_tpe_search_ei2.PNG
   :target: ../../img/parallel_tpe_search_ei2.PNG
   :alt:
TPE performs well when run sequentially, but with larger concurrency **a large number of points will be produced from the same EI state**; overly concentrated points reduce the tuner's exploration ability and waste resources.
Here is the simulation when we set ``concurrency=60``; the phenomenon is obvious.
.. image:: ../../img/parallel_tpe_search2.gif
:target: ../../img/parallel_tpe_search2.gif
:alt:
Research solution
-----------------
Approximated q-EI Maximization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The multi-points criterion presented below can be used to deliver an additional design of experiments in one step, through the resolution of the following optimization problem:
.. image:: ../../img/parallel_tpe_search_qEI.PNG
:target: ../../img/parallel_tpe_search_qEI.PNG
:alt:
However, the computation of q-EI becomes intensive as q increases. There are four popular greedy strategies that approximate the solution of this problem while avoiding its numerical cost; we describe two of them below.
Solution 1: Believing the OK (Ordinary Kriging) Predictor: the KB (Kriging Believer) Heuristic Strategy
The Kriging Believer strategy replaces the conditional knowledge about the responses at the sites chosen within the last iterations by deterministic values equal to the expectation of the Kriging predictor. Keeping the same notations as previously, the strategy can be summed up as follows:
.. image:: ../../img/parallel_tpe_search_kb.PNG
:target: ../../img/parallel_tpe_search_kb.PNG
:alt:
This sequential strategy delivers a q-points design and is computationally affordable since it relies on the analytically known EI, optimized in d dimensions. However, there is a risk of failure, since believing an OK predictor that overshoots the observed data may lead to a sequence that gets trapped in a non-optimal region for many iterations. We now propose a second strategy that reduces this risk.
Solution 2: The CL (Constant Liar) Heuristic Strategy
Let us now consider a sequential strategy in which the metamodel is updated at each iteration (still without hyperparameter re-estimation) with a value L exogenously fixed by the user, here called a "lie". The strategy referred to as the Constant Liar consists in lying with the same value L at every iteration: maximize EI (i.e., find xn+1), actualize the model as if y(xn+1) = L, and so on, always with the same L ∈ R:
.. image:: ../../img/parallel_tpe_search_cl.PNG
:target: ../../img/parallel_tpe_search_cl.PNG
:alt:
L should logically be determined on the basis of the values taken by y at X. Three values, min{Y}, mean{Y}, and max{Y} are considered here. **The larger L is, the more explorative the algorithm will be, and vice versa.**
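The Constant Liar loop described above can be sketched in a few lines. The surrogate model and the EI maximizer are passed in as functions; the toy `fit`/`argmax_ei` in the usage below are stand-ins for illustration (a "surrogate" that just remembers sampled points and a maximizer that picks the grid point farthest from them), not a real Kriging model:

```python
def constant_liar(fit, argmax_ei, xs, ys, q, lie):
    """Select a batch of q points: after each EI maximization, pretend the new
    point returned the constant value `lie`, refit the surrogate, and repeat."""
    xs, ys = list(xs), list(ys)
    batch = []
    for _ in range(q):
        surrogate = fit(xs, ys)
        x_next = argmax_ei(surrogate)
        batch.append(x_next)
        xs.append(x_next)   # actualize the model as if y(x_next) == lie
        ys.append(lie)
    return batch

# toy stand-ins: "fit" remembers the sampled xs, and the "EI maximizer"
# picks the grid point farthest from every sampled x (pure exploration)
grid = [i / 10 for i in range(11)]
fit = lambda xs, ys: xs
argmax_ei = lambda sampled: max(grid, key=lambda g: min(abs(g - x) for x in sampled))
batch = constant_liar(fit, argmax_ei, [0.0, 1.0], [0.4, 0.8], q=3, lie=0.6)
```

Because a point already in the batch has distance zero to itself, the toy maximizer never repeats a point, so the batch is spread out, which mirrors the scattering effect shown in the simulation below.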
We have simulated the method above. The following figure shows the result of using the mean-value lie to maximize q-EI; the sampled points are noticeably more scattered.
.. image:: ../../img/parallel_tpe_search3.gif
:target: ../../img/parallel_tpe_search3.gif
:alt:
Experiment
----------
Branin-Hoo
^^^^^^^^^^
The optimization strategies presented in the last section are now compared on the Branin-Hoo function, a classical test case in global optimization.
The recommended values of a, b, c, r, s and t are: a = 1, b = 5.1 ⁄ (4π²), c = 5 ⁄ π, r = 6, s = 10 and t = 1 ⁄ (8π). This function has three global minimizers: (-3.14, 12.27), (3.14, 2.27), (9.42, 2.47).
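With these recommended parameter values, the Branin-Hoo function can be written directly; all three minimizers attain the same minimum value ≈ 0.397887:

```python
import math

def branin(x1, x2):
    """Branin-Hoo test function with the recommended parameter values."""
    a, b, c = 1.0, 5.1 / (4 * math.pi ** 2), 5 / math.pi
    r, s, t = 6.0, 10.0, 1 / (8 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * math.cos(x1) + s
```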
Next is the comparison of the q-EI associated with the first q points (q ∈ [1,10]) given by the constant liar strategies (min and max), 2000 q-point designs uniformly drawn for every q, and 2000 q-point LHS designs taken at random for every q.
As can be seen in the figure, CL[max] and CL[min] offer very good q-EI results compared to random designs, especially for small values of q.
Gaussian Mixture Model function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We also compared using parallel optimization versus not using it. A two-dimensional multimodal Gaussian mixture distribution is used for the simulation; the following are our results:
.. list-table::
:header-rows: 1
:widths: auto
* -
- concurrency=80
- concurrency=60
- concurrency=40
- concurrency=20
- concurrency=10
* - Without parallel optimization
- avg = 0.4841 :raw-html:`<br>` var = 0.1953
- avg = 0.5155 :raw-html:`<br>` var = 0.2219
- avg = 0.5773 :raw-html:`<br>` var = 0.2570
- avg = 0.4680 :raw-html:`<br>` var = 0.1994
- avg = 0.2774 :raw-html:`<br>` var = 0.1217
* - With parallel optimization
- avg = 0.2132 :raw-html:`<br>` var = 0.0700
- avg = 0.2177 :raw-html:`<br>` var = 0.0796
- avg = 0.1835 :raw-html:`<br>` var = 0.0533
- avg = 0.1671 :raw-html:`<br>` var = 0.0413
- avg = 0.1918 :raw-html:`<br>` var = 0.0697
Note: The total number of samples per test is 240 (to ensure equal budgets). Each configuration was repeated 1000 times; the values are the average and variance of the best results over the 1000 trials.
References
----------
[1] James Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl. "Algorithms for Hyper-Parameter Optimization". `Link <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__
[2] Christopher M. Bishop. "Pattern Recognition and Machine Learning". Springer, 2006. `Link <http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf>`__
In this tutorial, we first introduce the GitHub repository `Recommenders <https://github.com/Microsoft/Recommenders>`__. It provides examples and best practices for building recommendation systems, delivered as Jupyter notebooks, and covers many models that are popular and widely deployed in recommendation systems. To provide a complete end-to-end experience, each example is presented through five key tasks, as shown below:
* `Prepare Data <https://github.com/Microsoft/Recommenders/blob/master/notebooks/01_prepare_data/README.rst>`__\ : Preparing and loading data for each recommender algorithm.
* `Model <https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/README.rst>`__\ : Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (\ `ALS <https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS>`__\ ) or eXtreme Deep Factorization Machines (\ `xDeepFM <https://arxiv.org/abs/1803.05170>`__\ ).
* `Evaluate <https://github.com/Microsoft/Recommenders/blob/master/notebooks/03_evaluate/README.rst>`__\ : Evaluating algorithms with offline metrics.
* `Model Select and Optimize <https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/README.rst>`__\ : Tuning and optimizing hyperparameters for recommender models.
* `Operationalize <https://github.com/Microsoft/Recommenders/blob/master/notebooks/05_operationalize/README.rst>`__\ : Operationalizing models in a production environment on Azure.
The fourth task, tuning and optimizing the model's hyperparameters, is where NNI can help. To give a concrete example of NNI tuning a model in Recommenders, let's demonstrate with the `SVD <https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb>`__ model and the Movielens100k data. There are more than 10 hyperparameters to be tuned in this model.
`This Jupyter notebook <https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb>`__ provided by Recommenders is a very detailed step-by-step tutorial for this example. It uses different built-in tuning algorithms in NNI, including ``Annealing``\ , ``SMAC``\ , ``Random Search``\ , ``TPE``\ , ``Hyperband``\ , ``Metis`` and ``Evolution``\ , and finally compares the results of the different tuning algorithms. Please go through this notebook to learn how to use NNI to tune the SVD model; you can then use NNI to tune other models in Recommenders.
`SPTAG <https://github.com/microsoft/SPTAG>`__ (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenario released by `Microsoft Research (MSR) <https://www.msra.cn/>`__ and `Microsoft Bing <https://www.bing.com/>`__.
This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 or cosine distance. The vectors returned for a query vector are those that have the smallest L2 or cosine distance to it.
SPTAG provides two methods: kd-tree with relative neighborhood graph (SPTAG-KDT) and balanced k-means tree with relative neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index-building cost, while SPTAG-BKT is advantageous in search accuracy for very high-dimensional data.
In SPTAG, there are tens of parameters that can be tuned for specific scenarios or datasets. NNI is a great tool for automatically tuning those parameters. The authors of SPTAG tried NNI for auto-tuning and easily found well-performing parameters, so they shared their practice of tuning SPTAG with NNI in their documentation `here <https://github.com/microsoft/SPTAG/blob/master/docs/Parameters.rst>`__. Please refer to it for a detailed tutorial.
The 'default' op_type stands for the module types defined in :githublink:`default_layers.py <src/sdk/pynni/nni/compression/pytorch/default_layers.py>` for PyTorch.
Therefore ``{ 'sparsity': 0.8, 'op_types': ['default'] }`` means that **all layers with the specified op_types will be compressed with the same 0.8 sparsity**. When ``pruner.compress()`` is called, the model is compressed with masks; after that you can fine-tune the model normally, and the **pruned weights won't be updated** because they are masked.
Then, make this automatic
-------------------------
The previous example manually chose LevelPruner and pruned all layers with the same sparsity. This is obviously sub-optimal because different layers may have different redundancy. Layer sparsity should be carefully tuned to minimize model performance degradation, and this can be done with NNI tuners.
The first thing we need to do is design a search space. Here we use a nested search space that covers both the choice of pruning algorithm and the per-layer sparsities.
.. code-block:: json
{
"prune_method": {
"_type": "choice",
"_value": [
{
"_name": "agp",
"conv0_sparsity": {
"_type": "uniform",
"_value": [
0.1,
0.9
]
},
"conv1_sparsity": {
"_type": "uniform",
"_value": [
0.1,
0.9
]
}
},
{
"_name": "level",
"conv0_sparsity": {
"_type": "uniform",
"_value": [
0.1,
0.9
]
},
"conv1_sparsity": {
"_type": "uniform",
"_value": [
0.01,
0.9
]
}
}
]
}
}
Then we need to modify our code with a few lines:
.. code-block:: python
import nni
from nni.algorithms.compression.pytorch.pruning import *
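After fetching a sampled point with ``nni.get_next_parameter()``, the nested parameters must be turned into a pruner configuration. The sketch below only shows the shape of that transformation; the layer names ``conv0``/``conv1`` and the use of ``op_names`` are assumptions for illustration, not the exact code of the example:

```python
# Hypothetical shape of the dict that nni.get_next_parameter() would return
# for the nested search space above.
def build_pruner_args(params):
    """Turn one sampled point of the nested space into (algorithm, config_list)."""
    method = params["prune_method"]
    config_list = [
        {"sparsity": method["conv0_sparsity"], "op_names": ["conv0"]},
        {"sparsity": method["conv1_sparsity"], "op_names": ["conv1"]},
    ]
    return method["_name"], config_list

algo, config_list = build_pruner_args(
    {"prune_method": {"_name": "level", "conv0_sparsity": 0.5, "conv1_sparsity": 0.3}}
)
# `algo` then selects AGPPruner or LevelPruner, and `config_list` is passed to it
```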
We provide several easy-to-use tools for users to analyze their model during model compression.
Sensitivity Analysis
--------------------
First, we provide a sensitivity analysis tool (\ **SensitivityAnalysis**\ ) for users to analyze the sensitivity of each convolutional layer in their model. Specifically, SensitivityAnalysis gradually prunes each layer of the model and tests the model's accuracy at the same time. Note that SensitivityAnalysis prunes only one layer at a time; the other layers keep their original weights. According to the accuracies of the convolutional layers under different sparsities, we can easily find out which layers the model accuracy is most sensitive to.
Usage
^^^^^
The following codes show the basic usage of the SensitivityAnalysis.
.. code-block:: python
from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis

# `model` and `val_loader` are assumed to be defined beforehand
def val(model):
    model.eval()
    total = 0
    correct = 0
    with torch.no_grad():
        for batchid, (data, label) in enumerate(val_loader):
            data, label = data.cuda(), label.cuda()
            out = model(data)
            _, predicted = out.max(1)
            total += data.size(0)
            correct += predicted.eq(label).sum().item()
    return correct / total

s_analyzer = SensitivityAnalysis(model=model, val_func=val)
sensitivity = s_analyzer.analysis(val_args=[model])
s_analyzer.export('sensitivity.csv')
Two key parameters of SensitivityAnalysis are ``model`` and ``val_func``. ``model`` is the neural network to be analyzed and ``val_func`` is the validation function that returns the model accuracy/loss or other metrics on the validation dataset. Since different scenarios may calculate the loss/accuracy differently, users should prepare a function that returns the model accuracy/loss on the dataset and pass it to SensitivityAnalysis.
SensitivityAnalysis can export the sensitivity results as a CSV file; the usage is shown in the example above.
Furthermore, users can specify the sparsity values used to prune each layer through the optional parameter ``sparsities``.
For example, if ``sparsities`` is set to ``[0.25, 0.5, 0.75]``, SensitivityAnalysis will gradually prune 25%, 50%, and 75% of the weights of each layer and record the model's accuracy at the same time (SensitivityAnalysis prunes only one layer at a time; the other layers keep their original weights). If ``sparsities`` is not set, SensitivityAnalysis uses ``numpy.arange(0.1, 1.0, 0.1)`` as the default sparsity values.
Users can also speed up the sensitivity analysis with the ``early_stop_mode`` and ``early_stop_value`` options. By default, SensitivityAnalysis tests the accuracy under all sparsities for each layer. When ``early_stop_mode`` and ``early_stop_value`` are set, the analysis for a layer stops as soon as the accuracy/loss meets the threshold set by ``early_stop_value``. We support four early stop modes:

* minimize: the analysis stops when the validation metric returned by ``val_func`` is lower than ``early_stop_value``.
* maximize: the analysis stops when the validation metric returned by ``val_func`` is larger than ``early_stop_value``.
* dropped: the analysis stops when the validation metric has dropped by ``early_stop_value``.
* raised: the analysis stops when the validation metric has risen by ``early_stop_value``.
If users only want to analyze certain convolutional layers, they can specify the target layers through the ``specified_layers`` parameter of the analysis function. ``specified_layers`` is a list of the PyTorch module names of the conv layers, for example ``specified_layers=['Conv1']``.
In that case, only the ``Conv1`` layer is analyzed. In addition, users can easily parallelize the analysis by launching multiple processes and assigning different conv layers of the same model to each process.
Output example
^^^^^^^^^^^^^^
The following lines describe the CSV file exported from SensitivityAnalysis. The first line consists of 'layername' followed by the list of sparsity values; here a sparsity value means how much weight SensitivityAnalysis prunes from the layer. Each subsequent line records the model accuracy when the corresponding layer is pruned under the different sparsities. Note that, due to the early-stop option, some layers may not have accuracies/losses under all sparsities, for example when the accuracy drop has already exceeded the threshold set by the user.
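For illustration only (the layer names and accuracy values below are invented, not real measurements), an exported file might look like this, where the ``layer1.0.conv1`` row stops early because its accuracy drop exceeded the early-stop threshold:

```
layername,0.1,0.2,0.3,0.4,0.5
conv1,0.93,0.91,0.88,0.82,0.74
layer1.0.conv1,0.92,0.85,0.71
layer1.0.conv2,0.93,0.92,0.90,0.87,0.80
```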
We also provide several tools for topology analysis during model compression; these tools help users compress their models better. Because of the complex topology of networks, users often need to spend a lot of effort checking whether their compression configuration is reasonable, so we provide these topology analysis tools to reduce that burden.
ChannelDependency
^^^^^^^^^^^^^^^^^
Complicated models may have residual connection/concat operations. When users prune these models, they need to be careful about the channel-count dependencies between the convolutional layers in the model. Take the following residual block in ResNet18 as an example: the output features of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` are added together, so the numbers of output channels of these two layers should be the same, or there may be a tensor shape conflict.
If layers that have a channel dependency are assigned different sparsities (here we only discuss the structured pruning by L1FilterPruner/L2FilterPruner), there will be a shape conflict between these layers. Even if the pruned model with masks works fine, the pruned model cannot directly be sped up to the final model that runs on devices, because there will be a shape conflict when the model tries to add/concat the outputs of these layers. This tool finds the layers that have channel-count dependencies to help users better prune their model.
Usage
^^^^^
.. code-block:: python
from nni.compression.pytorch.utils.shape_dependency import ChannelDependency
import torch
import torchvision

net = torchvision.models.resnet18().cuda()
data = torch.ones(1, 3, 224, 224).cuda()
channel_depen = ChannelDependency(net, data)
channel_depen.export('dependency.csv')
Output Example
^^^^^^^^^^^^^^
The following lines are an output example for torchvision.models.resnet18 exported by ChannelDependency. The layers on the same line have output channel dependencies on each other. For example, layer1.1.conv2, conv1, and layer1.0.conv2 have output channel dependencies on each other, which means the output channel (filter) numbers of these three layers should be the same; otherwise, the model may have a shape conflict.
.. code-block:: bash
Dependency Set,Convolutional Layers
Set 1,layer1.1.conv2,layer1.0.conv2,conv1
Set 2,layer1.0.conv1
Set 3,layer1.1.conv1
Set 4,layer2.0.conv1
Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
Set 6,layer2.1.conv1
Set 7,layer3.0.conv1
Set 8,layer3.0.downsample.0,layer3.1.conv2,layer3.0.conv2
Set 9,layer3.1.conv1
Set 10,layer4.0.conv1
Set 11,layer4.0.downsample.0,layer4.1.conv2,layer4.0.conv2
Set 12,layer4.1.conv1
MaskConflict
^^^^^^^^^^^^
When the masks of different layers in a model conflict (for example, when different sparsities are assigned to layers that have a channel dependency), we can fix the mask conflict with MaskConflict. Specifically, MaskConflict loads the masks exported by the pruners (L1FilterPruner, etc.), checks whether there is a mask conflict, and, if so, sets the conflicting masks to the same value.
.. code-block:: python
from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict
# `net` is the model and `data` a dummy input; the first argument is the
# path of the masks exported by the pruner
fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating the FLOPs/parameters of a normal model without masks, and it can also calculate the FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression on NNI. Note that, for structured pruning, we only identify the remaining filters according to the mask, without taking the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
We support two modes for collecting module information. The first mode is ``default``\ , which only collects the information of convolution and linear layers. The second mode is ``full``\ , which also collects the information of other operations. Users can easily use the collected ``results`` for further analysis.
Usage
^^^^^
.. code-block:: python
from nni.compression.pytorch.utils.counter import count_flops_params
flops, params, results = count_flops_params(model, (1, 3, 224, 224))  # dummy input shape
In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface that covers both pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then how to customize a new quantization algorithm.
**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. Reference `Framework overview of model compression </Compression/Framework.html>`__
Customize a new pruning algorithm
---------------------------------
Implementing a new pruning algorithm requires implementing a ``weight masker`` class, which should be a subclass of ``WeightMasker``\ , and a ``pruner`` class, which should be a subclass of ``Pruner``.
An implementation of ``weight masker`` may look like this:
.. code-block:: python
class MyMasker(WeightMasker):
    def __init__(self, model, pruner):
        super().__init__(model, pruner)
        # You can do some initialization here, such as collecting some statistics
        # data, if it is necessary for your algorithm to calculate the masks.

    def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
        # calculate the masks based on the wrapper.weight, and sparsity,
        # and anything else
        # mask = ...
        return {'weight_mask': mask}
You can reference the nni-provided :githublink:`weight masker <src/sdk/pynni/nni/compression/pytorch/pruning/structured_pruning.py>` implementations to implement your own weight masker.
Reference the nni-provided :githublink:`pruner <src/sdk/pynni/nni/compression/pytorch/pruning/one_shot.py>` implementations to implement your own pruner class.
----
Customize a new quantization algorithm
--------------------------------------
To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer`` and override its member functions with the logic of your algorithm. The key member function to override is ``quantize_weight``\ , which directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
.. code-block:: python
from nni.compression.pytorch import Quantizer
class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        """
        Suggest you to use the NNI defined spec for config
        """
        super().__init__(model, config_list)

    def quantize_input(self, inputs, config, **kwargs):
        """
        quantize should overload this method to quantize input.
        This method is effectively hooked to :meth:`forward` of the model.
        Parameters
        ----------
        inputs : Tensor
            inputs that needs to be quantized
        config : dict
            the configuration for inputs quantization
        """
        # Put your code to generate `new_input` here
        return new_input
def update_epoch(self, epoch_num):
pass
def step(self):
"""
Can do some processing based on the model or weights bound
in the func bind_model
"""
pass
Customize backward function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\ ; users can customize a backward function as follows:
.. code-block:: python
from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
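To make the idea concrete, here is a scalar toy sketch of the STE in plain Python (not the NNI ``QuantGrad`` API): the forward pass quantizes, while the backward pass pretends the quantization was the identity inside the clipping range:

```python
def ste_forward(x, scale=0.1):
    """Quantize x to the nearest point on a uniform grid of step `scale`."""
    return round(x / scale) * scale

def ste_backward(grad_output, x, lo=-1.0, hi=1.0):
    """Pass the gradient through unchanged inside [lo, hi], zero it outside."""
    return grad_output if lo <= x <= hi else 0.0
```

In a real PyTorch implementation this pair would live in the forward/backward of a custom autograd function; here the two functions just isolate the rule itself.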
Currently, we have several filter pruning algorithms for convolutional layers: FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner. These filter pruning algorithms prune each convolutional layer separately. While pruning a convolutional layer, the algorithm quantifies the importance of each filter based on some specific rule (such as the L1 norm) and prunes the less important filters.
As the `dependency analysis utils <./CompressionUtils.rst>`__ show, if the output channels of two convolutional layers (conv1, conv2) are added together, then these two conv layers have a channel dependency with each other (for more details, see `Compression Utils <./CompressionUtils.rst>`__\ ). Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
:target: ../../img/mask_conflict.jpg
:alt:
If we prune the first 50% of the output channels (filters) of conv1 and the last 50% of the output channels of conv2, then although both layers have 50% of their filters pruned, the speedup module still needs to pad zeros to align the output channels. In this case, we cannot harvest the speed benefit of model pruning.
To better gain the speed benefit of model pruning, we add a dependency-aware mode to the filter pruners. In the dependency-aware mode, the pruner prunes the model based not only on the L1 norm of each filter but also on the topology of the whole network architecture.
In the dependency-aware mode (\ ``dependency_aware`` is set to ``True``\ ), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
:target: ../../img/dependency-aware.jpg
:alt:
Take the dependency-aware mode of L1Filter Pruner as an example. Specifically, for each channel the pruner calculates the sum of the L1 norms across all the layers in the dependency set. Obviously, the number of channels that can actually be pruned from this dependency set is determined by the minimum sparsity among the layers in the set (denoted by ``min_sparsity``\ ). According to the L1 norm sums, the pruner prunes the same ``min_sparsity`` fraction of channels for all the layers. Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own per-channel L1 norms. For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively. In this case, the ``dependency-aware pruner`` will:
* First, prune the same 20% of channels for ``conv1`` and ``conv2`` according to the L1 norm sums of ``conv1`` and ``conv2``.
* Second, additionally prune 10% of the channels of ``conv1`` according to the L1 norm of each channel of ``conv1``.
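The two-step selection above can be sketched in plain Python. The per-channel norms below are invented for illustration, and this is not NNI's implementation; it only shows the mechanics of pruning shared channels by summed norm, then extra channels in the higher-sparsity layer by its own norms:

```python
def dependency_aware_prune(norms_a, norms_b, sparsity_a, sparsity_b):
    """Return (shared, extra): channel indices pruned in both layers by summed
    norm, then indices additionally pruned in the higher-sparsity layer."""
    n = len(norms_a)
    min_s, max_s = sorted((sparsity_a, sparsity_b))
    summed = [a + b for a, b in zip(norms_a, norms_b)]
    shared = sorted(range(n), key=lambda i: summed[i])[: round(min_s * n)]
    own = norms_a if sparsity_a >= sparsity_b else norms_b
    remaining = [i for i in range(n) if i not in shared]
    extra = sorted(remaining, key=lambda i: own[i])[: round(max_s * n) - len(shared)]
    return shared, extra

# two layers whose outputs are added share 10 channels; sparsities 0.3 and 0.2
shared, extra = dependency_aware_prune(
    [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 1.0],
    [0.2, 0.9, 0.4, 0.1, 0.6, 0.3, 0.8, 0.5, 0.7, 1.0],
    sparsity_a=0.3, sparsity_b=0.2)
```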
In addition, for convolutional layers that have more than one filter group, the ``dependency-aware pruner`` will also try to prune the same number of channels in each filter group. Overall, this pruner prunes the model according to the L1 norm of each filter while trying to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner can provide a better speed gain from model pruning.
Usage
-----
In this section, we show how to enable the dependency-aware mode for a filter pruner. Currently, only the one-shot pruners, such as FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner, support the dependency-aware mode.
To enable the dependency-aware mode for ``L1FilterPruner``\ :
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
pruner = L1FilterPruner(model, config_list, dependency_aware=True, dummy_input=torch.rand(1, 3, 224, 224))
To compare the performance of the pruner with and without the dependency-aware mode, we use L1FilterPruner to prune Mobilenet_v2 with the dependency-aware mode turned on and off. To simplify the experiment, we use uniform pruning, meaning we allocate the same sparsity to all convolutional layers in the model.
We trained a Mobilenet_v2 model on the cifar10 dataset and pruned the model based on this pretrained checkpoint. The following figure shows the accuracy and FLOPs of the models pruned by the different pruners.
.. image:: ../../img/mobilev2_l1_cifar.jpg
:target: ../../img/mobilev2_l1_cifar.jpg
:alt:
In the figure, ``Dependency-aware`` represents the L1FilterPruner with the dependency-aware mode enabled. ``L1 Filter`` is the normal ``L1FilterPruner`` without the dependency-aware mode, and ``No-Dependency`` means the pruner only prunes the layers that have no channel dependency with other layers. As we can see, with the dependency-aware mode enabled, the pruner achieves higher accuracy under the same FLOPs.
There are 3 major components/classes in the NNI model compression framework: ``Compressor``\ , ``Pruner`` and ``Quantizer``. Let's look at them in detail one by one:
Compressor
----------
Compressor is the base class for pruners and quantizers. It provides a unified interface so that pruners and quantizers can be used in the same way. For example, to use a pruner:
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
# load a pretrained model or train a model before using a pruner
pruner = LevelPruner(model, [{'sparsity': 0.8, 'op_types': ['default']}])
model = pruner.compress()
View :githublink:`example code <examples/model_compress>` for more information.
``Compressor`` class provides some utility methods for subclass and users:
Set wrapper attribute
^^^^^^^^^^^^^^^^^^^^^
Sometimes ``calc_mask`` must save some state data, therefore users can use ``set_wrappers_attribute`` API to register attribute just like how buffers are registered in PyTorch modules. These buffers will be registered to ``module wrapper``. Users can access these buffers through ``module wrapper``.
For example, a pruner can use ``set_wrappers_attribute`` to register a buffer ``if_calculated`` that serves as a flag indicating whether the mask of a layer has already been calculated.
Collect data during forward
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes users want to collect some data during the modules' ``forward`` method, for example, the mean value of the activation. This can be done by adding a customized collector to the module.
Pruner
------

A pruner receives ``model``\ , ``config_list`` and ``optimizer`` as arguments. It prunes the model per the ``config_list`` during the training loop by adding a hook on ``optimizer.step()``.
The ``Pruner`` class is a subclass of ``Compressor``\ , so it contains everything in the ``Compressor`` class plus some components specific to pruning:
Weight masker
^^^^^^^^^^^^^
A ``weight masker`` is the implementation of pruning algorithms, it can prune a specified layer wrapped by ``module wrapper`` with specified sparsity.
Pruning module wrapper
^^^^^^^^^^^^^^^^^^^^^^
A ``pruning module wrapper`` is a module containing:
#. the origin module
#. some buffers used by ``calc_mask``
#. a new forward method that applies masks before running the original forward method.
The reasons to use a ``module wrapper`` are:
#. some buffers are needed by ``calc_mask`` to calculate masks and these buffers should be registered in ``module wrapper`` so that the original modules are not contaminated.
#. a new ``forward`` method is needed to apply masks to weight before calling the real ``forward`` method.
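As a conceptual illustration of points 1 and 2 (plain Python with a made-up ``TinyLinear`` layer, not the NNI implementation), a wrapper that holds the mask buffer and applies it before the real ``forward`` might look like:

```python
class PruningModuleWrapper:
    def __init__(self, module):
        self.module = module
        # buffer used by calc_mask, registered on the wrapper so the
        # original module is not contaminated
        self.weight_mask = [1.0] * len(module.weight)

    def forward(self, x):
        # apply the mask to the weight before calling the real forward
        masked = [w * m for w, m in zip(self.module.weight, self.weight_mask)]
        return self.module.forward(x, weight=masked)

class TinyLinear:
    """A stand-in for a real layer: y = w . x"""
    def __init__(self, weight):
        self.weight = weight

    def forward(self, x, weight=None):
        w = self.weight if weight is None else weight
        return sum(wi * xi for wi, xi in zip(w, x))

wrapped = PruningModuleWrapper(TinyLinear([1.0, 2.0, 3.0]))
wrapped.weight_mask = [1.0, 0.0, 1.0]  # as if calc_mask pruned the 2nd weight
out = wrapped.forward([1.0, 1.0, 1.0])  # 1*1 + 2*0 + 3*1 = 4.0
```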
Pruning hook
^^^^^^^^^^^^
A pruning hook is installed on a pruner when the pruner is constructed; it is used to call the pruner's ``calc_mask`` method when ``optimizer.step()`` is invoked.
Quantization module wrapper
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each module/layer of the model to be quantized is wrapped by a quantization ``module wrapper``\ , which provides a new ``forward`` method to quantize the original module's weight, input, and output.
Quantization hook
^^^^^^^^^^^^^^^^^
A quantization hook is installed on a quantizer when it is constructed; it is called when ``optimizer.step()`` is invoked.
Quantization methods
^^^^^^^^^^^^^^^^^^^^
``Quantizer`` class provides following methods for subclass to implement quantization algorithms:
During multi-GPU training, buffers and parameters are copied to each GPU every time the ``forward`` method runs. If buffers and parameters are updated in the ``forward`` method, an ``in-place`` update is needed for the update to be effective.
Since ``calc_mask`` is called in the ``optimizer.step`` method, which happens after the ``forward`` method and happens only on one GPU, it supports multi-GPU naturally.