# Use NNI on Google Colab
NNI can easily run on the Google Colab platform. However, Colab doesn't expose its public IP and ports, so by default you cannot access NNI's Web UI on Colab. To solve this, you need reverse proxy software such as `ngrok` or `frp`. This tutorial shows how to use ngrok to access NNI's Web UI on Colab.
## How to Open NNI's Web UI on Google Colab
1. Install the required packages and software.
```
! pip install nni # install nni
! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip # download ngrok and unzip it
! unzip ngrok-stable-linux-amd64.zip
! mkdir -p nni_repo
! git clone https://github.com/microsoft/nni.git nni_repo/nni # clone NNI's official repo to get examples
```
2. Register an ngrok account [here](https://ngrok.com/), then connect to your account using your authtoken.
```
! ./ngrok authtoken <your-authtoken>
```
3. Start an NNI example on a port greater than 1024, then start ngrok with the same port. If you want to use a GPU, make sure `gpuNum` >= 1 in `config.yml`. Use `get_ipython()` to start ngrok, since the cell will hang if you run `! ngrok http 5000 &` directly.
```
! nnictl create --config nni_repo/nni/examples/trials/mnist-pytorch/config.yml --port 5000 &
get_ipython().system_raw('./ngrok http 5000 &')
```
4. Check the public URL.
```
! curl -s http://localhost:4040/api/tunnels # don't change the port number 4040
```
After step 4 you will see a URL like http://xxxx.ngrok.io; open it and you will find NNI's Web UI. Have fun :)
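If you prefer to extract the public URL programmatically instead of reading the raw JSON, here is a minimal Python sketch; it assumes ngrok's local inspection API returns a JSON object with a `tunnels` list whose entries contain a `public_url` field.
```python
import json
import urllib.request

# Query ngrok's local inspection API (the same endpoint as the curl command above)
# and print the first tunnel's public URL.
with urllib.request.urlopen('http://localhost:4040/api/tunnels') as resp:
    tunnels = json.loads(resp.read().decode())
print(tunnels['tunnels'][0]['public_url'])
```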
## Access Web UI with frp
frp is another reverse proxy tool with similar functionality. However, frp doesn't provide free public URLs, so you may need a server with a public IP to act as the frp server. See [here](https://github.com/fatedier/frp) to learn more about how to deploy frp.
# Neural Architecture Search Comparison
*Posted by Anonymous Author*
Train and compare NAS (Neural Architecture Search) models, including AutoKeras, DARTS, ENAS, and NAO.
Their source code links are below:
- Autokeras: [https://github.com/jhfjhfj1/autokeras](https://github.com/jhfjhfj1/autokeras)
- DARTS: [https://github.com/quark0/darts](https://github.com/quark0/darts)
- ENAS: [https://github.com/melodyguan/enas](https://github.com/melodyguan/enas)
- NAO: [https://github.com/renqianluo/NAO](https://github.com/renqianluo/NAO)
## Experiment Description
To avoid over-fitting to **CIFAR-10**, we also compare the models on five other datasets: Fashion-MNIST, CIFAR-100, OUI-Adience-Age, ImageNet-10-1 (a subset of ImageNet), and ImageNet-10-2 (another subset of ImageNet). Each ImageNet subset is built by sampling 10 different labels from ImageNet.
| Dataset | Training Size | Number of Classes | Descriptions |
| :----------------------------------------------------------- | ------------- | ---------------- | ------------------------------------------------------------ |
| [Fashion-MNIST](<https://github.com/zalandoresearch/fashion-mnist>) | 60,000 | 10 | T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. |
| [CIFAR-10](<https://www.cs.toronto.edu/~kriz/cifar.html>) | 50,000 | 10 | Airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. |
| [CIFAR-100](<https://www.cs.toronto.edu/~kriz/cifar.html>) | 50,000 | 100 | Similar to CIFAR-10 but with 100 classes and 600 images each. |
| [OUI-Adience-Age](<https://talhassner.github.io/home/projects/Adience/Adience-data.html>) | 26,580 | 8 | 8 age groups/labels (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-). |
| [ImageNet-10-1](<http://www.image-net.org/>) | 9,750 | 10 | Coffee mug, computer keyboard, dining table, wardrobe, lawn mower, microphone, swing, sewing machine, odometer and gas pump. |
| [ImageNet-10-2](<http://www.image-net.org/>) | 9,750 | 10 | Drum, banjo, whistle, grand piano, violin, organ, acoustic guitar, trombone, flute and sax. |
We do not change the default fine-tuning technique in their source code; only the code for the input image shape and the number of output classes is adapted to each task.
The search phase for all NAS methods is limited to **two days**, as is the retraining phase. Average results are reported over **three repetitions**. Our evaluation machines have one NVIDIA Tesla P100 GPU, 112GB of RAM and one 2.60GHz CPU (Intel E5-2690).
NAO requires too many computing resources, so we only use NAO-WS, which provides the pipeline script.
For AutoKeras, we used version 0.2.18 because it was the latest version when we started the experiment.
## NAS Performance
| NAS | AutoKeras (%) | ENAS (macro) (%) | ENAS (micro) (%) | DARTS (%) | NAO-WS (%) |
| --------------- | :-----------: | :--------------: | :--------------: | :-------: | :--------: |
| Fashion-MNIST | 91.84 | 95.44 | 95.53 | **95.74** | 95.20 |
| CIFAR-10 | 75.78 | 95.68 | **96.16** | 94.23 | 95.64 |
| CIFAR-100 | 43.61 | 78.13 | 78.84 | **79.74** | 75.75 |
| OUI-Adience-Age | 63.20 | **80.34** | 78.55 | 76.83 | 72.96 |
| ImageNet-10-1 | 61.80 | 77.07 | 79.80 | **80.48** | 77.20 |
| ImageNet-10-2 | 37.20 | 58.13 | 56.47 | 60.53 | **61.20** |
Unfortunately, we could not reproduce all the results reported in the papers.
The best or average results reported in the papers are:
| NAS | AutoKeras(%) | ENAS (macro) (%) | ENAS (micro) (%) | DARTS (%) | NAO-WS (%) |
| --------- | ------------ | :--------------: | :--------------: | :------------: | :---------: |
| CIFAR- 10 | 88.56(best) | 96.13(best) | 97.11(best) | 97.17(average) | 96.47(best) |
AutoKeras has relatively poor performance across all datasets due to the randomness of its network morphism.
For ENAS, ENAS (macro) shows good results on OUI-Adience-Age and ENAS (micro) shows good results on CIFAR-10.
DARTS performs well on some datasets but shows high variance on others: the difference among three benchmark runs can be up to 5.37% on OUI-Adience-Age and 4.36% on ImageNet-10-1.
NAO-WS shows good results on ImageNet-10-2 but can perform very poorly on OUI-Adience-Age.
## Reference
1. Jin, Haifeng, Qingquan Song, and Xia Hu. "Efficient neural architecture search with network morphism." *arXiv preprint arXiv:1806.10282* (2018).
2. Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "Darts: Differentiable architecture search." arXiv preprint arXiv:1806.09055 (2018).
3. Pham, Hieu, et al. "Efficient Neural Architecture Search via Parameters Sharing." international conference on machine learning (2018): 4092-4101.
4. Luo, Renqian, et al. "Neural Architecture Optimization." neural information processing systems (2018): 7827-7838.
# Parallelizing a Sequential Algorithm TPE
TPE approaches were actually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete. For the TPE approach, the so-called Constant Liar approach was used: each time a candidate point x∗ was proposed, a fake fitness value y was assigned temporarily, until the evaluation completed and reported the actual loss f(x∗).
## Introduction and Problems
### Sequential Model-based Global Optimization
Sequential Model-Based Global Optimization (SMBO) algorithms have been used in many applications where evaluation of the fitness function is expensive. In an application where the true fitness function f: X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or of some transformation of it. The point x∗ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. This active-learning-like algorithm template is summarized in the figure below. SMBO algorithms differ in the criterion they optimize to obtain x∗ given a model (or surrogate) of f, and in how they model f via the observation history H.
![](../../img/parallel_tpe_search4.PNG)
The algorithms in this work optimize the criterion of Expected Improvement (EI). Other criteria have been suggested, such as Probability of Improvement, minimizing the Conditional Entropy of the Minimizer, and the bandit-based criterion. We chose the EI criterion for TPE because it is intuitive and has been shown to work well in a variety of settings. Expected improvement is the expectation under some model M of f: X → R^N that f(x) will exceed (negatively) some threshold y∗:
![](../../img/parallel_tpe_search_ei.PNG)
Since direct calculation of p(y|x) is expensive, the TPE approach models p(y|x) through p(x|y) and p(y). TPE defines p(x|y) using two densities:
![](../../img/parallel_tpe_search_tpe.PNG)
where l(x) is the density formed by the observations {x(i)} whose corresponding loss f(x(i)) was less than y∗, and g(x) is the density formed by the remaining observations. The TPE algorithm depends on a y∗ that is larger than the best observed f(x) so that some points can be used to form l(x). TPE chooses y∗ to be some quantile γ of the observed y values, so that p(y < y∗) = γ, but no specific model for p(y) is necessary. The tree-structured form of l and g makes it easy to draw many candidates according to l and evaluate them according to g(x)/l(x). On each iteration, the algorithm returns the candidate x∗ with the greatest EI.
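To make the l(x)/g(x) mechanism concrete, here is a minimal, self-contained Python sketch (not NNI's implementation) for a one-dimensional search space: it splits the observation history at the γ quantile, fits kernel density estimates for l and g, and returns the candidate with the largest l(x)/g(x) ratio, which under the TPE model is equivalent to maximizing EI. It assumes there are at least a few observations on both sides of y∗.
```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_propose(history_x, history_y, gamma=0.25, n_candidates=64):
    """Propose the next point from a 1-D observation history (illustrative sketch only)."""
    history_x, history_y = np.asarray(history_x), np.asarray(history_y)
    # y* is the gamma quantile of the observed losses.
    y_star = np.quantile(history_y, gamma)
    good = history_x[history_y < y_star]    # observations that form l(x)
    bad = history_x[history_y >= y_star]    # remaining observations that form g(x)
    l, g = gaussian_kde(good), gaussian_kde(bad)
    # Draw candidates from l and keep the one that maximizes l(x)/g(x).
    candidates = l.resample(n_candidates).ravel()
    scores = l(candidates) / np.maximum(g(candidates), 1e-12)
    return candidates[np.argmax(scores)]
```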
Here is a simulation of the TPE algorithm in a two-dimensional search space. The background color represents different objective values. It can be seen that TPE combines exploration and exploitation very well. (Black indicates the points sampled in this round, and yellow indicates the points taken in previous rounds.)
![](../../img/parallel_tpe_search1.gif)
**Since the EI function is fully determined by the current observation history, the x that maximizes EI is fixed for a given state.** As shown in the figure below, the blue triangle is the point that is most likely to be sampled in this state.
![](../../img/parallel_tpe_search_ei2.PNG)
TPE performs well when used sequentially, but with a larger concurrency **a large number of points are produced from the same EI state**; such overly concentrated points reduce the exploration ability of the tuner and waste resources.
Here is the simulation when we set `concurrency=60`; the phenomenon is obvious.
![](../../img/parallel_tpe_search2.gif)
## Research solution
### Approximated q-EI Maximization
The multi-point criterion presented below can potentially be used to deliver an additional design of experiments in one step through the resolution of the following optimization problem:
![](../../img/parallel_tpe_search_qEI.PNG)
However, the computation of q-EI becomes intensive as q increases. Based on our research, there are several popular greedy strategies that approximate the solution of this problem while avoiding its numerical cost.
#### Solution 1: Believing the OK Predictor: The KB (Kriging Believer) Heuristic Strategy
The Kriging Believer strategy replaces the conditional knowledge about the responses at the sites chosen within the last iterations by deterministic values equal to the expectation of the Kriging predictor. Keeping the same notations as previously, the strategy can be summed up as follows:
![](../../img/parallel_tpe_search_kb.PNG)
This sequential strategy delivers a q-points design and is computationally affordable since it relies on the analytically known EI, optimized in d dimensions. However, there is a risk of failure, since believing an OK predictor that overshoots the observed data may lead to a sequence that gets trapped in a non-optimal region for many iterations. We now propose a second strategy that reduces this risk.
#### Solution 2: The CL (Constant Liar) Heuristic Strategy
Let us now consider a sequential strategy in which the metamodel is updated (still without hyperparameter re-estimation) at each iteration with a value L exogenously fixed by the user, here called a "lie". The strategy referred to as the Constant Liar consists of lying with the same value L at every iteration: maximize EI (i.e., find x(n+1)), update the model as if y(x(n+1)) = L, and so on, always with the same L ∈ R:
![](../../img/parallel_tpe_search_cl.PNG)
L should logically be determined based on the values taken by y at X. Three values, min{Y}, mean{Y}, and max{Y}, are considered here. **The larger L is, the more explorative the algorithm will be, and vice versa.**
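The following is a minimal Python sketch of the Constant Liar loop; `fit_surrogate` and `maximize_ei` are passed in as callables because they stand for the tuner's own surrogate fitting and EI maximization, which are not spelled out here.
```python
import numpy as np

def constant_liar_batch(X, Y, q, fit_surrogate, maximize_ei, lie="mean"):
    """Select q points to evaluate in parallel with the Constant Liar heuristic (sketch)."""
    L = {"min": np.min, "mean": np.mean, "max": np.max}[lie](Y)  # the constant "lie"
    X_aug, Y_aug, batch = list(X), list(Y), []
    for _ in range(q):
        model = fit_surrogate(X_aug, Y_aug)   # refit the surrogate on the augmented history
        x_next = maximize_ei(model)           # point with the highest EI under this surrogate
        batch.append(x_next)
        X_aug.append(x_next)
        Y_aug.append(L)                       # pretend y(x_next) == L until the real result arrives
    return batch
```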
We simulated the method above. The following figure shows the result of using the mean-value liar to maximize q-EI. The sampled points are now noticeably more scattered.
![](../../img/parallel_tpe_search3.gif)
## Experiment
### Branin-Hoo
The optimization strategies presented in the last section are now compared on the Branin-Hoo function, a classical test case in global optimization.
![](../../img/parallel_tpe_search_branin.PNG)
The recommended values of a, b, c, r, s, and t are: a = 1, b = 5.1/(4π²), c = 5/π, r = 6, s = 10, and t = 1/(8π). This function has three global minimizers: (-3.14, 12.27), (3.14, 2.27), and (9.42, 2.47).
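For reference, a direct Python transcription of the Branin-Hoo function with the recommended constants (the three minimizers quoted above all evaluate to roughly 0.398):
```python
import numpy as np

def branin_hoo(x1, x2):
    """Branin-Hoo test function with the recommended constants."""
    a, b, c = 1.0, 5.1 / (4.0 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8.0 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * np.cos(x1) + s

# The three global minimizers all reach roughly the same minimum value.
for point in [(-np.pi, 12.275), (np.pi, 2.275), (9.42478, 2.475)]:
    print(point, round(branin_hoo(*point), 4))
```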
Next is a comparison of the q-EI associated with the first q points (q ∈ [1,10]) given by the Constant Liar strategies (min and max), 2000 q-point designs uniformly drawn for every q, and 2000 q-point LHS designs taken at random for every q.
![](../../img/parallel_tpe_search_result.PNG)
As can be seen in the figure, CL[max] and CL[min] offer very good q-EI results compared to random designs, especially for small values of q.
### Gaussian Mixture Model function
We also compared using parallel optimization against not using it. A two-dimensional multimodal Gaussian mixture distribution is used for the simulation; the results are as follows:
||concurrency=80|concurrency=60|concurrency=40|concurrency=20|concurrency=10|
|---|---|---|---|---|---|
|Without parallel optimization|avg = 0.4841 <br> var = 0.1953|avg = 0.5155 <br> var = 0.2219|avg = 0.5773 <br> var = 0.2570|avg = 0.4680 <br> var = 0.1994| avg = 0.2774 <br> var = 0.1217|
|With parallel optimization|avg = 0.2132 <br> var = 0.0700|avg = 0.2177<br>var = 0.0796| avg = 0.1835 <br> var = 0.0533|avg = 0.1671 <br> var = 0.0413|avg = 0.1918 <br> var = 0.0697|
Note: the total number of samples per test is 240 (to ensure equal budgets). Each configuration was repeated 1000 times; the reported values are the average and variance of the best results over the 1000 trials.
## References
[1] James Bergstra, Remi Bardenet, Yoshua Bengio, Balazs Kegl. "Algorithms for Hyper-Parameter Optimization". [Link](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)
[2] Meng-Hiot Lim, Yew-Soon Ong. "Computational Intelligence in Expensive Optimization Problems". [Link](https://link.springer.com/content/pdf/10.1007%2F978-3-642-10701-6.pdf)
[3] Christopher M. Bishop. "Pattern Recognition and Machine Learning". [Link](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)
# Automatically tuning SVD (NNI in Recommenders)
In this tutorial, we first introduce the GitHub repo [Recommenders](https://github.com/Microsoft/Recommenders). It provides examples and best practices for building recommendation systems, in the form of Jupyter notebooks, and covers various models that are popular and widely deployed in recommendation systems. To provide a complete end-to-end experience, each example is presented in five key tasks, as shown below:
- [Prepare Data](https://github.com/Microsoft/Recommenders/blob/master/notebooks/01_prepare_data/README.md): Preparing and loading data for each recommender algorithm.
- [Model](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/README.md): Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)) or eXtreme Deep Factorization Machines ([xDeepFM](https://arxiv.org/abs/1803.05170)).
- [Evaluate](https://github.com/Microsoft/Recommenders/blob/master/notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics.
- [Model Select and Optimize](https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/README.md): Tuning and optimizing hyperparameters for recommender models.
- [Operationalize](https://github.com/Microsoft/Recommenders/blob/master/notebooks/05_operationalize/README.md): Operationalizing models in a production environment on Azure.
The fourth task, tuning and optimizing a model's hyperparameters, is where NNI can help. To give a concrete example of NNI tuning a model in Recommenders, let's demonstrate with the [SVD](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb) model and the Movielens100k dataset. This model has more than 10 hyperparameters to tune.
[This Jupyter notebook](https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) provided by Recommenders is a detailed step-by-step tutorial for this example. It uses different built-in tuning algorithms in NNI, including `Annealing`, `SMAC`, `Random Search`, `TPE`, `Hyperband`, `Metis` and `Evolution`, and finally compares the results of the different tuning algorithms. Please go through this notebook to learn how to use NNI to tune the SVD model; you can then use NNI to tune other models in Recommenders.
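As a rough illustration of what a single trial in that notebook does, here is a hedged sketch: the hyperparameter names `n_factors`, `lr_all`, and `reg_all` come from the Surprise SVD implementation, while the data split and metric choice here are simplifications rather than the notebook's exact setup.
```python
import nni
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Receive one hyperparameter configuration from the NNI tuner.
params = nni.get_next_parameter()

# Movielens100k is available as a Surprise built-in dataset.
data = Dataset.load_builtin('ml-100k')
algo = SVD(n_factors=params['n_factors'],
           lr_all=params['lr_all'],
           reg_all=params['reg_all'])

# Report the mean RMSE of a 3-fold cross validation back to NNI.
scores = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)
nni.report_final_result(float(scores['test_rmse'].mean()))
```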
# Automatically tuning SPTAG with NNI
[SPTAG](https://github.com/microsoft/SPTAG) (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenario released by [Microsoft Research (MSR)](https://www.msra.cn/) and [Microsoft Bing](https://www.bing.com/).
This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 distance or cosine distance. The vectors returned for a query vector are those with the smallest L2 distance or cosine distance to the query vector.
SPTAG provides two methods: kd-tree and relative neighborhood graph (SPTAG-KDT) and balanced k-means tree and relative neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index building cost, and SPTAG-BKT is advantageous in search accuracy in very high-dimensional data.
In SPTAG, there are tens of parameters that can be tuned for specific scenarios or datasets. NNI is a great tool for automatically tuning those parameters. The authors of SPTAG tried NNI for auto tuning and found well-performing parameters easily, so they shared their practice of tuning SPTAG with NNI in their documentation [here](https://github.com/microsoft/SPTAG/blob/master/docs/Parameters.md). Please refer to it for a detailed tutorial.
# Automatic Model Pruning using NNI Tuners
It's convenient to implement automatic model pruning by combining NNI compression with NNI tuners.
## First, model compression with NNI
You can easily compress a model with NNI compression. Taking pruning as an example, you can prune a pretrained model with `LevelPruner` like this:
```python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(model, config_list)
pruner.compress()
```
The 'default' op_type stands for the module types defined in [default_layers.py](https://github.com/microsoft/nni/blob/v1.9/src/sdk/pynni/nni/compression/pytorch/default_layers.py) for PyTorch.
Therefore `{ 'sparsity': 0.8, 'op_types': ['default'] }` means that **all layers with the specified op_types will be compressed with the same 0.8 sparsity**. When `pruner.compress()` is called, the model is compressed with masks; after that you can fine-tune the model as usual, and the **pruned (masked) weights will not be updated**.
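After fine-tuning, the weights and masks can be exported for later use; a minimal sketch follows (the training loop and file names are placeholders, and `export_model` is the pruner API assumed here for NNI v1.x):
```python
# Fine-tune the masked model as usual (train_one_epoch is a placeholder for your own loop),
# then export the pruned weights and the binary masks.
for epoch in range(10):
    train_one_epoch(model, optimizer)
pruner.export_model(model_path='pruned_model.pth', mask_path='mask.pth')
```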
## Then, make this automatic
The previous example manually chose `LevelPruner` and pruned all layers with the same sparsity. This is obviously sub-optimal because different layers may have different redundancy. Per-layer sparsity should be carefully tuned to minimize performance degradation, and this can be done with NNI tuners.
The first thing we need to do is design a search space. Here we use a nested search space that covers both the choice of pruning algorithm and the per-layer sparsity.
```json
{
    "prune_method": {
        "_type": "choice",
        "_value": [
            {
                "_name": "agp",
                "conv0_sparsity": {
                    "_type": "uniform",
                    "_value": [0.1, 0.9]
                },
                "conv1_sparsity": {
                    "_type": "uniform",
                    "_value": [0.1, 0.9]
                }
            },
            {
                "_name": "level",
                "conv0_sparsity": {
                    "_type": "uniform",
                    "_value": [0.1, 0.9]
                },
                "conv1_sparsity": {
                    "_type": "uniform",
                    "_value": [0.01, 0.9]
                }
            }
        ]
    }
}
```
Then we need to modify a few lines of our trial code:
```python
import nni
from nni.algorithms.compression.pytorch.pruning import *
params = nni.get_next_parameter()
conv0_sparsity = params['prune_method']['conv0_sparsity']
conv1_sparsity = params['prune_method']['conv1_sparsity']
# these raw sparsity values should be scaled if you need the total sparsity constrained
config_list_level = [{'sparsity': conv0_sparsity, 'op_name': 'conv0'},
                     {'sparsity': conv1_sparsity, 'op_name': 'conv1'}]
config_list_agp = [{'initial_sparsity': 0, 'final_sparsity': conv0_sparsity,
                    'start_epoch': 0, 'end_epoch': 3,
                    'frequency': 1, 'op_name': 'conv0'},
                   {'initial_sparsity': 0, 'final_sparsity': conv1_sparsity,
                    'start_epoch': 0, 'end_epoch': 3,
                    'frequency': 1, 'op_name': 'conv1'}]
PRUNERS = {'level': LevelPruner(model, config_list_level), 'agp': AGPPruner(model, config_list_agp)}
pruner = PRUNERS[params['prune_method']['_name']]  # look up the chosen pruner in the dict
pruner.compress()
... # fine tuning
acc = evaluate(model) # evaluation
nni.report_final_result(acc)
```
Finally, define the experiment configuration to automatically tune the pruning method and the per-layer sparsity:
```yaml
authorName: default
experimentName: Auto_Compression
trialConcurrency: 2
maxExecDuration: 100h
maxTrialNum: 500
#choice: local, remote, pai
trainingServicePlatform: local
#choice: true, false
useAnnotation: False
searchSpacePath: search_space.json
tuner:
  #choice: TPE, Random, Anneal...
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: bash run_prune.sh
  codeDir: .
  gpuNum: 1
```
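With `search_space.json` and this configuration file saved (the file names here follow the convention used above), the experiment can be launched with `nnictl create --config <your-config-file>`, just like any other NNI experiment.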
# Python API Reference of Compression Utilities
```eval_rst
.. contents::
```
## Sensitivity Utilities
```eval_rst
.. autoclass:: nni.compression.pytorch.utils.sensitivity_analysis.SensitivityAnalysis
:members:
```
## Topology Utilities
```eval_rst
.. autoclass:: nni.compression.pytorch.utils.shape_dependency.ChannelDependency
:members:
.. autoclass:: nni.compression.pytorch.utils.shape_dependency.GroupDependency
:members:
.. autoclass:: nni.compression.pytorch.utils.mask_conflict.CatMaskPadding
:members:
.. autoclass:: nni.compression.pytorch.utils.mask_conflict.GroupMaskConflict
:members:
.. autoclass:: nni.compression.pytorch.utils.mask_conflict.ChannelMaskConflict
:members:
```
## Model FLOPs/Parameters Counter
```eval_rst
.. autofunction:: nni.compression.pytorch.utils.counter.count_flops_params
```
# Analysis Utils for Model Compression
```eval_rst
.. contents::
```
We provide several easy-to-use tools for users to analyze their model during model compression.
## Sensitivity Analysis
First, we provide a sensitivity analysis tool (**SensitivityAnalysis**) for users to analyze the sensitivity of each convolutional layer in their model. Specifically, SensitivityAnalysis gradually prunes each layer of the model and tests the model accuracy at the same time. Note that SensitivityAnalysis only prunes one layer at a time; the other layers keep their original weights. According to the accuracies of different convolutional layers under different sparsities, we can easily find out which layers the model accuracy is most sensitive to.
### Usage
The following code shows the basic usage of SensitivityAnalysis.
```python
import os
import torch

from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis

def val(model):
    model.eval()
    total = 0
    correct = 0
    with torch.no_grad():
        for batchid, (data, label) in enumerate(val_loader):
            data, label = data.cuda(), label.cuda()
            out = model(data)
            _, predicted = out.max(1)
            total += data.size(0)
            correct += predicted.eq(label).sum().item()
    return correct / total

s_analyzer = SensitivityAnalysis(model=net, val_func=val)
sensitivity = s_analyzer.analysis(val_args=[net])
os.makedirs(outdir, exist_ok=True)
s_analyzer.export(os.path.join(outdir, filename))
```
Two key parameters of SensitivityAnalysis are `model` and `val_func`. `model` is the neural network to be analyzed and `val_func` is a validation function that returns the model accuracy, loss, or another metric on the validation dataset. Because different scenarios may calculate the loss/accuracy differently, users should prepare a function that returns the model accuracy/loss on their dataset and pass it to SensitivityAnalysis.
SensitivityAnalysis can export the sensitivity results as a CSV file, as shown in the example above.
Furthermore, users can specify the sparsity values used to prune each layer via the optional parameter `sparsities`.
```python
s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75])
```
With this setting, SensitivityAnalysis will gradually prune 25%, 50%, and 75% of the weights of each layer and record the model accuracy at the same time (SensitivityAnalysis only prunes one layer at a time; the other layers keep their original weights). If `sparsities` is not set, SensitivityAnalysis uses `numpy.arange(0.1, 1.0, 0.1)` as the default sparsity values.
Users can also speed up sensitivity analysis with the `early_stop_mode` and `early_stop_value` options. By default, SensitivityAnalysis tests the accuracy under all sparsities for each layer. When `early_stop_mode` and `early_stop_value` are set, the analysis for a layer stops as soon as the accuracy/loss meets the threshold given by `early_stop_value`. We support four early stop modes: minimize, maximize, dropped, raised.
- minimize: The analysis stops when the validation metric returned by `val_func` is lower than `early_stop_value`.
- maximize: The analysis stops when the validation metric returned by `val_func` is larger than `early_stop_value`.
- dropped: The analysis stops when the validation metric has dropped by `early_stop_value`.
- raised: The analysis stops when the validation metric has risen by `early_stop_value`.
```python
s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75], early_stop_mode='dropped', early_stop_value=0.1)
```
If users only want to analyze several specific convolutional layers, they can specify the target layers via the `specified_layers` parameter of the `analysis` function. `specified_layers` is a list of the PyTorch module names of the conv layers. For example:
```python
sensitivity = s_analyzer.analysis(val_args=[net], specified_layers=['Conv1'])
```
In this example, only the `Conv1` layer is analyzed. In addition, users can easily parallelize the analysis by launching multiple processes and assigning different conv layers of the same model to each process.
### Output example
The following lines are an example CSV file exported from SensitivityAnalysis. The first line consists of 'layername' followed by the list of sparsities; each sparsity value means how much weight SensitivityAnalysis prunes for the layer. Each subsequent line records the model accuracy when the corresponding layer is pruned at the different sparsities. Note that, due to the early stop option, some layers may not have accuracy/loss values for all sparsities, for example, when the accuracy drop has already exceeded the threshold set by the user.
```
layername,0.05,0.1,0.2,0.3,0.4,0.5,0.7,0.85,0.95
features.0,0.54566,0.46308,0.06978,0.0374,0.03024,0.01512,0.00866,0.00492,0.00184
features.3,0.54878,0.51184,0.37978,0.19814,0.07178,0.02114,0.00438,0.00442,0.00142
features.6,0.55128,0.53566,0.4887,0.4167,0.31178,0.19152,0.08612,0.01258,0.00236
features.8,0.55696,0.54194,0.48892,0.42986,0.33048,0.2266,0.09566,0.02348,0.0056
features.10,0.55468,0.5394,0.49576,0.4291,0.3591,0.28138,0.14256,0.05446,0.01578
```
## Topology Analysis
We also provide several tools for topology analysis during model compression. These tools help users compress their models more effectively. Because of the complex topology of a network, users often need to spend a lot of effort checking whether the compression configuration is reasonable, so we provide these topology analysis tools to reduce that burden.
### ChannelDependency
Complicated models may contain residual connections or concat operations. When users prune such models, they need to be careful about the channel-count dependencies between the convolutional layers. Take the following residual block in ResNet-18 as an example. The output features of `layer2.0.conv2` and `layer2.0.downsample.0` are added together, so the numbers of output channels of `layer2.0.conv2` and `layer2.0.downsample.0` should be the same, or there may be a tensor shape conflict.
![](../../img/channel_dependency_example.jpg)
If layers that have a channel dependency are assigned different sparsities (here we only discuss structured pruning by L1FilterPruner/L2FilterPruner), there will be a shape conflict between these layers. Even if the masked model works fine, the pruned model cannot be directly sped up into the final model that runs on devices, because there will be a shape conflict when the model tries to add/concat the outputs of these layers. This tool finds the layers that have channel-count dependencies to help users prune their models better.
#### Usage
```python
from nni.compression.pytorch.utils.shape_dependency import ChannelDependency
data = torch.ones(1, 3, 224, 224).cuda()
channel_depen = ChannelDependency(net, data)
channel_depen.export('dependency.csv')
```
#### Output Example
The following lines are an output example for torchvision.models.resnet18 exported by ChannelDependency. The layers on the same line have output channel dependencies with each other. For example, layer1.1.conv2, conv1, and layer1.0.conv2 have output channel dependencies with each other, which means the numbers of output channels (filters) of these three layers should be the same; otherwise, the model may have a shape conflict.
```
Dependency Set,Convolutional Layers
Set 1,layer1.1.conv2,layer1.0.conv2,conv1
Set 2,layer1.0.conv1
Set 3,layer1.1.conv1
Set 4,layer2.0.conv1
Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
Set 6,layer2.1.conv1
Set 7,layer3.0.conv1
Set 8,layer3.0.downsample.0,layer3.1.conv2,layer3.0.conv2
Set 9,layer3.1.conv1
Set 10,layer4.0.conv1
Set 11,layer4.0.downsample.0,layer4.1.conv2,layer4.0.conv2
Set 12,layer4.1.conv1
```
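As an illustration of how this output might be consumed (a sketch that assumes the exported CSV has exactly the layout shown above), the following reads the dependency sets and builds a `config_list` that assigns the same sparsity to every layer in a set:
```python
import csv

def config_from_dependency(csv_path, sparsity=0.5):
    """Build a config_list that gives all layers in one dependency set the same sparsity."""
    config_list = []
    with open(csv_path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the "Dependency Set,Convolutional Layers" header
        for row in reader:
            layer_names = row[1:]  # everything after the set name is a layer in that set
            config_list.append({'sparsity': sparsity, 'op_names': layer_names})
    return config_list

print(config_from_dependency('dependency.csv', sparsity=0.5))
```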
### MaskConflict
When the masks of different layers in a model conflict (for example, when different sparsities are assigned to layers that have a channel dependency), we can fix the mask conflict with MaskConflict. Specifically, MaskConflict loads the masks exported by the pruners (L1FilterPruner, etc.), checks whether there is a mask conflict, and if so, sets the conflicting masks to the same value.
```
from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict
fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
```
## Model FLOPs/Parameters Counter
We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating the FLOPs/parameters of a normal model without masks, and it can also calculate the FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression on NNI. Note that, for structured pruning, we only identify the remaining filters according to the mask and do not take the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
We support two modes to collect information about modules. The first mode is `default`, which only collects the information of convolutional and linear layers. The second mode is `full`, which also collects the information of other operations. Users can easily use the collected `results` for further analysis.
### Usage
```
from nni.compression.pytorch.utils.counter import count_flops_params
# Given input size (1, 1, 28, 28)
flops, params, results = count_flops_params(model, (1, 1, 28, 28))
# Given input tensor with size (1, 1, 28, 28) and switch to full mode
x = torch.randn(1, 1, 28, 28)
flops, params, results = count_flops_params(model, (x,), mode='full') # tuple of tensors as input
# Format output size to M (i.e., 10^6)
print(f'FLOPs: {flops/1e6:.3f}M, Params: {params/1e6:.3f}M')
print(results)
{
'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
}
```
# Customize New Compression Algorithm
```eval_rst
.. contents::
```
In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface that covers both pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then demonstrate how to customize a new quantization algorithm.
**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. Reference [Framework overview of model compression](https://nni.readthedocs.io/en/latest/Compression/Framework.html)
## Customize a new pruning algorithm
Implementing a new pruning algorithm requires implementing a `weight masker` class, which should be a subclass of `WeightMasker`, and a `pruner` class, which should be a subclass of `Pruner`.
An implementation of `weight masker` may look like this:
```python
class MyMasker(WeightMasker):
    def __init__(self, model, pruner):
        super().__init__(model, pruner)
        # You can do some initialization here, such as collecting some statistics data
        # if it is necessary for your algorithm to calculate the masks.

    def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
        # calculate the mask based on the wrapper.weight, the sparsity,
        # and anything else
        # mask = ...
        return {'weight_mask': mask}
```
You can refer to NNI's built-in [weight masker](https://github.com/microsoft/nni/blob/v1.9/src/sdk/pynni/nni/compression/pytorch/pruning/structured_pruning.py) implementations when implementing your own weight masker.
A basic `pruner` looks like this:
```python
class MyPruner(Pruner):
    def __init__(self, model, config_list, optimizer):
        super().__init__(model, config_list, optimizer)
        self.set_wrappers_attribute("if_calculated", False)
        # construct a weight masker instance
        self.masker = MyMasker(model, self)

    def calc_mask(self, wrapper, wrapper_idx=None):
        sparsity = wrapper.config['sparsity']
        if wrapper.if_calculated:
            # Already pruned, do not prune again as a one-shot pruner
            return None
        else:
            # call your masker to actually calculate the mask for this layer
            masks = self.masker.calc_mask(sparsity=sparsity, wrapper=wrapper, wrapper_idx=wrapper_idx)
            wrapper.if_calculated = True
            return masks
```
Refer to NNI's built-in [pruner](https://github.com/microsoft/nni/blob/v1.9/src/sdk/pynni/nni/compression/pytorch/pruning/one_shot.py) implementations when implementing your own pruner class.
***
## Customize a new quantization algorithm
To write a new quantization algorithm, you can write a class that inherits `nni.compression.pytorch.Quantizer`. Then, override the member functions with the logic of your algorithm. The member function to override is `quantize_weight`. `quantize_weight` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by simply applying a mask.
```python
from nni.compression.pytorch import Quantizer
class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        """
        Suggest you to use the NNI defined spec for config
        """
        super().__init__(model, config_list)

    def quantize_weight(self, weight, config, **kwargs):
        """
        quantize should overload this method to quantize weight tensors.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        weight : Tensor
            weight that needs to be quantized
        config : dict
            the configuration for weight quantization
        """
        # Put your code to generate `new_weight` here
        return new_weight

    def quantize_output(self, output, config, **kwargs):
        """
        quantize should overload this method to quantize output.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        output : Tensor
            output that needs to be quantized
        config : dict
            the configuration for output quantization
        """
        # Put your code to generate `new_output` here
        return new_output

    def quantize_input(self, *inputs, config, **kwargs):
        """
        quantize should overload this method to quantize input.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        inputs : Tensor
            inputs that need to be quantized
        config : dict
            the configuration for inputs quantization
        """
        # Put your code to generate `new_input` here
        return new_input

    def update_epoch(self, epoch_num):
        pass

    def step(self):
        """
        Can do some processing based on the model or weights bound
        in the func bind_model
        """
        pass
```
### Customize backward function
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the [Straight-Through Estimator](https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste). Users can customize a backward function as follows:
```python
from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
class ClipGrad(QuantGrad):
    @staticmethod
    def quant_backward(tensor, grad_output, quant_type):
        """
        This method should be overridden by subclasses to provide a customized backward function;
        the default implementation is the Straight-Through Estimator.

        Parameters
        ----------
        tensor : Tensor
            input of quantization operation
        grad_output : Tensor
            gradient of the output of quantization operation
        quant_type : QuantType
            the type of quantization, it can be `QuantType.QUANT_INPUT`, `QuantType.QUANT_WEIGHT`, `QuantType.QUANT_OUTPUT`,
            you can define different behavior for different types.

        Returns
        -------
        tensor
            gradient of the input of quantization operation
        """
        # for the quantize_output function, set grad to zero if the absolute value of tensor is larger than 1
        if quant_type == QuantType.QUANT_OUTPUT:
            grad_output[torch.abs(tensor) > 1] = 0
        return grad_output


class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        super().__init__(model, config_list)
        # set your customized backward function to overwrite the default backward function
        self.quant_grad = ClipGrad
```
If you do not customize `QuantGrad`, the default backward is Straight-Through Estimator.
_Coming Soon_ ...
# Dependency-aware Mode for Filter Pruning
Currently, we have several filter pruning algorithms for convolutional layers: FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner. In these filter pruning algorithms, the pruner prunes each convolutional layer separately. While pruning a convolutional layer, the algorithm quantifies the importance of each filter based on some specific rule (such as the L1 norm) and prunes the less important filters.
As the [dependency analysis utils](./CompressionUtils.md) show, if the output channels of two convolutional layers (conv1, conv2) are added together, then these two conv layers have a channel dependency with each other (for more details, see [Compression Utils](./CompressionUtils.md)). Take the following figure as an example.
![](../../img/mask_conflict.jpg)
If we prune the first 50% of output channels (filters) of conv1 and the last 50% of output channels of conv2, then although both layers have 50% of their filters pruned, the speedup module still needs to add zeros to align the output channels. In this case, we cannot harvest the speed benefit from model pruning.
To better obtain the speed benefit of model pruning, we added a dependency-aware mode to the filter pruners. In the dependency-aware mode, the pruner prunes the model based not only on the L1 norm of each filter, but also on the topology of the whole network architecture.
In the dependency-aware mode (`dependency_aware` is set to `True`), the pruner tries to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
![](../../img/dependency-aware.jpg)
Take the dependency-aware mode of L1Filter Pruner as an example. Specifically, for each channel the pruner calculates the sum of the L1 norms (in this example) across all the layers in the dependency set. Obviously, the number of channels that can actually be pruned in this dependency set is determined by the minimum sparsity among the layers in the set (denoted by `min_sparsity`). According to the L1 norm sum of each channel, the pruner prunes the same `min_sparsity` fraction of channels for all the layers. Next, the pruner additionally prunes `sparsity` - `min_sparsity` channels for each convolutional layer based on its own per-channel L1 norms. For example, suppose the output channels of `conv1` and `conv2` are added together and the configured sparsities of `conv1` and `conv2` are 0.3 and 0.2 respectively. In this case, the `dependency-aware pruner` will
- First, prune the same 20% of channels for `conv1` and `conv2` according to L1 norm sum of `conv1` and `conv2`.
- Second, the pruner will additionally prune 10% channels for `conv1` according to the L1 norm of each channel of `conv1`.
In addition, for the convolutional layers that have more than one filter group, the `dependency-aware pruner` will also try to prune the same number of channels in each filter group. Overall, this pruner prunes the model according to the L1 norm of each filter while trying to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
## Usage
In this section, we show how to enable the dependency-aware mode for a filter pruner. Currently, only the one-shot pruners, such as FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner, support the dependency-aware mode.
To enable the dependency-aware mode for `L1FilterPruner`:
```python
from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
# dummy_input is necessary for the dependency_aware mode
dummy_input = torch.ones(1, 3, 224, 224).cuda()
pruner = L1FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
# for L2FilterPruner
# pruner = L2FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
# for FPGMPruner
# pruner = FPGMPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
# for ActivationAPoZRankFilterPruner
# pruner = ActivationAPoZRankFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
# for ActivationMeanRankFilterPruner
# pruner = ActivationMeanRankFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
# for TaylorFOWeightFilterPruner
# pruner = TaylorFOWeightFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
pruner.compress()
```
## Evaluation
In order to compare the performance of the pruner with and without the dependency-aware mode, we use L1FilterPruner to prune MobileNetV2 with the dependency-aware mode turned on and off. To simplify the experiment, we use uniform pruning, which means we allocate the same sparsity to all convolutional layers in the model.
We trained a MobileNetV2 model on the CIFAR-10 dataset and pruned the model based on this pretrained checkpoint. The following figure shows the accuracy and FLOPs of the model pruned by different pruners.
![](../../img/mobilev2_l1_cifar.jpg)
In the figure, `Dependency-aware` represents L1FilterPruner with the dependency-aware mode enabled, `L1 Filter` is the normal `L1FilterPruner` without the dependency-aware mode, and `No-Dependency` means the pruner only prunes the layers that have no channel dependency with other layers. As we can see, with the dependency-aware mode enabled, the pruner achieves higher accuracy under the same FLOPs.
# Framework overview of model compression
```eval_rst
.. contents::
```
The picture below shows an overview of the components of the model compression framework.
![](../../img/compressor_framework.jpg)
There are 3 major components/classes in NNI model compression framework: `Compressor`, `Pruner` and `Quantizer`. Let's look at them in detail one by one:
## Compressor
Compressor is the base class for pruners and quantizers. It provides a unified interface so that pruners and quantizers can be used by end users in the same way. For example, to use a pruner:
```python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
# load a pretrained model or train a model before using a pruner
configure_list = [{
'sparsity': 0.7,
'op_types': ['Conv2d', 'Linear'],
}]
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
pruner = LevelPruner(model, configure_list, optimizer)
model = pruner.compress()
# model is ready for pruning, now start finetune the model,
# the model will be pruned during training automatically
```
To use a quantizer:
```python
from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
configure_list = [{
'quant_types': ['weight'],
'quant_bits': {
'weight': 8,
},
'op_types':['Conv2d', 'Linear']
}]
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
quantizer = DoReFaQuantizer(model, configure_list, optimizer)
quantizer.compress()
```
View [example code](https://github.com/microsoft/nni/tree/v1.9/examples/model_compress) for more information.
The `Compressor` class provides some utility methods for subclasses and users:
### Set wrapper attribute
Sometimes `calc_mask` must save some state data, therefore users can use the `set_wrappers_attribute` API to register attributes, just like how buffers are registered in PyTorch modules. These buffers are registered to the `module wrapper` and can be accessed through it.
In the above example, we use `set_wrappers_attribute` to set a buffer `if_calculated`, which is used as a flag indicating whether the mask of a layer has already been calculated.
### Collect data during forward
Sometimes users want to collect some data during the modules' forward method, for example, the mean value of the activation. This can be done by adding a customized collector to the module.
```python
class MyMasker(WeightMasker):
    def __init__(self, model, pruner):
        super().__init__(model, pruner)
        # Set attribute `collected_activation` for all wrappers to store
        # the activations for each layer
        self.pruner.set_wrappers_attribute("collected_activation", [])
        self.activation = torch.nn.functional.relu

        def collector(wrapper, input_, output):
            # The collected activation can be accessed via each wrapper's
            # collected_activation attribute
            wrapper.collected_activation.append(self.activation(output.detach().cpu()))

        self.pruner.hook_id = self.pruner.add_activation_collector(collector)
```
The collector function will be called each time the forward method runs.
Users can also remove this collector like this:
```python
# Save the collector identifier
collector_id = self.pruner.add_activation_collector(collector)
# When the collector is not used any more, it can be removed using
# the saved collector identifier
self.pruner.remove_activation_collector(collector_id)
```
***
## Pruner
A pruner receives `model`, `config_list` and `optimizer` as arguments. It prunes the model per the `config_list` during the training loop by adding a hook on `optimizer.step()`.
The Pruner class is a subclass of Compressor, so it contains everything in the Compressor class plus some additional components used only for pruning:
### Weight masker
A `weight masker` is the implementation of pruning algorithms, it can prune a specified layer wrapped by `module wrapper` with specified sparsity.
### Pruning module wrapper
A `pruning module wrapper` is a module containing:
1. the origin module
2. some buffers used by `calc_mask`
3. a new forward method that applies masks before running the original forward method.
The reasons to use `module wrapper` are:
1. some buffers are needed by `calc_mask` to calculate masks and these buffers should be registered in `module wrapper` so that the original modules are not contaminated.
2. a new `forward` method is needed to apply masks to weight before calling the real `forward` method.
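For intuition only, here is a highly simplified sketch of such a wrapper; it is not NNI's actual `PrunerModuleWrapper` (which also handles bias masks, buffers for `calc_mask`, and other bookkeeping), just an illustration of the two points above.
```python
import torch

class ToyPrunerWrapper(torch.nn.Module):
    """Minimal illustration: keep the original module, a weight mask, and a masking forward."""
    def __init__(self, module):
        super().__init__()
        self.module = module
        # The buffer lives on the wrapper, so the wrapped module itself stays untouched.
        self.register_buffer('weight_mask', torch.ones_like(module.weight))

    def forward(self, *inputs):
        # Apply the mask to the weight before delegating to the original forward.
        self.module.weight.data.mul_(self.weight_mask)
        return self.module(*inputs)
```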
### Pruning hook
A pruning hook is installed on a pruner when the pruner is constructed; it is used to call the pruner's `calc_mask` method when `optimizer.step()` is invoked.
***
## Quantizer
Quantizer class is also a subclass of `Compressor`, it is used to compress models by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time. It contains:
### Quantization module wrapper
Each module/layer of the model to be quantized is wrapped by a quantization module wrapper, it provides a new `forward` method to quantize the original module's weight, input and output.
### Quantization hook
A quantization hook is installed on a quantizer when it is constructed; it is called when `optimizer.step()` is invoked.
### Quantization methods
The `Quantizer` class provides the following methods for subclasses to implement quantization algorithms:
```python
class Quantizer(Compressor):
    """
    Base quantizer for pytorch quantizer
    """

    def quantize_weight(self, weight, wrapper, **kwargs):
        """
        quantize should overload this method to quantize weight.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        weight : Tensor
            weight that needs to be quantized
        wrapper : QuantizerModuleWrapper
            the wrapper for origin module
        """
        raise NotImplementedError('Quantizer must overload quantize_weight()')

    def quantize_output(self, output, wrapper, **kwargs):
        """
        quantize should overload this method to quantize output.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        output : Tensor
            output that needs to be quantized
        wrapper : QuantizerModuleWrapper
            the wrapper for origin module
        """
        raise NotImplementedError('Quantizer must overload quantize_output()')

    def quantize_input(self, *inputs, wrapper, **kwargs):
        """
        quantize should overload this method to quantize input.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        inputs : Tensor
            inputs that need to be quantized
        wrapper : QuantizerModuleWrapper
            the wrapper for origin module
        """
        raise NotImplementedError('Quantizer must overload quantize_input()')
```
***
## Multi-GPU support
In multi-GPU training, buffers and parameters are copied to each GPU every time the `forward` method runs. If buffers and parameters are updated in the `forward` method, an `in-place` update is needed to ensure the update is effective.
Since `calc_mask` is called in the `optimizer.step` method, which happens after the `forward` method and happens only on one GPU, it supports multi-GPU naturally.
# Speed up Masked Model
*This feature is in Beta version.*
## Introduction
Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
to check model performance of a specific pruning (or sparsity), but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we try to provide a tool to users
to convert a model to a smaller one based on user provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights or input/output tensors; a sparse kernel is required to speed up a fine-grained pruned layer. The other is coarse-grained pruning (e.g., channel pruning), where the shapes of weights and input/output tensors usually change. To speed up this kind of pruning, there is no need for sparse kernels; we can simply replace the pruned layer with a smaller one. Since support for sparse kernels in the community is limited, we currently only support the speedup of coarse-grained pruning and leave fine-grained pruning for the future.
## Design and Implementation
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors, so we should do shape inference to check whether other unpruned layers should also be replaced due to the shape change. Therefore, in our design there are two main steps: first, do shape inference to find out all the modules that should be replaced; second, replace the modules. The first step requires the topology (i.e., connections) of the model; we use `jit.trace` to obtain the model graph for PyTorch.
For each module, we should prepare four functions: three for shape inference and one for module replacement. The three shape inference functions are: given the weight shape, infer the input/output shape; given the input shape, infer the weight/output shape; and given the output shape, infer the weight/input shape. The module replacement function returns a newly created, smaller module.
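As a simplified illustration of a module replacement function (a sketch, not NNI's internal implementation), the following builds a smaller `Conv2d` that keeps only the output channels whose indices are passed in:
```python
import torch

def replace_conv2d(conv, kept_out_idx):
    """Return a smaller Conv2d that keeps only the output channels in kept_out_idx (sketch)."""
    new_conv = torch.nn.Conv2d(in_channels=conv.in_channels,
                               out_channels=len(kept_out_idx),
                               kernel_size=conv.kernel_size,
                               stride=conv.stride,
                               padding=conv.padding,
                               bias=conv.bias is not None)
    # Copy the retained filters (and bias entries, if present) into the new module.
    new_conv.weight.data = conv.weight.data[kept_out_idx].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[kept_out_idx].clone()
    return new_conv
```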
## Usage
```python
from nni.compression.pytorch import ModelSpeedup
# model: the model you want to speed up
# dummy_input: dummy input of the model, given to `jit.trace`
# masks_file: the mask file created by pruning algorithms
m_speedup = ModelSpeedup(model, dummy_input.to(device), masks_file)
m_speedup.speedup_model()
dummy_input = dummy_input.to(device)
start = time.time()
out = model(dummy_input)
print('elapsed time: ', time.time() - start)
```
For complete examples please refer to [the code](https://github.com/microsoft/nni/tree/v1.9/examples/model_compress/model_speedup.py)
NOTE: The current implementation supports PyTorch 1.3.1 or newer.
## Limitations
Since every module requires four functions for shape inference and module replacement, this is a large amount of work, so we have only implemented the ones that are required by the examples. If you want to speed up your own model and it is not supported by the current implementation, you are welcome to contribute.
For PyTorch we can only replace modules; if functions in `forward` should be replaced, our current implementation does not work. One workaround is to make the function a PyTorch module.
## Speedup Results of Examples
The code of these experiments can be found [here](https://github.com/microsoft/nni/tree/v1.9/examples/model_compress/model_speedup.py).
### slim pruner example
on one V100 GPU,
input tensor: `torch.randn(64, 3, 32, 32)`
|Times| Mask Latency| Speedup Latency |
|---|---|---|
| 1 | 0.01197 | 0.005107 |
| 2 | 0.02019 | 0.008769 |
| 4 | 0.02733 | 0.014809 |
| 8 | 0.04310 | 0.027441 |
| 16 | 0.07731 | 0.05008 |
| 32 | 0.14464 | 0.10027 |
### fpgm pruner example
on cpu,
input tensor: `torch.randn(64, 1, 28, 28)`,
note: the variance across runs is large
|Times| Mask Latency| Speedup Latency |
|---|---|---|
| 1 | 0.01383 | 0.01839 |
| 2 | 0.01167 | 0.003558 |
| 4 | 0.01636 | 0.01088 |
| 40 | 0.14412 | 0.08268 |
| 40 | 1.29385 | 0.14408 |
| 40 | 0.41035 | 0.46162 |
| 400 | 6.29020 | 5.82143 |
### l1filter pruner example
on one V100 GPU,
input tensor: `torch.randn(64, 3, 32, 32)`
|Times| Mask Latency| Speedup Latency |
|---|---|---|
| 1 | 0.01026 | 0.003677 |
| 2 | 0.01657 | 0.008161 |
| 4 | 0.02458 | 0.020018 |
| 8 | 0.03498 | 0.025504 |
| 16 | 0.06757 | 0.047523 |
| 32 | 0.10487 | 0.086442 |
### APoZ pruner example
on one V100 GPU,
input tensor: `torch.randn(64, 3, 32, 32)`
|Times| Mask Latency| Speedup Latency |
|---|---|---|
| 1 | 0.01389 | 0.004208 |
| 2 | 0.01628 | 0.008310 |
| 4 | 0.02521 | 0.014008 |
| 8 | 0.03386 | 0.023923 |
| 16 | 0.06042 | 0.046183 |
| 32 | 0.12421 | 0.087113 |
# Model Compression with NNI
```eval_rst
.. contents::
```
As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem.
NNI provides a model compression toolkit to help users compress and speed up their models with state-of-the-art compression algorithms and strategies. There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
* Speed up a compressed model so that it has lower inference latency and a smaller size.
* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
* Concise interface for users to customize their own compression algorithms.
*Note that the interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.*
## Supported Algorithms
The algorithms include pruning algorithms and quantization algorithms.
### Pruning Algorithms
Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and address the over-fitting issue.
|Name|Brief Introduction of Algorithm|
|---|---|
| [Level Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#level-pruner) | Pruning the specified ratio on each weight based on absolute values of weights |
| [AGP Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#agp-pruner) | Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) [Reference Paper](https://arxiv.org/abs/1710.01878)|
| [Lottery Ticket Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#lottery-ticket-hypothesis) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)|
| [FPGM Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)|
| [L1Filter Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#l1filter-pruner) | Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) [Reference Paper](https://arxiv.org/abs/1608.08710) |
| [L2Filter Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#l2filter-pruner) | Pruning filters with the smallest L2 norm of weights in convolution layers |
| [ActivationAPoZRankFilterPruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#activationapozrankfilterpruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. [Reference Paper](https://arxiv.org/abs/1607.03250) |
| [ActivationMeanRankFilterPruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#activationmeanrankfilterpruner) | Pruning filters based on the metric that calculates the smallest mean value of output activations |
| [Slim Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers(Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) |
| [TaylorFO Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#taylorfoweightfilterpruner) | Pruning filters based on the first order taylor expansion on weights(Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) |
| [ADMM Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#admm-pruner) | Pruning based on ADMM optimization technique [Reference Paper](https://arxiv.org/abs/1804.03294) |
| [NetAdapt Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#netadapt-pruner) | Automatically simplify a pretrained network to meet the resource budget by iterative pruning [Reference Paper](https://arxiv.org/abs/1804.03230) |
| [SimulatedAnnealing Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#simulatedannealing-pruner) | Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm [Reference Paper](https://arxiv.org/abs/1907.03141) |
| [AutoCompress Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#autocompress-pruner) | Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner [Reference Paper](https://arxiv.org/abs/1907.03141) |
| [AMC Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#amc-pruner) | AMC: AutoML for Model Compression and Acceleration on Mobile Devices [Reference Paper](https://arxiv.org/pdf/1802.03494.pdf) |
You can refer to this [benchmark](https://github.com/microsoft/nni/tree/v1.9/docs/en_US/CommunitySharings/ModelCompressionComparison.md) for the performance of these pruners on some benchmark problems.
### Quantization Algorithms
Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time.
|Name|Brief Introduction of Algorithm|
|---|---|
| [Naive Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#naive-quantizer) | Quantize weights to default 8 bits |
| [QAT Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)|
| [DoReFa Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)|
| [BNN Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#bnn-quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. [Reference Paper](https://arxiv.org/abs/1602.02830)|
## Automatic Model Compression
Given a target compression ratio, it is pretty hard to obtain the best compressed model in one shot. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. A detailed document can be found [here](./AutoPruningUsingTuners.md).
## Model Speedup
The final goal of model compression is to reduce inference latency and model size. However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of compressed model, for example, using masks for pruning algorithms, and storing quantized values still in float32 for quantization algorithms. Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model. The detailed tutorial of Model Speedup can be found [here](./ModelSpeedup.md).
## Compression Utilities
Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users could check sensitivity of each layer to pruning. Users could easily calculate the FLOPs and parameter size of a model. Please refer to [here](./CompressionUtils.md) for a complete list of compression utilities.
## Customize Your Own Compression Algorithms
NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. The detailed tutorial for customizing a new compression algorithm (pruning algorithm or quantization algorithm) can be found [here](./Framework.md).
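To give a feel for the interface, below is a rough sketch of a custom magnitude-based pruner. The import path and the `calc_mask(wrapper)` signature are assumptions based on the current PyTorch implementation; please check the customization tutorial linked above for the authoritative interface.
```python
import torch
# assumed import path for the base class; see Framework.md for the exact location
from nni.compression.pytorch.compressor import Pruner

class MyMagnitudePruner(Pruner):
    """Sketch only: prune the smallest-magnitude weights of each configured layer."""

    def calc_mask(self, wrapper, **kwargs):
        weight = wrapper.module.weight.data
        sparsity = wrapper.config['sparsity']
        num_prune = int(weight.numel() * sparsity)
        if num_prune == 0:
            return {'weight_mask': torch.ones_like(weight)}
        # keep weights whose magnitude is above the k-th smallest value
        threshold = weight.abs().view(-1).kthvalue(num_prune).values
        return {'weight_mask': (weight.abs() > threshold).type_as(weight)}
```
Assuming the sketch matched the real interface, it would be used like the built-in pruners, e.g. `MyMagnitudePruner(model, config_list).compress()`.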
## Reference and Feedback
* To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub;
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub;
* To know more about [Feature Engineering with NNI](../FeatureEngineering/Overview.md);
* To know more about [NAS with NNI](../NAS/Overview.md);
* To know more about [Hyperparameter Tuning with NNI](../Tuner/BuiltinTuner.md);
# Supported Quantization Algorithms on NNI
Index of supported quantization algorithms
* [Naive Quantizer](#naive-quantizer)
* [QAT Quantizer](#qat-quantizer)
* [DoReFa Quantizer](#dorefa-quantizer)
* [BNN Quantizer](#bnn-quantizer)
## Naive Quantizer
We provide Naive Quantizer to quantize weights to 8 bits by default. You can use it to test a quantization algorithm without any configuration.
### Usage
PyTorch code
```python
model = nni.algorithms.compression.pytorch.quantization.NaiveQuantizer(model).compress()
```
***
## QAT Quantizer
In [Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf), authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training.
>We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme
>* Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are “folded into” the weights before quantization.
>* Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer’s output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.
### Usage
You can quantize your model to 8 bits with the code below before your training code.
PyTorch code
```python
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
model = Mnist()
config_list = [{
'quant_types': ['weight'],
'quant_bits': {
'weight': 8,
    }, # you can also use an `int` here when all `quant_types` share the same bit length; see the config for `ReLU6` below.
'op_types':['Conv2d', 'Linear']
}, {
'quant_types': ['output'],
'quant_bits': 8,
'quant_start_step': 7000,
'op_types':['ReLU6']
}]
quantizer = QAT_Quantizer(model, config_list)
quantizer.compress()
```
You can view the example for more information.
#### User configuration for QAT Quantizer
Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).
Configuration specific to this algorithm:
* **quant_start_step:** int
Disable quantization until the model has been run for a certain number of steps. This allows the network to reach a more stable
state where activation quantization ranges do not exclude a significant fraction of values. The default value is 0.
### Note
Batch normalization folding is currently not supported.
***
## DoReFa Quantizer
In [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160), authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training.
### Usage
To use DoReFa Quantizer, add the code below before your training code.
PyTorch code
```python
from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
config_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_types': 'default'
}]
quantizer = DoReFaQuantizer(model, config_list)
quantizer.compress()
```
You can view the example for more information.
#### User configuration for DoReFa Quantizer
Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).
This algorithm has no algorithm-specific configuration keys.
***
## BNN Quantizer
In [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830),
>We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
### Usage
PyTorch code
```python
from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
model = VGG_Cifar10(num_classes=10)
configure_list = [{
'quant_bits': 1,
'quant_types': ['weight'],
'op_types': ['Conv2d', 'Linear'],
'op_names': ['features.0', 'features.3', 'features.7', 'features.10', 'features.14', 'features.17', 'classifier.0', 'classifier.3']
}, {
'quant_bits': 1,
'quant_types': ['output'],
'op_types': ['Hardtanh'],
'op_names': ['features.6', 'features.9', 'features.13', 'features.16', 'features.20', 'classifier.2', 'classifier.5']
}]
quantizer = BNNQuantizer(model, configure_list)
model = quantizer.compress()
```
You can view example [examples/model_compress/BNN_quantizer_cifar10.py]( https://github.com/microsoft/nni/tree/v1.9/examples/model_compress/BNN_quantizer_cifar10.py) for more information.
#### User configuration for BNN Quantizer
Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).
This algorithm has no algorithm-specific configuration keys.
### Experiment
We implemented one of the experiments in [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830): we quantized the **VGGNet** for CIFAR-10 as in the paper. Our experiment results are as follows:
| Model | Accuracy |
| ------------- | --------- |
| VGGNet | 86.93% |
The experiments code can be found at [examples/model_compress/BNN_quantizer_cifar10.py]( https://github.com/microsoft/nni/tree/v1.9/examples/model_compress/BNN_quantizer_cifar10.py)
# Tutorial for Model Compression
```eval_rst
.. contents::
```
In this tutorial, we use the [first section](#quick-start-to-compress-a-model) to quickly go through the usage of model compression on NNI. Then use the [second section](#detailed-usage-guide) to explain more details of the usage.
## Quick Start to Compress a Model
NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. Their usage is the same, so here we use [slim pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#slim-pruner) as an example to show the usage.
### Write configuration
Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the `BatchNorm2d`s to sparsity 0.7 while keeping other layers unpruned.
```python
configure_list = [{
'sparsity': 0.7,
'op_types': ['BatchNorm2d'],
}]
```
The specification of configuration can be found [here](#specification-of-config-list). Note that different pruners may define their own fields in the configuration, for example `start_epoch` in AGP pruner. Please refer to each pruner's [usage](./Pruner.md) for details, and adjust the configuration accordingly.
### Choose a compression algorithm
Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke `compress()` to compress your model.
```python
from nni.algorithms.compression.pytorch.pruning import SlimPruner

pruner = SlimPruner(model, configure_list)
model = pruner.compress()
```
Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during training. Some pruners prune once at the beginning, and the following training can be seen as fine-tuning. Other pruners prune your model iteratively, and the masks are adjusted epoch by epoch during training.
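For instance, a fine-tuning loop after `pruner.compress()` could look like the following minimal sketch, where `optimizer`, `criterion`, `train_loader` and `fine_tune_epochs` are assumed to be defined by your own training code:
```python
for epoch in range(fine_tune_epochs):
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        # masks are applied transparently in the forward pass,
        # so pruned weights do not contribute to the output
        optimizer.step()
```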
### Export compression result
After training, you get the accuracy of the pruned model. You can export the model weights to a file and the generated masks to a file as well. Exporting an ONNX model is also supported.
```python
pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
```
The complete code of model compression examples can be found [here](https://github.com/microsoft/nni/blob/v1.9/examples/model_compress/model_prune_torch.py).
### Speed up the model
Masks do not provide a real speedup of your model. The model should be sped up based on the exported masks, so we provide an API to speed up your model as shown below. After invoking `apply_compression_results` on your model, your model becomes a smaller one with shorter inference latency.
```python
from nni.compression.pytorch import apply_compression_results
apply_compression_results(model, 'mask_vgg19_cifar10.pth')
```
Please refer to [here](ModelSpeedup.md) for detailed description.
## Detailed Usage Guide
The example code for users to apply model compression on a user model can be found below:
PyTorch code
```python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(model, config_list)
pruner.compress()
```
TensorFlow code
```python
from nni.algorithms.compression.tensorflow.pruning import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(tf.get_default_graph(), config_list)
pruner.compress()
```
You can use other compression algorithms in the `nni.compression` package. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under `nni.compression.pytorch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detailed descriptions of the supported algorithms. If you want to use knowledge distillation, refer to [KDExample](../TrialExample/KDExample.md).
A compression algorithm is first instantiated with a `config_list` passed in. The specification of this `config_list` will be described later.
The function call `pruner.compress()` modifies the user-defined model (in TensorFlow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class) by inserting masks. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
*Note that `pruner.compress` simply adds masks on model weights; it does not include fine-tuning logic. If users want to fine-tune the compressed model, they need to write the fine-tuning logic themselves after `pruner.compress`.*
### Specification of `config_list`
Users can specify the configuration (i.e., `config_list`) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python `list` object, where each element is a `dict` object.
The `dict`s in the `list` are applied one by one, that is, the configurations in latter `dict` will overwrite the configurations in former ones on the operations that are within the scope of both of them.
There are different keys in a `dict`. Some of them are common keys supported by all the compression algorithms:
* __op_types__: This is to specify which types of operations to compress. 'default' means following the algorithm's default setting.
* __op_names__: This is to specify by name which operations to compress. If this field is omitted, operations will not be filtered by it.
* __exclude__: Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.
Some other keys are specific to certain algorithms; users can refer to [pruning algorithms](./Pruner.md) and [quantization algorithms](./Quantizer.md) for the keys allowed by each algorithm.
A simple example of configuration is shown below:
```python
[
{
'sparsity': 0.8,
'op_types': ['default']
},
{
'sparsity': 0.6,
'op_names': ['op_name1', 'op_name2']
},
{
'exclude': True,
'op_names': ['op_name3']
}
]
```
It means: follow the algorithm's default setting for compressed operations with sparsity 0.8, but use sparsity 0.6 for `op_name1` and `op_name2`, and do not compress `op_name3`.
#### Quantization specific keys
Besides the keys explained above, if you use quantization algorithms you need to specify more keys in `config_list`, which are explained below.
* __quant_types__ : list of strings.
The type of quantization you want to apply; currently 'weight', 'input' and 'output' are supported. 'weight' means applying quantization to the weight parameter of modules. 'input' means applying quantization to the input of a module's forward method. 'output' means applying quantization to the output of a module's forward method, which is often called 'activation' in some papers.
* __quant_bits__ : int or dict of {str : int}.
The bit length of quantization; the key is the quantization type and the value is the bit length, e.g.
```python
{
    'quant_bits': {
        'weight': 8,
        'output': 4,
    },
}
```
When the value is of int type, all quantization types share the same bit length, e.g.
```python
{
    'quant_bits': 8,  # weight and output quantization both use 8 bits
}
```
The following example shows a more complete `config_list`. It uses `op_names` (or `op_types`) to specify the target layers along with the quantization bits for those layers.
```python
configure_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_names': ['conv1']
}, {
'quant_types': ['weight'],
'quant_bits': 4,
'quant_start_step': 0,
'op_names': ['conv2']
}, {
'quant_types': ['weight'],
'quant_bits': 3,
'op_names': ['fc1']
},
{
'quant_types': ['weight'],
'quant_bits': 2,
'op_names': ['fc2']
}
]
```
In this example, `op_names` specifies the layers by name, and the four layers will be quantized with different `quant_bits`.
### APIs for Updating Fine Tuning Status
Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`.
`update_epoch` should be invoked at every epoch, while `step` should be invoked after each minibatch. Note that most algorithms do not require calling these two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
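As a sketch of where the two calls fit in a training loop (`num_epochs` and `train_one_batch` are placeholders for your own training code, not NNI APIs):
```python
for epoch in range(num_epochs):
    pruner.update_epoch(epoch)            # once per epoch, before training on that epoch
    for data, target in train_loader:
        train_one_batch(model, data, target)  # placeholder: the usual forward/backward/optimizer step
        pruner.step()                     # once per minibatch, for algorithms that need it
```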
### Export Compressed Model
If you are pruning your model, you can easily export the compressed model using the following API. The `state_dict` of the sparse model weights will be stored in `model.pth`, which can be loaded by `torch.load('model.pth')`. In the exported `model.pth`, the masked weights are zero.
```python
pruner.export_model(model_path='model.pth')
```
`mask_dict` and the pruned model in `onnx` format (`input_shape` needs to be specified) can also be exported like this:
```python
pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
```
If you want to really speed up the compressed model, please refer to [NNI model speedup](./ModelSpeedup.md) for details.
## GBDTSelector
GBDTSelector is based on [LightGBM](https://github.com/microsoft/LightGBM), which is a gradient boosting framework that uses tree-based learning algorithms.
When passing the data into the GBDT model, the model will construct the boosting tree. And the feature importance comes from the score in construction, which indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model.
We could use this method as a strong baseline in Feature Selector, especially when using the GBDT model as a classifier or regressor.
For now, the supported `importance_type` values are `split` and `gain`. We will support customized `importance_type` in the future, which means users could define how to calculate the `feature score` by themselves.
### Usage
First you need to install dependency:
```
pip install lightgbm
```
Then
```python
from sklearn.model_selection import train_test_split
from nni.feature_engineering.gbdt_selector import GBDTSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = GBDTSelector()
# fit data
fgs.fit(X_train, y_train, ...)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features(10))
...
```
You can also refer to the examples in `/examples/feature_engineering/gbdt_selector/`.
**Requirement of `fit` FuncArgs**
* **X** (array-like, required) - The training input samples with shape [n_samples, n_features].
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression) with shape [n_samples].
* **lgb_params** (dict, required) - The parameters for the lightgbm model. For details, see [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).
* **eval_ratio** (float, required) - The ratio of the data size used to split the eval data and train data from self.X.
* **early_stopping_rounds** (int, required) - The early stopping setting in lightgbm. For details, see [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).
* **importance_type** (str, required) - Could be 'split' or 'gain'. 'split' means the result contains the number of times the feature is used in a model, and 'gain' means the result contains the total gains of the splits which use the feature. For details, see [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance).
* **num_boost_round** (int, required) - The number of boosting rounds. For details, see [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train).
**Requirement of `get_selected_features` FuncArgs**
* **topk** (int, required) - The top-k important features you want to select.
## GradientFeatureSelector
The algorithm in GradientFeatureSelector comes from ["Feature Gradients: Scalable Feature Selection via Discrete Relaxation"](https://arxiv.org/pdf/1908.10382.pdf).
GradientFeatureSelector is a gradient-based search algorithm for feature selection.
1) This approach extends a recent result on the estimation of
learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.
2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.
3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
### Usage
```python
from sklearn.model_selection import train_test_split
from nni.feature_engineering.gradient_selector import FeatureGradientSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = FeatureGradientSelector(n_features=10)
# fit data
fgs.fit(X_train, y_train)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features())
...
```
You can also refer to the examples in `/examples/feature_engineering/gradient_feature_selector/`.
**Parameters of class FeatureGradientSelector constructor**
* **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.
* **penatly** (int, optional, default = 1) - Constant that multiplies the regularization term.
* **n_features** (int, optional, default = None) - If None, will automatically choose number of features based on search. Otherwise, the number of top features to select.
* **max_features** (int, optional, default = None) - If not None, will use the 'elbow method' to determine the number of features with max_features as the upper limit.
* **learning_rate** (float, optional, default = 1e-1) - learning rate
* **init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*) - How to initialize the vector of scores. 'zero' is the default.
* **n_epochs** (int, optional, default = 1) - number of epochs to run
* **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.
* **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.
* **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.
* **classification** (bool, optional, default = True) - If True, problem is classification, else regression.
* **ordinal** (bool, optional, default = True) - If True, problem is ordinal classification. Requires classification to be True.
* **balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization; otherwise weighting is done via the support of each class. Requires classification to be True.
* **prerocess** (str, optional, default = 'zscore') - 'zscore', which refers to centering and normalizing the data to unit variance, or 'center', which only centers the data to 0 mean.
* **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and features within groups.
* **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing, or to 1 or higher to print every `verbose` number of gradient steps.
* **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU.
**Requirement of `fit` FuncArgs**
* **X** (array-like, required) - The training input samples with shape [n_samples, n_features].
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression) with shape [n_samples].
* **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit, e.g. [0, 0, 1, 2] specifies that the first two columns are part of a group. Shape is [n_features].
**Requirement of `get_selected_features` FuncArgs**
For now, the `get_selected_features` function has no parameters.
# Customize a NAS Algorithm
## Extend the Ability of One-Shot Trainers
Users might want to do multiple things when using the trainers on real tasks, for example distributed training, half-precision training, periodic logging, writing to TensorBoard, dumping checkpoints and so on. As mentioned previously, some trainers do have support for some of the items listed above; others might not. Generally, there are two recommended ways to add anything you want to an existing trainer: inherit an existing trainer and override, or copy an existing trainer and modify it.
Either way, you are stepping into the scope of implementing a new trainer. Basically, implementing a one-shot trainer is no different from implementing any traditional deep learning trainer, except that a new concept called a mutator will reveal itself. The implementation differs in at least two places:
* Initialization
```python
model = Model()
mutator = MyMutator(model)
```
* Training
```python
for _ in range(epochs):
for x, y in data_loader:
mutator.reset() # reset all the choices in model
out = model(x) # like traditional model
loss = criterion(out, y)
loss.backward()
# no difference below
```
To demonstrate what mutators are for, we need to know how one-shot NAS normally works. Usually, one-shot NAS "co-optimizes model weights and architecture weights". It repeatedly samples an architecture (or a combination of several architectures) from the supernet, trains the chosen architectures like a traditional deep learning model, updates the trained parameters to the supernet, and uses the metrics or loss as a signal to guide the architecture sampler. The mutator is the architecture sampler here, and it is often defined as another deep learning model. Therefore, you can treat it as any model: define parameters in it and optimize it with optimizers. One mutator is initialized with exactly one model. Once a mutator is bound to a model, it cannot be rebound to another model.
`mutator.reset()` is the core step. That's where all the choices in the model are finalized. The reset result remains effective until the next reset flushes the data. After the reset, the model can be seen as a traditional model and used for forward and backward passes.
Finally, mutators provide a method called `mutator.export()` that exports a dict describing the chosen architecture of the model. Note that currently this dict is a mapping from keys of mutables to tensors of selections. So in order to dump it to JSON, users need to convert the tensors explicitly into Python lists.
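For instance, the exported architecture could be dumped to JSON with a small helper like the following sketch, assuming `mutator` has already been bound to a model:
```python
import json

arch = mutator.export()  # a dict: {mutable key: tensor of selections}
# tensors are not JSON-serializable, so convert them to plain Python lists first
serializable = {key: tensor.tolist() for key, tensor in arch.items()}
with open('architecture.json', 'w') as f:
    json.dump(serializable, f, indent=2)
```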
Meanwhile, NNI provides some useful tools so that users can implement trainers more easily. See [Trainers](./NasReference.md) for details.
## Implement New Mutators
To start with, here is the pseudo-code that demonstrates what happens on `mutator.reset()` and `mutator.export()`.
```python
def reset(self):
self.apply_on_model(self.sample_search())
```
```python
def export(self):
return self.sample_final()
```
On reset, a new architecture is sampled with `sample_search()` and applied to the model. Then the model is trained for one or more steps in the search phase. On export, a new architecture is sampled with `sample_final()` and **nothing is applied to the model**. This is used either for checkpointing or for exporting the final architecture.
The requirements on the return values of `sample_search()` and `sample_final()` are the same: a mapping from mutable keys to tensors. The tensor can be either a BoolTensor (true for selected, false for not selected), or a FloatTensor that applies a weight to each candidate. The selected branches are then computed (in `LayerChoice`, modules are called; in `InputChoice`, it's just the tensors themselves) and reduced with the reduction operation specified in the choices. For most algorithms, which only care about the former case, here is an example mutator implementation.
```python
import numpy as np
import torch
from nni.nas.pytorch.mutables import InputChoice, LayerChoice
from nni.nas.pytorch.mutator import Mutator

class RandomMutator(Mutator):
def __init__(self, model):
super().__init__(model) # don't forget to call super
# do something else
def sample_search(self):
result = dict()
for mutable in self.mutables: # this is all the mutable modules in user model
            # mutables sharing the same key will be de-duplicated
if isinstance(mutable, LayerChoice):
# decided that this mutable should choose `gen_index`
gen_index = np.random.randint(mutable.length)
result[mutable.key] = torch.tensor([i == gen_index for i in range(mutable.length)],
dtype=torch.bool)
elif isinstance(mutable, InputChoice):
if mutable.n_chosen is None: # n_chosen is None, then choose any number
result[mutable.key] = torch.randint(high=2, size=(mutable.n_candidates,)).view(-1).bool()
# else do something else
return result
def sample_final(self):
return self.sample_search() # use the same logic here. you can do something different
```
The complete example of random mutator can be found [here](https://github.com/microsoft/nni/blob/v1.9/src/sdk/pynni/nni/nas/pytorch/random/mutator.py).
For advanced usage, e.g., manipulating the way modules in `LayerChoice` are executed, users can inherit `BaseMutator` and override `on_forward_layer_choice` and `on_forward_input_choice`, which are the callback implementations of `LayerChoice` and `InputChoice` respectively. Users can still use the property `mutables` to get all `LayerChoice` and `InputChoice` instances in the model code. For details, please refer to the [reference](https://github.com/microsoft/nni/tree/v1.9/src/sdk/pynni/nni/nas/pytorch).
```eval_rst
.. tip::
    A useful application of the random mutator is debugging. Running

    .. code-block:: python

        mutator = RandomMutator(model)
        mutator.reset()

    will immediately set one possible candidate in the search space as the active one.
```
## Implement a Distributed NAS Tuner
Before learning how to write a distributed NAS tuner, users should first learn how to write a general tuner; read [Customize Tuner](../Tuner/CustomizeTuner.md) for a tutorial.
When users call "[nnictl ss_gen](../Tutorial/Nnictl.md)" to generate search space file, a search space file like this will be generated:
```json
{
"key_name": {
"_type": "layer_choice",
"_value": ["op1_repr", "op2_repr", "op3_repr"]
},
"key_name": {
"_type": "input_choice",
"_value": {
"candidates": ["in1_key", "in2_key", "in3_key"],
"n_chosen": 1
}
}
}
```
This is the exact search space tuners will receive in `update_search_space`. It's then tuners' responsibility to interpret the search space and generate new candidates in `generate_parameters`. A valid "parameters" will be in the following format:
```json
{
    "key_name": {
        "_value": "op1_repr",
        "_idx": 0
    },
    "key_name": {
        "_value": ["in2_key"],
        "_idx": [1]
    }
}
```
Send it through `generate_parameters`, and the tuner will look like any HPO tuner. Refer to the [SPOS](./SPOS.md) example code for a complete example.
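As a rough, hedged sketch (this is not an existing NNI tuner; the class name and the random-search logic are illustrative), a distributed NAS tuner that consumes the search space above and emits parameters in this format could look like:
```python
import random
from nni.tuner import Tuner

class RandomNASTuner(Tuner):
    """Illustrative sketch of a distributed NAS tuner doing pure random search."""

    def update_search_space(self, search_space):
        self.search_space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        params = {}
        for key, spec in self.search_space.items():
            if spec['_type'] == 'layer_choice':
                # pick one operation for this layer choice
                idx = random.randrange(len(spec['_value']))
                params[key] = {'_value': spec['_value'][idx], '_idx': idx}
            elif spec['_type'] == 'input_choice':
                # pick n_chosen candidate inputs for this input choice
                candidates = spec['_value']['candidates']
                n_chosen = spec['_value']['n_chosen']
                idxs = sorted(random.sample(range(len(candidates)), n_chosen))
                params[key] = {'_value': [candidates[i] for i in idxs], '_idx': idxs}
        return params

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        pass  # random search ignores trial results
```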