@@ -46,7 +46,7 @@ Furthermore, users can specify the sparsity values used to prune for each layer
the SensitivityAnalysis will prune 25%, 50%, and 75% of the weights for each layer in turn, recording the model's accuracy at each step (SensitivityAnalysis prunes only one layer at a time; the other layers keep their original weights). If sparsities is not set, SensitivityAnalysis uses numpy.arange(0.1, 1.0, 0.1) as the default sparsity values.
Users can also speed up the sensitivity analysis with the early_stop_mode and early_stop_value options. By default, SensitivityAnalysis tests the accuracy under all sparsities for each layer. When early_stop_mode and early_stop_value are set, the analysis for a layer stops as soon as the accuracy/loss meets the threshold set by early_stop_value. We support four early stop modes: minimize, maximize, dropped, raised.
minimize: The analysis stops when the validation metric returned by the val_func is lower than ``early_stop_value``.
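For example, a run with early stop might look like the following sketch (the model ``net``, the ``val_loader``, and the threshold values here are illustrative, not prescribed by NNI):

.. code-block:: python

    import torch
    from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis

    def val_func(model):
        # return a scalar validation metric, e.g. top-1 accuracy
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for data, label in val_loader:  # assumed to exist
                pred = model(data).argmax(dim=1)
                correct += (pred == label).sum().item()
                total += label.size(0)
        return correct / total

    s_analyzer = SensitivityAnalysis(
        model=net,
        val_func=val_func,
        sparsities=[0.25, 0.5, 0.75],
        early_stop_mode='minimize',   # stop scanning a layer once the metric
        early_stop_value=0.8,         # returned by val_func drops below 0.8
    )
    sensitivity = s_analyzer.analysis(val_args=[net])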
@@ -38,7 +38,7 @@ There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate the model pruning and quantization process with state-of-the-art strategies and NNI's auto-tuning power.
* Speed up a compressed model to give it lower inference latency and a smaller size.
* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
* Concise interface for users to customize their own compression algorithms.
...
@@ -54,7 +54,7 @@ If users want to apply both, a sequential mode is recommended as common practice
.. note::
Note that NNI pruners and quantizers do not physically compact the model; they only simulate the compression effect. In contrast, the NNI speedup tool can truly compress a model by changing its network architecture, thereby reducing latency.
To obtain a truly compact model, users should conduct :doc:`pruning speedup <../tutorials/pruning_speedup>` or :doc:`quantization speedup <../tutorials/quantization_speedup>`.
The interface and APIs are unified for both PyTorch and TensorFlow. Currently only the PyTorch version is supported; the TensorFlow version will be supported in the future.
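To illustrate the distinction made in the note above, a pruner only simulates compression by attaching masks (a minimal sketch, assuming a PyTorch ``model``; the config values are illustrative):

.. code-block:: python

    from nni.compression.pytorch.pruning import L1NormPruner

    # prune 50% of the weights in every Conv2d layer; this only attaches
    # masks, so layer shapes stay unchanged until the speedup tool is applied
    config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
    pruner = L1NormPruner(model, config_list)
    _, masks = pruner.compress()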
...
@@ -131,7 +131,7 @@ Quantization algorithms compress the original network by reducing the number of
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
For example, pruning algorithms use masks, and quantization algorithms still store quantized values in float32.
Given the output masks and quantization bits produced by those algorithms, NNI can truly speed up the model.
The following figure shows how NNI prunes and speeds up your models.
...
@@ -140,8 +140,8 @@ The following figure shows how NNI prunes and speeds up your models.
:scale: 40%
:alt:
The detailed tutorial of Speedup Model with Mask can be found :doc:`here <../tutorials/pruning_speedup>`.
The detailed tutorial of Speedup Model with Calibration Config can be found :doc:`here <../tutorials/quantization_speedup>`.
The experiment results are all collected with the default configuration of the pruners in NNI; that is, when we call a pruner class in NNI, we do not change any of its default arguments.
*
Both FLOPs and the number of parameters are counted with :githublink:`Model FLOPs/Parameters Counter <docs/en_US/Compression/CompressionUtils.md#model-flopsparameters-counter>` after :githublink:`model speedup <docs/en_US/Compression/ModelSpeedup.rst>`.
This avoids the potential issues of counting them on masked models (see the sketch below).
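For instance, the counting step might look like this sketch (the input shape is illustrative, and the import path is the one used by recent NNI versions):

.. code-block:: python

    import torch
    from nni.compression.pytorch.utils.counter import count_flops_params

    # count on the physically compacted (sped-up) model, not the masked one
    flops, params, _ = count_flops_params(model, torch.randn(1, 3, 224, 224))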
"Speedup the original model with masks, note that `ModelSpeedup` requires an unwrapped model.\nThe model becomes smaller after speed-up,\nand reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.\n\n"
"Speedup the original model with masks, note that `ModelSpeedup` requires an unwrapped model.\nThe model becomes smaller after speedup,\nand reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.\n\n"
]
},
{
...
@@ -109,14 +109,14 @@
},
"outputs": [],
"source": [
"# need to unwrap the model, if the model is wrapped before speedup\npruner._unwrap_model()\n\n# speedup the model\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
"# need to unwrap the model, if the model is wrapped before speedup\npruner._unwrap_model()\n\n# speedup the model\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"the model will become real smaller after speedup\n\n"
"the model will become real smaller after speedup\n\n"
]
},
{
...
@@ -134,7 +134,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fine-tuning Compacted Model\nNote that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning.\nBecause speedup will replace the masked big layers with dense small ones.\n\n"
"## Fine-tuning Compacted Model\nNote that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning.\nBecause speedup will replace the masked big layers with dense small ones.\n\n"