"packaging/vscode:/vscode.git/clone" did not exist on "5c0249b0257ff95411e8b92e3d75a62e05a7e92e"
Commit e773dfcc authored by qianyj's avatar qianyj
Browse files

create branch for v2.9

parents
This source diff could not be displayed because it is too large. You can view the blob instead.
Knowledge Distillation on NNI
=============================
KnowledgeDistill
----------------
Knowledge Distillation (KD) is proposed in `Distilling the Knowledge in a Neural Network <https://arxiv.org/abs/1503.02531>`__. The compressed model is trained to mimic a pre-trained, larger model. This training setting is also referred to as "teacher-student", where the large model is the teacher and the small model is the student. KD is often used to fine-tune a pruned model.
.. image:: ../../img/distill.png
:target: ../../img/distill.png
:alt:
Usage
^^^^^
PyTorch code
.. code-block:: python
   for batch_idx, (data, target) in enumerate(train_loader):
       data, target = data.to(device), target.to(device)
       optimizer.zero_grad()

       y_s = model_s(data)  # student logits
       y_t = model_t(data)  # teacher logits
       loss_cri = F.cross_entropy(y_s, target)

       # kd loss
       p_s = F.log_softmax(y_s / kd_T, dim=1)
       p_t = F.softmax(y_t / kd_T, dim=1)
       loss_kd = F.kl_div(p_s, p_t, size_average=False) * (kd_T ** 2) / y_s.shape[0]

       # total loss
       loss = loss_cri + loss_kd
       loss.backward()
       optimizer.step()
The complete code for fine-tuning the pruned model can be found :githublink:`here <examples/model_compress/pruning/legacy/finetune_kd_torch.py>`.
.. code-block:: bash
python finetune_kd_torch.py --model [model name] --teacher-model-dir [pretrained checkpoint path] --student-model-dir [pruned checkpoint path] --mask-path [mask file path]
Note: to fine-tune a pruned model, run :githublink:`basic_pruners_torch.py <examples/model_compress/pruning/legacy/basic_pruners_torch.py>` first to obtain the mask file, then pass the mask path as an argument to the script.
Comparison of Filter Pruning Algorithms
=======================================
To provide an initial insight into the performance of various filter pruning algorithms,
we conduct extensive experiments with various pruning algorithms on some benchmark models and datasets.
We present the experiment results in this document.
In addition, we provide friendly instructions on the re-implementation of these experiments to facilitate further contributions to this effort.
Experiment Setting
------------------
The experiments are performed with the following pruners/datasets/models:
*
Models: :githublink:`VGG16, ResNet18, ResNet50 <examples/model_compress/models/cifar10>`
*
Datasets: CIFAR-10
*
Pruners:
* The following pruners are included:
* Pruners with scheduling: ``SimulatedAnnealing Pruner``\ , ``NetAdapt Pruner``\ , ``AutoCompress Pruner``.
Given the overall sparsity requirement, these pruners can automatically generate a sparsity distribution among the different layers.
* One-shot pruners: ``L1Filter Pruner``\ , ``L2Filter Pruner``\ , ``FPGM Pruner``.
In this experiment, the sparsity of each layer is set equal to the overall sparsity.
*
Only **filter pruning** performance is compared here.
For the pruners with scheduling, ``L1Filter Pruner`` is used as the base algorithm. That is to say, after the sparsity distribution is decided by the scheduling algorithm, ``L1Filter Pruner`` is used to perform the actual pruning.
*
All the pruners listed above are implemented in :doc:`nni </compression/overview>`.
Experiment Result
-----------------
For each dataset/model/pruner combination, we prune the model to different levels by setting a series of target sparsities for the pruner.
Here we plot both the **Number of Weights - Performance** curve and the **FLOPs - Performance** curve.
As a reference, we also plot the result declared in the paper `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <http://arxiv.org/abs/1907.03141>`__ for models VGG16 and ResNet18 on CIFAR-10.
The experiment results are shown in the following figures:
CIFAR-10, VGG16:
.. image:: ../../../examples/model_compress/pruning/legacy/comparison_of_pruners/img/performance_comparison_vgg16.png
:target: ../../../examples/model_compress/pruning/legacy/comparison_of_pruners/img/performance_comparison_vgg16.png
:alt:
CIFAR-10, ResNet18:
.. image:: ../../../examples/model_compress/pruning/legacy/comparison_of_pruners/img/performance_comparison_resnet18.png
:target: ../../../examples/model_compress/pruning/legacy/comparison_of_pruners/img/performance_comparison_resnet18.png
:alt:
CIFAR-10, ResNet50:
.. image:: ../../../examples/model_compress/pruning/legacy/comparison_of_pruners/img/performance_comparison_resnet50.png
:target: ../../../examples/model_compress/pruning/legacy/comparison_of_pruners/img/performance_comparison_resnet50.png
:alt:
Analysis
--------
From the experiment result, we get the following conclusions:
* Given a constraint on the number of parameters, the pruners with scheduling (``AutoCompress Pruner``\ , ``SimulatedAnnealing Pruner``) perform better than the others when the constraint is strict. However, they have no such advantage in the FLOPs/performance comparison, since only the number-of-parameters constraint is considered in the optimization process;
* The basic algorithms ``L1Filter Pruner``\ , ``L2Filter Pruner`` and ``FPGM Pruner`` perform very similarly in these experiments;
* ``NetAdapt Pruner`` cannot achieve a very high compression rate. This is caused by its mechanism of pruning only one layer per pruning iteration, which leads to unacceptable complexity if the sparsity per iteration is much lower than the overall sparsity constraint.
Experiments Reproduction
------------------------
Implementation Details
^^^^^^^^^^^^^^^^^^^^^^
* The experiment results are all collected with the default configuration of the pruners in NNI, i.e., when we call a pruner class we do not change any of its default class arguments. A minimal sketch of such a call is shown after this list.
* Both FLOPs and the number of parameters are counted with :ref:`Model FLOPs/Parameters Counter <flops-counter>` after :doc:`model speedup </tutorials/pruning_speedup>`.
This avoids potential issues with counting them on masked models.
* The experiment code can be found :githublink:`here <examples/model_compress/pruning/legacy/auto_pruners_torch.py>`.
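For reference, below is a minimal sketch of what such a run could look like. It assumes the legacy NNI compression API (``L1FilterPruner``\ , ``ModelSpeedup`` and ``count_flops_params``); the file names and the exact model are placeholders, and the actual experiment script linked above may differ.

.. code-block:: python

   import torch
   import torchvision
   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
   from nni.compression.pytorch import ModelSpeedup
   from nni.compression.pytorch.utils.counter import count_flops_params

   dummy_input = torch.rand(1, 3, 32, 32)                        # CIFAR-10 shaped input
   model = torchvision.models.vgg16(num_classes=10)

   # call the pruner with its default class arguments only
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]     # one overall sparsity for all conv layers
   pruner = L1FilterPruner(model, config_list)
   pruner.compress()
   pruner.export_model(model_path='pruned_vgg16.pth', mask_path='mask_vgg16.pth')

   # apply speedup on a fresh copy of the model so masked filters are really removed,
   # then count FLOPs and parameters of the compacted model
   model = torchvision.models.vgg16(num_classes=10)
   model.load_state_dict(torch.load('pruned_vgg16.pth'))
   ModelSpeedup(model, dummy_input, 'mask_vgg16.pth').speedup_model()
   flops, params, _ = count_flops_params(model, dummy_input)
   print(f'FLOPs: {flops}, #Params: {params}')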
Experiment Result Rendering
^^^^^^^^^^^^^^^^^^^^^^^^^^^
*
If you follow the practice in the :githublink:`example <examples/model_compress/pruning/legacy/auto_pruners_torch.py>`\ , for every single pruning experiment, the experiment result will be saved in JSON format as follows:
.. code-block:: json
{
"performance": {"original": 0.9298, "pruned": 0.1, "speedup": 0.1, "finetuned": 0.7746},
"params": {"original": 14987722.0, "speedup": 167089.0},
"flops": {"original": 314018314.0, "speedup": 38589922.0}
}
*
The experiment results are saved :githublink:`here <examples/model_compress/pruning/legacy/comparison_of_pruners>`.
You can refer to :githublink:`analyze <examples/model_compress/pruning/legacy/comparison_of_pruners/analyze.py>` to plot new performance comparison figures.
Contribution
------------
TODO Items
^^^^^^^^^^
* Pruners constrained by FLOPs/latency
* More pruning algorithms/datasets/models
Issues
^^^^^^
For algorithm implementation & experiment issues, please `create an issue <https://github.com/microsoft/nni/issues/new/>`__.
Model Compression
=================
.. toctree::
:maxdepth: 1
Knowledge distillation with NNI model compression <kd_example>
Neural Architecture Search Comparison
=====================================
*Posted by Anonymous Author*
Here we train and compare NAS (Neural Architecture Search) models, including AutoKeras, DARTS, ENAS and NAO.
Their source code links are listed below:
*
Autokeras: `https://github.com/jhfjhfj1/autokeras <https://github.com/jhfjhfj1/autokeras>`__
*
DARTS: `https://github.com/quark0/darts <https://github.com/quark0/darts>`__
*
ENAS: `https://github.com/melodyguan/enas <https://github.com/melodyguan/enas>`__
*
NAO: `https://github.com/renqianluo/NAO <https://github.com/renqianluo/NAO>`__
Experiment Description
----------------------
To avoid over-fitting to **CIFAR-10**\ , we also compare the models on five other datasets: Fashion-MNIST, CIFAR-100, OUI-Adience-Age, ImageNet-10-1 (a subset of ImageNet) and ImageNet-10-2 (another subset of ImageNet). ImageNet-10-1 and ImageNet-10-2 are each built by sampling a subset with 10 different labels from ImageNet.
.. list-table::
:header-rows: 1
:widths: auto
* - Dataset
- Training Size
- Number of Classes
- Descriptions
* - `Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`__
- 60,000
- 10
- T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot.
* - `CIFAR-10 <https://www.cs.toronto.edu/~kriz/cifar.html>`__
- 50,000
- 10
- Airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks.
* - `CIFAR-100 <https://www.cs.toronto.edu/~kriz/cifar.html>`__
- 50,000
- 100
- Similar to CIFAR-10 but with 100 classes of 600 images each.
* - `OUI-Adience-Age <https://talhassner.github.io/home/projects/Adience/Adience-data.html>`__
- 26,580
- 8
- 8 age groups/labels (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-).
* - `ImageNet-10-1 <http://www.image-net.org/>`__
- 9,750
- 10
- Coffee mug, computer keyboard, dining table, wardrobe, lawn mower, microphone, swing, sewing machine, odometer and gas pump.
* - `ImageNet-10-2 <http://www.image-net.org/>`__
- 9,750
- 10
- Drum, banjo, whistle, grand piano, violin, organ, acoustic guitar, trombone, flute and sax.
We do not change the default fine-tuning technique in their source code. To match each task, only the code for the input image shape and the number of output classes is changed.
The search phase for all NAS methods is limited to **two days**, and so is the retraining phase. Average results are reported based on **three repetitions**. Our evaluation machines have one NVIDIA Tesla P100 GPU, 112 GB of RAM and one 2.60 GHz CPU (Intel E5-2690).
NAO requires too many computing resources, so we only use NAO-WS, which provides the pipeline script.
For AutoKeras, we used version 0.2.18 because it was the latest version when we started the experiments.
NAS Performance
---------------
.. list-table::
:header-rows: 1
:widths: auto
* - NAS
- AutoKeras (%)
- ENAS (macro) (%)
- ENAS (micro) (%)
- DARTS (%)
- NAO-WS (%)
* - Fashion-MNIST
- 91.84
- 95.44
- 95.53
- **95.74**
- 95.20
* - CIFAR-10
- 75.78
- 95.68
- **96.16**
- 94.23
- 95.64
* - CIFAR-100
- 43.61
- 78.13
- 78.84
- **79.74**
- 75.75
* - OUI-Adience-Age
- 63.20
- **80.34**
- 78.55
- 76.83
- 72.96
* - ImageNet-10-1
- 61.80
- 77.07
- 79.80
- **80.48**
- 77.20
* - ImageNet-10-2
- 37.20
- 58.13
- 56.47
- 60.53
- **61.20**
Unfortunately, we could not reproduce all the results reported in the papers.
The best or average results reported in the papers are:
.. list-table::
:header-rows: 1
:widths: auto
* - NAS
- AutoKeras(%)
- ENAS (macro) (%)
- ENAS (micro) (%)
- DARTS (%)
- NAO-WS (%)
* - CIFAR-10
- 88.56(best)
- 96.13(best)
- 97.11(best)
- 97.17(average)
- 96.47(best)
AutoKeras shows relatively worse performance across all datasets due to the random factor in its network morphism.
For ENAS, ENAS (macro) shows good results on OUI-Adience-Age and ENAS (micro) shows good results on CIFAR-10.
DARTS performs well on some datasets but shows high variance on others; the difference among three benchmark runs can be up to 5.37% on OUI-Adience-Age and 4.36% on ImageNet-10-1.
NAO-WS shows good results on ImageNet-10-2 but can perform very poorly on OUI-Adience-Age.
Reference
---------
#.
Jin, Haifeng, Qingquan Song, and Xia Hu. "Efficient neural architecture search with network morphism." *arXiv preprint arXiv:1806.10282* (2018).
#.
Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "DARTS: Differentiable architecture search." *arXiv preprint arXiv:1806.09055* (2018).
#.
Pham, Hieu, et al. "Efficient Neural Architecture Search via Parameters Sharing." International Conference on Machine Learning (2018): 4092-4101.
#.
Luo, Renqian, et al. "Neural Architecture Optimization." Neural Information Processing Systems (2018): 7827-7838.
nnSpider Emoticons
==================
* Comfort
.. image:: images/nn_spider/comfort.png
:width: 400
* Crying
.. image:: images/nn_spider/crying.png
:width: 400
* Cut
.. image:: images/nn_spider/cut.png
:width: 400
* Error
.. image:: images/nn_spider/error.png
:width: 400
* Holiday
.. image:: images/nn_spider/holiday.png
:width: 400
* No bug
.. image:: images/nn_spider/nobug.png
:width: 400
* Sign
.. image:: images/nn_spider/sign.png
:width: 400
* Sweat
.. image:: images/nn_spider/sweat.png
:width: 400
* Weaving
.. image:: images/nn_spider/weaving.png
:width: 400
* Working
.. image:: images/nn_spider/working.png
:width: 400
.. role:: raw-html(raw)
:format: html
NNI review article from Zhihu: :raw-html:`<an open source project with highly reasonable design>` - By Garvin Li
========================================================================================================================
This article is by an NNI user on the Zhihu forum. In it, Garvin shares his experience of using NNI for automatic feature engineering. We think this article is very useful for users who are interested in using NNI for feature engineering. With the author's permission, we translated the original article into English.
**source**\ : `How do you view Microsoft's newly released AutoML platform NNI? By Garvin Li <https://www.zhihu.com/question/297982959/answer/964961829?utm_source=wechat_session&utm_medium=social&utm_oi=28812108627968&from=singlemessage&isappinstalled=0>`__
01 Overview of AutoML
---------------------
In the author's opinion, AutoML is not only about hyperparameter optimization, but
also a process that can target various stages of the machine learning process,
including feature engineering, NAS, HPO, etc.
02 Overview of NNI
------------------
NNI (Neural Network Intelligence) is an open source AutoML toolkit from
Microsoft, to help users design and tune machine learning models, neural network
architectures, or a complex system’s parameters in an efficient and automatic
way.
Link: `https://github.com/Microsoft/nni <https://github.com/Microsoft/nni>`__
In general, most Microsoft tools share one prominent characteristic: the
design is highly reasonable (regardless of the degree of technical innovation).
NNI's AutoFeatureENG basically meets all user requirements for automatic feature
engineering with a very reasonable underlying framework design.
03 Details of NNI-AutoFeatureENG
--------------------------------
..
The article is following the github project: `https://github.com/SpongebBob/tabular_automl_NNI <https://github.com/SpongebBob/tabular_automl_NNI>`__.
Each new user can do AutoFeatureENG with NNI easily and efficiently. To explore the AutoFeatureENG capability, download the required files, and then install NNI through pip.
.. image:: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
:target: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
:alt:
NNI treats AutoFeatureENG as a two-step task: feature generation exploration and feature selection. Feature generation exploration is mainly about feature derivation and high-order feature combination.
04 Feature Exploration
----------------------
For feature derivation, NNI offers many operations which can automatically generate new features; they are listed \ `as follows <https://github.com/SpongebBob/tabular_automl_NNI/blob/master/AutoFEOp.md>`__\ :
**count**\ : Count encoding is based on replacing categories with their counts computed on the train set, also named frequency encoding.
**target**\ : Target encoding is based on encoding categorical variable values with the mean of target variable per value.
**embedding**\ : Regard features as sentences and generate vectors using *Word2Vec*.
**crosscout**\ : Count encoding on more than one dimension, similar to CTR (Click Through Rate).
**aggregete**\ : Decide the aggregation functions of the features, including min/max/mean/var.
**nunique**\ : Statistics of the number of unique features.
**histsta**\ : Statistics of feature buckets, like histogram statistics.
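To make the count operation concrete, here is a tiny count-encoding sketch in pandas; it is purely illustrative and not part of the project's code:

.. code-block:: python

   import pandas as pd

   # toy training data with one categorical column
   df = pd.DataFrame({'C1': ['a', 'b', 'a', 'c', 'a', 'b']})
   counts = df['C1'].value_counts()          # frequencies computed on the train set
   df['C1_count'] = df['C1'].map(counts)     # replace each category by its frequency
   print(df)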
The search space can be defined in a **JSON file**\ : it specifies how specific features intersect, which two columns intersect, and how new features are generated from the corresponding columns.
.. image:: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
:target: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
:alt:
The picture shows the procedure of defining a search space. NNI provides count encoding as a 1-order operation, as well as cross count encoding and aggregated statistics (min, max, var, mean, median, nunique) as 2-order operations.
For example, if we want to search for frequency encoding (value count) features on the columns named {"C1", ..., "C26"}, we can define them in the following way:
.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
:target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
:alt:
We can define a cross frequency encoding (value count on crossed dimensions) method on columns {"C1", ..., "C26"} x {"C1", ..., "C26"} in the following way:
.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
:target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
:alt:
The purpose of exploration is to generate new features. You can use the **get_next_parameter** function to get the received feature candidates of one trial:
..
RECEIVED_PARAMS = nni.get_next_parameter()
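In a complete trial script, the received candidates are used to build the features, train a model and report the resulting metric back to NNI. A minimal sketch is shown below; the ``train_and_eval`` helper is hypothetical and not part of the tabular_automl_NNI project.

.. code-block:: python

   import nni

   def train_and_eval(params):
       """Hypothetical placeholder: generate the candidate features described by
       ``params``, train a model (e.g. LightGBM) and return a validation metric."""
       return 0.0

   if __name__ == '__main__':
       received_params = nni.get_next_parameter()   # feature candidates proposed by the tuner
       score = train_and_eval(received_params)
       nni.report_final_result(score)               # report the metric so the tuner can rank candidates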
05 Feature selection
--------------------
To avoid feature explosion and overfitting, feature selection is necessary. In the feature selection of NNI-AutoFeatureENG, LightGBM (Light Gradient Boosting Machine), a gradient boosting framework developed by Microsoft, is mainly promoted.
.. image:: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
:target: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
:alt:
If you have used **XGBoost** or **GBDT**\ , you know that tree-based algorithms can easily calculate the importance of each feature for the final result, so LightGBM lends itself naturally to feature selection.
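As a rough illustration (not the project's exact code), the importances from a fitted LightGBM model can be used to keep only the top-ranked features:

.. code-block:: python

   import numpy as np
   import lightgbm as lgb
   from sklearn.datasets import make_classification

   X, y = make_classification(n_samples=500, n_features=20, random_state=0)
   model = lgb.LGBMClassifier(n_estimators=100).fit(X, y)

   top_k = np.argsort(model.feature_importances_)[::-1][:10]   # indices of the 10 most important features
   X_selected = X[:, top_k]                                    # keep only those columns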
The issue is that the selected features might work well for *GBDT* (Gradient Boosting Decision Tree), but not for linear algorithms like *LR* (Logistic Regression).
.. image:: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
:target: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
:alt:
06 Summary
----------
NNI's AutoFeatureENG sets a well-established standard and shows us the operation procedure and available modules; it is highly convenient to use. However, a simple model is probably not enough for good results.
Suggestions to NNI
------------------
About exploration: it would be better to consider using a DNN (like xDeepFM) to extract high-order features.
About selection: there could be more intelligent options, such as an automatic selection system based on downstream models.
Conclusion: NNI can offer users some design inspiration and it is a good open-source project. I suggest researchers leverage it to accelerate their AI research.
Tip: because the scripts of the open-source project are compiled with gcc 7, macOS users may encounter gcc (GNU Compiler Collection) problems. The solution is as follows:
.. code-block:: bash
brew install libomp
Use NNI on Google Colab
=======================
NNI can easily run on the Google Colab platform. However, Colab doesn't expose its public IP and ports, so by default you cannot access NNI's Web UI on Colab. To solve this, you need reverse proxy software such as ``ngrok`` or ``frp``. This tutorial shows how to use ngrok to access NNI's Web UI on Colab.
How to Open NNI's Web UI on Google Colab
----------------------------------------
#. Install the required packages and software.
.. code-block:: bash
! pip install nni # install nni
! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip # download ngrok and unzip it
! unzip ngrok-stable-linux-amd64.zip
! mkdir -p nni_repo
! git clone https://github.com/microsoft/nni.git nni_repo/nni # clone NNI's official repo to get examples
#. Register an ngrok account `here <https://ngrok.com/>`__, then connect to your account using your authtoken.
.. code-block:: bash
! ./ngrok authtoken YOUR_AUTH_TOKEN
#. Start an NNI example on a port greater than 1024, then start ngrok on the same port. If you want to use a GPU, make sure gpuNum >= 1 in config.yml. Use ``get_ipython()`` to start ngrok, since it will hang if you use ``! ngrok http 5000 &``.
.. code-block:: bash
! nnictl create --config nni_repo/nni/examples/trials/mnist-pytorch/config.yml --port 5000 &
.. code-block:: python
get_ipython().system_raw('./ngrok http 5000 &')
#. Check the public URL.
.. code-block:: bash
! curl -s http://localhost:4040/api/tunnels # don't change the port number 4040
You will see a URL like ``http://xxxx.ngrok.io`` after step 4. Open this URL and you will find NNI's Web UI. Have fun :)
Access Web UI with frp
----------------------
frp is another reverse proxy tool with similar functionality. However, frp doesn't provide free public URLs, so you may need a server with a public IP to act as the frp server. See `here <https://github.com/fatedier/frp>`__ to learn more about how to deploy frp.
.. role:: raw-html(raw)
:format: html
Tuning Tensor Operators on NNI
==============================
Overview
--------
Abundant applications raise the demand for training and running inference of deep neural networks (DNNs) efficiently on diverse hardware platforms, ranging from cloud servers to embedded devices. Moreover, computational graph-level optimization of deep neural networks, such as tensor operator fusion, may introduce new tensor operators. Thus, manually optimized tensor operators provided by hardware-specific libraries are limited in their support of new hardware platforms and new operators, so automatically optimizing tensor operators on diverse hardware platforms is essential for the large-scale deployment of deep learning technologies in real-world problems.
Tensor operator optimization is essentially a combinatorial optimization problem. The objective function is the performance of a tensor operator on a specific hardware platform, which should be maximized with respect to the hyper-parameters of the corresponding device code, such as how to tile a matrix or whether to unroll a loop. Unlike many typical problems of this type, such as the travelling salesman problem, the objective function of tensor operator optimization is a black box and expensive to sample: one has to compile the device code with a specific configuration and run it on real hardware to get the corresponding performance metric. Therefore, a desirable method for optimizing tensor operators should find the best configuration with as few samples as possible.
The expensive objective function makes solving the tensor operator optimization problem with traditional combinatorial optimization methods, for example simulated annealing and evolutionary algorithms, almost impossible. Although these algorithms inherently support combinatorial search spaces, they do not take sample efficiency into account,
so thousands of samples or more are usually needed, which is unacceptable when tuning tensor operators in production environments. On the other hand, sequential model-based optimization (SMBO) methods have proved sample-efficient for optimizing black-box functions with continuous search spaces. However, when optimizing functions with combinatorial search spaces, SMBO methods are not as sample-efficient as their continuous counterparts, because there is a lack of prior assumptions about the objective functions, such as the continuity and differentiability available in the case of continuous search spaces. For example, if one can assume that an objective function with a continuous search space is infinitely differentiable, a Gaussian process with a radial basis function (RBF) kernel can be used to model the objective function. In this way, a sample provides not only a single value at a point but also the local properties of the objective function in its neighborhood, or even global properties,
which results in high sample efficiency. In contrast, SMBO methods for combinatorial optimization suffer from poor sample efficiency due to the lack of proper prior assumptions and of surrogate models which can leverage them.
OpEvo was recently proposed to solve this challenging problem. It efficiently explores the search spaces of tensor operators by introducing a topology-aware mutation operation based on a q-random-walk distribution to leverage the topological structure of the search spaces. Following this example, you can use OpEvo to tune three representative types of tensor operators selected from two popular neural networks, BERT and AlexNet. Three comparison baselines, AutoTVM, G-BFS and N-A2C, are also provided. Please refer to `OpEvo: An Evolutionary Method for Tensor Operator Optimization <https://arxiv.org/abs/2006.05664>`__ for a detailed explanation of these algorithms.
Environment Setup
-----------------
We have prepared a Dockerfile for setting up the experiment environment. Before starting, please make sure the Docker daemon is running and the driver of your GPU accelerator is properly installed. Enter the example folder ``examples/trials/systems/opevo`` and run the command below to build and instantiate a Docker image from the Dockerfile.
.. code-block:: bash
# if you are using Nvidia GPU
make cuda-env
# if you are using AMD GPU
make rocm-env
Run Experiments:
----------------
Three representative kinds of tensor operators, **matrix multiplication**\ , **batched matrix multiplication** and **2D convolution**\ , are chosen from BERT and AlexNet and tuned with NNI. The ``Trial`` code for all tensor operators is ``/root/compiler_auto_tune_stable.py``\ , and the ``Search Space`` files and ``config`` files for each tuning algorithm are located in ``/root/experiments/``\ , categorized by tensor operator. Here ``/root`` refers to the root directory of the container.
For tuning the matrix multiplication operators, please run the commands below from ``/root``\ :
.. code-block:: bash
# (N, K) x (K, M) represents a matrix of shape (N, K) multiplies a matrix of shape (K, M)
# (512, 1024) x (1024, 1024)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K1024M1024/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K1024M1024/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K1024M1024/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=1024 K=1024 P=NN ./run.sh
# (512, 1024) x (1024, 4096)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K1024M4096/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K1024M4096/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K1024M4096/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=1024 K=4096 P=NN ./run.sh
# (512, 4096) x (4096, 1024)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K4096M1024/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K4096M1024/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K4096M1024/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=4096 K=1024 P=NN ./run.sh
For tuning the batched matrix multiplication operators, please run the commands below from ``/root``\ :
.. code-block:: bash
# batched matrix with batch size 960 and shape of matrix (128, 128) multiplies batched matrix with batch size 960 and shape of matrix (128, 64)
# tuning with OpEvo
nnictl create --config experiments/bmm/B960N128K128M64PNN/config_opevo.yml
# tuning with AutoTVM
OP=batch_matmul STEP=512 B=960 N=128 K=128 M=64 P=NN ./run.sh
# batched matrix with batch size 960 and shape of matrix (128, 128) is transposed first and then multiplies batched matrix with batch size 960 and shape of matrix (128, 64)
# tuning with OpEvo
nnictl create --config experiments/bmm/B960N128K128M64PTN/config_opevo.yml
# tuning with AutoTVM
OP=batch_matmul STEP=512 B=960 N=128 K=128 M=64 P=TN ./run.sh
# batched matrix with batch size 960 and shape of matrix (128, 64) is transposed first and then right multiplies batched matrix with batch size 960 and shape of matrix (128, 64).
# tuning with OpEvo
nnictl create --config experiments/bmm/B960N128K64M128PNT/config_opevo.yml
# tuning with AutoTVM
OP=batch_matmul STEP=512 B=960 N=128 K=64 M=128 P=NT ./run.sh
For tuning the 2D convolution operators, please run the commands below from ``/root``\ :
.. code-block:: bash
# image tensor of shape (512, 3, 227, 227) convolves with kernel tensor of shape (64, 3, 11, 11) with stride 4 and padding 0
# tuning with OpEvo
nnictl create --config experiments/conv/N512C3HW227F64K11ST4PD0/config_opevo.yml
# tuning with AutoTVM
OP=convfwd_direct STEP=512 N=512 C=3 H=227 W=227 F=64 K=11 ST=4 PD=0 ./run.sh
# image tensor of shape (512, 64, 27, 27) convolves with kernel tensor of shape (192, 64, 5, 5) with stride 1 and padding 2
# tuning with OpEvo
nnictl create --config experiments/conv/N512C64HW27F192K5ST1PD2/config_opevo.yml
# tuning with AutoTVM
OP=convfwd_direct STEP=512 N=512 C=64 H=27 W=27 F=192 K=5 ST=1 PD=2 ./run.sh
Please note that G-BFS and N-A2C are only designed for tuning tiling schemes of matrix multiplications whose rows and columns are powers of 2, so they are not compatible with other types of configuration spaces and thus are not eligible for tuning the batched matrix multiplication and 2D convolution operators. Here, AutoTVM is implemented by its authors in the TVM project, so its tuning results are printed on the screen rather than reported to the NNI manager. Port 8080 of the container is bound to the same port on the host, so one can access the NNI Web UI at ``host_ip_addr:8080`` and monitor the tuning process as in the screenshot below.
.. image:: ../../img/opevo.png
Citing OpEvo
------------
If you feel OpEvo is helpful, please consider citing the paper as follows:
.. code-block:: bibtex
@misc{gao2020opevo,
title={OpEvo: An Evolutionary Method for Tensor Operator Optimization},
author={Xiaotian Gao and Cui Wei and Lintao Zhang and Mao Yang},
year={2020},
eprint={2006.05664},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Use Cases and Solutions
=======================
Different from the tutorials and examples in the rest of the documentation, which show the usage of individual features, this part mainly introduces end-to-end scenarios and use cases to help users further understand how NNI can help them. NNI can be widely adopted in various scenarios. We also encourage community contributors to share their AutoML practices, especially their NNI usage practices, from their own experience.
Automatic Model Tuning
----------------------
NNI can be applied on various model tuning tasks. Some state-of-the-art model search algorithms, such as EfficientNet, can be easily built on NNI. Popular models, e.g., recommendation models, can be tuned with NNI. The following are some use cases to illustrate how to leverage NNI in your model tuning tasks and how to build your own pipeline with NNI.
* :doc:`Tuning SVD automatically <recommenders_svd>`
* :doc:`EfficientNet on NNI <efficientnet>`
* :doc:`Automatic Model Architecture Search for Reading Comprehension <squad_evolution_examples>`
* :doc:`Parallelizing Optimization for TPE <parallelizing_tpe_search>`
Automatic System Tuning
-----------------------
The performance of systems, such as databases and tensor operator implementations, often needs to be tuned to adapt to a specific hardware configuration, targeted workload, etc. Manually tuning a system is complicated and often requires a detailed understanding of the hardware and workload. NNI can make such tasks much easier and help system owners find the best configuration for the system automatically. The detailed design philosophy of automatic system tuning can be found in this `paper <https://dl.acm.org/doi/10.1145/3352020.3352031>`__ . The following are some typical cases where NNI can help.
* :doc:`Tuning SPTAG (Space Partition Tree And Graph) automatically <sptag_auto_tune>`
* :doc:`Tuning the performance of RocksDB <rocksdb_examples>`
* :doc:`Tuning Tensor Operators automatically <op_evo_examples>`
Model Compression
-----------------
The following one shows how to apply knowledge distillation on NNI model compression. More use cases and solutions will be added in the future.
* :doc:`Knowledge distillation with NNI model compression <kd_example>`
Feature Engineering
-------------------
The following is an article about how NNI helps in auto feature engineering shared by a community contributor. More use cases and solutions will be added in the future.
* :doc:`NNI review article from Zhihu: - By Garvin Li <nni_autofeatureeng>`
Performance Measurement, Comparison and Analysis
------------------------------------------------
Performance comparison and analysis can help users decide a proper algorithm (e.g., tuner, NAS algorithm) for their scenario. The following are some measurement and comparison data for users' reference.
* :doc:`Neural Architecture Search Comparison <nas_comparison>`
* :doc:`Hyper-parameter Tuning Algorithm Comparison <hpo_comparison>`
* :doc:`Model Compression Algorithm Comparison <model_compress_comp>`
.. role:: raw-html(raw)
:format: html
Parallelizing a Sequential Algorithm: TPE
=============================================
TPE is usually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete. To do so, the so-called constant liar approach is used: each time a candidate point x∗ is proposed, a fake fitness value y is assigned temporarily, until the evaluation completes and reports the actual loss f(x∗).
Introduction and Problems
-------------------------
Sequential Model-based Global Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sequential Model-Based Global Optimization (SMBO) algorithms have been used in many applications where evaluation of the fitness function is expensive. In an application where the true fitness function f: X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or of some transformation of the surrogate. The point x∗ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. This active-learning-like algorithm template is summarized in the figure below. SMBO algorithms differ in the criterion they optimize to obtain x∗ given a model (or surrogate) of f, and in how they model f via the observation history H.
.. image:: ../../img/parallel_tpe_search4.PNG
:target: ../../img/parallel_tpe_search4.PNG
:alt:
The algorithms in this work optimize the criterion of Expected Improvement (EI). Other criteria have been suggested, such as Probability of Improvement and Expected Improvement, minimizing the Conditional Entropy of the Minimizer, and the bandit-based criterion. We chose to use the EI criterion in TPE because it is intuitive and has been shown to work well in a variety of settings. Expected improvement is the expectation under some model M of f : X → R that f(x) will exceed (negatively) some threshold y∗:
.. image:: ../../img/parallel_tpe_search_ei.PNG
:target: ../../img/parallel_tpe_search_ei.PNG
:alt:
Since the calculation of p(y|x) is expensive, the TPE approach models p(y|x) via p(x|y) and p(y). TPE defines p(x|y) using two such densities:
.. image:: ../../img/parallel_tpe_search_tpe.PNG
:target: ../../img/parallel_tpe_search_tpe.PNG
:alt:
where l(x) is the density formed by using the observations {x(i)} whose corresponding loss
f(x(i)) was less than y∗, and g(x) is the density formed by using the remaining observations. The TPE algorithm depends on a y∗ that is larger than the best observed f(x), so that some points can be used to form l(x). It chooses y∗ to be some quantile γ of the observed y values, so that p(y < y∗) = γ, but no specific model for p(y) is necessary. The tree-structured form of l and g makes it easy to draw many candidates according to l and evaluate them according to g(x)/l(x). On each iteration, the algorithm returns the candidate x∗ with the greatest EI.
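The following sketch illustrates this proposal step in one dimension, with Gaussian kernel density estimates standing in for the tree-structured Parzen estimators; it is a simplification for intuition only, not NNI's implementation:

.. code-block:: python

   import numpy as np
   from scipy.stats import gaussian_kde

   def tpe_propose(xs, ys, gamma=0.25, n_candidates=64):
       """Propose the next x from observations (xs, ys), assuming smaller y is better."""
       xs, ys = np.asarray(xs, float), np.asarray(ys, float)
       y_star = np.quantile(ys, gamma)                         # threshold separating "good" from "bad" points
       l = gaussian_kde(xs[ys < y_star])                       # density of the good observations, l(x)
       g = gaussian_kde(xs[ys >= y_star])                      # density of the remaining observations, g(x)
       candidates = l.resample(n_candidates, seed=0).ravel()   # draw candidates according to l
       scores = l(candidates) / g(candidates)                  # a larger l/g ratio means a larger EI
       return candidates[np.argmax(scores)]

Called sequentially, ``tpe_propose`` returns one point per iteration; the parallel setting discussed below is about proposing several such points at once.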
Here is a simulation of the TPE algorithm in a two-dimensional search space. Differences in background color represent different objective values. It can be seen that TPE combines exploration and exploitation very well. (Black indicates the points sampled in this round, and yellow indicates points already taken in previous rounds.)
.. image:: ../../img/parallel_tpe_search1.gif
:target: ../../img/parallel_tpe_search1.gif
:alt:
**Since EI is a continuous function, the x that maximizes EI is fixed for a given state.** As shown in the figure below, the blue triangle is the point most likely to be sampled in this state.
.. image:: ../../img/parallel_tpe_search_ei2.PNG
:target: ../../img/parallel_tpe_search_ei2.PNG
:alt:
TPE performs well when used sequentially, but if we increase the concurrency, **a large number of points will be produced from the same EI state**\ ; such overly concentrated points reduce the exploration ability of the tuner and waste resources.
Here is the simulation when we set ``concurrency=60``\ ; the phenomenon is obvious.
.. image:: ../../img/parallel_tpe_search2.gif
:target: ../../img/parallel_tpe_search2.gif
:alt:
Research solution
-----------------
Approximated q-EI Maximization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The multi-points criterion presented below can potentially be used to deliver an additional design of experiments in one step through the resolution of the optimization problem:
.. image:: ../../img/parallel_tpe_search_qEI.PNG
:target: ../../img/parallel_tpe_search_qEI.PNG
:alt:
However, the computation of q-EI becomes intensive as q increases. From our research, there are four popular greedy strategies that approximate the solution of this problem while avoiding its numerical cost.
Solution 1: Believing the OK Predictor: The KB (Kriging Believer) Heuristic Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Kriging Believer strategy replaces the conditional knowledge about the responses at the sites chosen within the last iterations by deterministic values equal to the expectation of the Kriging predictor. Keeping the same notations as previously, the strategy can be summed up as follows:
.. image:: ../../img/parallel_tpe_search_kb.PNG
:target: ../../img/parallel_tpe_search_kb.PNG
:alt:
This sequential strategy delivers a q-points design and is computationally affordable since it relies on the analytically known EI, optimized in d dimensions. However, there is a risk of failure, since believing an OK predictor that overshoots the observed data may lead to a sequence that gets trapped in a non-optimal region for many iterations. We now propose a second strategy that reduces this risk.
Solution 2: The CL (Constant Liar) Heuristic Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let us now consider a sequential strategy in which the metamodel is updated (still without hyperparameter re-estimation) at each iteration with a value L exogenously fixed by the user, here called a "lie". The strategy referred to as the Constant Liar consists of lying with the same value L at every iteration: maximize EI (i.e. find xn+1), update the model as if y(xn+1) = L, and so on, always with the same L ∈ R:
.. image:: ../../img/parallel_tpe_search_cl.PNG
:target: ../../img/parallel_tpe_search_cl.PNG
:alt:
L should logically be determined on the basis of the values taken by y at X. Three values, min{Y}, mean{Y} and max{Y}, are considered here. **The larger L is, the more explorative the algorithm will be, and vice versa.**
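Below is a minimal sketch of the Constant Liar idea (illustrative only, not NNI's actual implementation); the ``propose_by_ei`` argument stands in for any sequential proposal step, such as the TPE sketch above:

.. code-block:: python

   import statistics

   def propose_batch(history, q, propose_by_ei, lie='mean'):
       """history: list of (x, y) observations; q: number of points to propose in parallel."""
       ys = [y for _, y in history]
       L = {'min': min(ys), 'mean': statistics.mean(ys), 'max': max(ys)}[lie]
       fake_history = list(history)
       batch = []
       for _ in range(q):
           x_next = propose_by_ei(fake_history)   # maximize EI on the current (partly faked) history
           fake_history.append((x_next, L))       # lie: pretend we already observed the value L at x_next
           batch.append(x_next)
       return batch                               # q proposals that can be evaluated concurrently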
We have simulated the method above. The following figure shows the result of using mean-value lies to maximize q-EI. We find that the sampled points are now more scattered.
.. image:: ../../img/parallel_tpe_search3.gif
:target: ../../img/parallel_tpe_search3.gif
:alt:
Experiment
----------
Branin-Hoo
^^^^^^^^^^
The four optimization strategies presented in the last section are now compared on the Branin-Hoo function, which is a classical test case in global optimization.
.. image:: ../../img/parallel_tpe_search_branin.PNG
:target: ../../img/parallel_tpe_search_branin.PNG
:alt:
The recommended values of a, b, c, r, s and t are: a = 1, b = 5.1 / (4π²), c = 5 / π, r = 6, s = 10 and t = 1 / (8π). This function has three global minimizers: (-3.14, 12.27), (3.14, 2.27), (9.42, 2.47).
Next is a comparison of the q-EI associated with the first q points (q ∈ [1,10]) given by the constant liar strategies (min and max), 2000 q-point designs uniformly drawn for every q, and 2000 q-point LHS designs taken at random for every q.
.. image:: ../../img/parallel_tpe_search_result.PNG
:target: ../../img/parallel_tpe_search_result.PNG
:alt:
As we can see in the figure, CL[max] and CL[min] offer very good q-EI results compared to random designs, especially for small values of q.
Gaussian Mixture Model function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We also compared the cases with and without parallel optimization. A two-dimensional multimodal Gaussian mixture distribution is used for the simulation; the following are our results:
.. list-table::
:header-rows: 1
:widths: auto
* -
- concurrency=80
- concurrency=60
- concurrency=40
- concurrency=20
- concurrency=10
* - Without parallel optimization
- avg = 0.4841 :raw-html:`<br>` var = 0.1953
- avg = 0.5155 :raw-html:`<br>` var = 0.2219
- avg = 0.5773 :raw-html:`<br>` var = 0.2570
- avg = 0.4680 :raw-html:`<br>` var = 0.1994
- avg = 0.2774 :raw-html:`<br>` var = 0.1217
* - With parallel optimization
- avg = 0.2132 :raw-html:`<br>` var = 0.0700
- avg = 0.2177 :raw-html:`<br>` var = 0.0796
- avg = 0.1835 :raw-html:`<br>` var = 0.0533
- avg = 0.1671 :raw-html:`<br>` var = 0.0413
- avg = 0.1918 :raw-html:`<br>` var = 0.0697
Note: the total number of samples per test is 240 (to ensure an equal budget). Each setting was repeated 1000 times; the values are the average and variance of the best results over the 1000 trials.
References
----------
[1] James Bergstra, Remi Bardenet, Yoshua Bengio, Balazs Kegl. `Algorithms for Hyper-Parameter Optimization. <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__
[2] Meng-Hiot Lim, Yew-Soon Ong. `Computational Intelligence in Expensive Optimization Problems. <https://link.springer.com/content/pdf/10.1007%2F978-3-642-10701-6.pdf>`__
[3] Christopher M. Bishop. `Pattern Recognition and Machine Learning. <http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf>`__
Performance Measurement, Comparison and Analysis
================================================
.. toctree::
:maxdepth: 1
Neural Architecture Search Comparison <nas_comparison>
Hyper-parameter Tuning Algorithm Comparison <hpo_comparison>
Model Compression Algorithm Comparison <model_compress_comp>
Automatically tuning SVD (NNI in Recommenders)
==============================================
In this tutorial, we first introduce the GitHub repo `Recommenders <https://github.com/Microsoft/Recommenders>`__. It is a repository that provides examples and best practices for building recommendation systems, presented as Jupyter notebooks. It covers various models that are popular and widely deployed in recommendation systems. To provide a complete end-to-end experience, each example is presented around five key tasks, as shown below:
* `Prepare Data <https://github.com/microsoft/recommenders/tree/master/examples/01_prepare_data>`__\ : Preparing and loading data for each recommender algorithm.
* Model(`collaborative filtering algorithms <https://github.com/microsoft/recommenders/tree/master/examples/02_model_collaborative_filtering>`__\ , `content-based filtering algorithms <https://github.com/microsoft/recommenders/tree/master/examples/02_model_content_based_filtering>`__\ , `hybrid algorithms <https://github.com/microsoft/recommenders/tree/master/examples/02_model_hybrid>`__\ ): Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (\ `ALS <https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS>`__\ ) or eXtreme Deep Factorization Machines (\ `xDeepFM <https://arxiv.org/abs/1803.05170>`__\ ).
* `Evaluate <https://github.com/microsoft/recommenders/tree/master/examples/03_evaluate>`__\ : Evaluating algorithms with offline metrics.
* `Model Select and Optimize <https://github.com/microsoft/recommenders/tree/master/examples/04_model_select_and_optimize>`__\ : Tuning and optimizing hyperparameters for recommender models.
* `Operationalize <https://github.com/microsoft/recommenders/tree/master/examples/05_operationalize>`__\ : Operationalizing models in a production environment on Azure.
The fourth task is tuning and optimizing the model's hyperparameters; this is where NNI can help. To give a concrete example of NNI tuning a model in Recommenders, let's demonstrate with the `SVD <https://github.com/microsoft/recommenders/blob/master/examples/02_model_collaborative_filtering/surprise_svd_deep_dive.ipynb>`__ model and the MovieLens 100k dataset. There are more than 10 hyperparameters to be tuned in this model.
This `Jupyter notebook <https://github.com/microsoft/recommenders/blob/master/examples/04_model_select_and_optimize/nni_surprise_svd.ipynb>`__ provided by Recommenders is a very detailed step-by-step tutorial for this example. It uses different built-in tuning algorithms in NNI, including ``Annealing``\ , ``SMAC``\ , ``Random Search``\ , ``TPE``\ , ``Hyperband``\ , ``Metis`` and ``Evolution``. Finally, the results of the different tuning algorithms are compared. Please go through this notebook to learn how to use NNI to tune the SVD model; you can then use NNI to tune other models in Recommenders.
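For a flavor of what the tuning looks like, below is a possible search-space fragment for the Surprise SVD hyperparameters expressed as a Python dict; it is illustrative only, and the notebook above defines the actual space and value ranges.

.. code-block:: python

   # a hypothetical subset of the SVD hyperparameters (the notebook tunes more of them)
   svd_search_space = {
       'n_factors': {'_type': 'choice', '_value': [50, 100, 150, 200]},
       'n_epochs': {'_type': 'choice', '_value': [10, 20, 30]},
       'lr_all': {'_type': 'uniform', '_value': [0.001, 0.01]},
       'reg_all': {'_type': 'uniform', '_value': [0.01, 0.1]},
   }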
Tuning RocksDB on NNI
=====================
Overview
--------
`RocksDB <https://github.com/facebook/rocksdb>`__ is a popular high-performance embedded key-value database used in production systems at various web-scale enterprises including Facebook, Yahoo!, and LinkedIn. It is a fork of `LevelDB <https://github.com/google/leveldb>`__ by Facebook, optimized to exploit many central processing unit (CPU) cores and to make efficient use of fast storage, such as solid-state drives (SSDs), for input/output (I/O) bound workloads.
The performance of RocksDB is highly contingent on its tuning. However, because of the complexity of its underlying technology and the large number of configurable parameters, a good configuration is sometimes hard to obtain. NNI can help to address this issue. NNI supports many kinds of tuning algorithms to search for the best configuration of RocksDB, and supports many kinds of environments, such as the local machine, remote servers and the cloud.
This example illustrates how to use NNI to search for the best configuration of RocksDB for the ``fillrandom`` benchmark supported by ``db_bench``\ , the official benchmark tool provided by RocksDB itself. Before running this example, please make sure NNI is installed and `db_bench <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__ is in your ``PATH``. Please refer to :doc:`here </installation>` for detailed information about installing NNI and preparing the NNI environment, and `here <https://github.com/facebook/rocksdb/blob/master/INSTALL.md>`__ for compiling RocksDB as well as ``db_bench``.
We also provide a simple script :githublink:`db_bench_installation.sh <examples/trials/systems_auto_tuning/rocksdb-fillrandom/db_bench_installation.sh>` helping to compile and install ``db_bench`` as well as its dependencies on Ubuntu. Installing RocksDB on other systems can follow the same procedure.
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom>`
Experiment setup
----------------
There are mainly three steps to set up an experiment for tuning systems on NNI: define the search space in a ``json`` file, write the benchmark code, and start the NNI experiment by passing a config file to the NNI manager.
Search Space
^^^^^^^^^^^^
For simplicity, this example tunes three parameters, ``write_buffer_size``\ , ``min_write_buffer_number_to_merge`` and ``level0_file_num_compaction_trigger``\ , for randomly writing 16M keys with a key size of 20 bytes and a value size of 100 bytes, based on write operations per second (OPS). ``write_buffer_size`` sets the size of a single memtable. Once a memtable exceeds this size, it is marked immutable and a new one is created. ``min_write_buffer_number_to_merge`` is the minimum number of memtables to be merged before flushing to storage. Once the number of files in level 0 reaches ``level0_file_num_compaction_trigger``\ , level 0 to level 1 compaction is triggered.
In this example, the search space is specified by a ``search_space.json`` file as shown below. Detailed explanation of search space could be found :doc:`here </hpo/search_space>`.
.. code-block:: json
{
"write_buffer_size": {
"_type": "quniform",
"_value": [2097152, 16777216, 1048576]
},
"min_write_buffer_number_to_merge": {
"_type": "quniform",
"_value": [2, 16, 1]
},
"level0_file_num_compaction_trigger": {
"_type": "quniform",
"_value": [2, 16, 1]
}
}
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/search_space.json>`
Benchmark code
^^^^^^^^^^^^^^
The benchmark code should receive a configuration from the NNI manager and report the corresponding benchmark result back. The following NNI APIs are designed for this purpose; a minimal sketch is given after the list. In this example, write operations per second (OPS) is used as the performance metric.
* Use ``nni.get_next_parameter()`` to get next system configuration.
* Use ``nni.report_final_result(metric)`` to report the benchmark result.
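Below is a minimal sketch of such a benchmark script. It assumes ``db_bench`` is on the ``PATH`` and that its ``fillrandom`` summary line contains an ``ops/sec`` figure; the actual example linked below is more complete and may differ.

.. code-block:: python

   import re
   import subprocess
   import nni

   params = nni.get_next_parameter()          # e.g. {"write_buffer_size": 4194304, ...}
   cmd = ['db_bench', '--benchmarks=fillrandom', '--num=16000000',
          '--key_size=20', '--value_size=100']
   cmd += ['--{}={}'.format(name, int(value)) for name, value in params.items()]

   output = subprocess.run(cmd, capture_output=True, text=True).stdout
   match = re.search(r'fillrandom\s*:.*?(\d+)\s+ops/sec', output)   # assumed output format
   nni.report_final_result(float(match.group(1)))                   # report write OPS to NNI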
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/main.py>`
Config file
^^^^^^^^^^^
One can start an NNI experiment with a config file. A config file for NNI is a ``yaml`` file that usually includes experiment settings (\ ``trialConcurrency``\ , ``trialGpuNumber``\ , etc.), platform settings (\ ``trainingService``\ ), path settings (\ ``searchSpaceFile``\ , ``trialCodeDirectory``\ , etc.) and tuner settings (\ ``tuner``\ , ``tuner optimize_mode``\ , etc.). Please refer to :doc:`/reference/experiment_config` for details.
Here is an example of tuning RocksDB with SMAC algorithm:
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/config_smac.yml>`
Here is an example of tuning RocksDB with TPE algorithm:
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/config_tpe.yml>`
Other tuners can be easily adopted in the same way. Please refer to :doc:`here </hpo/tuners>` for more information.
Finally, we can enter the example folder and start the experiment using the following commands:
.. code-block:: bash
# tuning RocksDB with SMAC tuner
nnictl create --config ./config_smac.yml
# tuning RocksDB with TPE tuner
nnictl create --config ./config_tpe.yml
Experiment results
------------------
We ran these two examples on the same machine with the following details:
* 16 * Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
* 465 GB of rotational hard drive with ext4 file system
* 128 GB of RAM
* Kernel version: 4.15.0-58-generic
* NNI version: v1.0-37-g1bd24577
* RocksDB version: 6.4
* RocksDB DEBUG_LEVEL: 0
The detailed experiment results are shown in the figure below. The horizontal axis is the sequential order of trials, and the vertical axis is the metric, write OPS in this example. Blue dots represent trials tuning RocksDB with the SMAC tuner, and orange dots represent trials tuning RocksDB with the TPE tuner.
.. image:: ../../img/rocksdb-fillrandom-plot.png
:target: ../../img/rocksdb-fillrandom-plot.png
:alt: image
The following table lists the best trials and the corresponding parameters and metric obtained by the two tuners. Unsurprisingly, both of them found the same optimal configuration for the ``fillrandom`` benchmark.
.. list-table::
:header-rows: 1
:widths: auto
* - Tuner
- Best trial
- Best OPS
- write_buffer_size
- min_write_buffer_number_to_merge
- level0_file_num_compaction_trigger
* - SMAC
- 255
- 779289
- 2097152
- 7.0
- 7.0
* - TPE
- 169
- 761456
- 2097152
- 7.0
- 7.0