Population Based Training (PBT) bridges and extends parallel search methods and sequential optimization methods. Its wallclock run time is no greater than that of a single optimization process, and it requires relatively little computation, because weights are periodically inherited from currently well-performing trials in order to explore better ones. With PBTTuner, users finally get a trained model, rather than a configuration that could reproduce the trained model by training from scratch; this is because model weights are inherited periodically through the whole search process, so PBT can also be seen as a training approach. If you don't need a specific configuration but just expect a good model, PBTTuner is a good choice. Note that, in our implementation, each trial runs one step of training (e.g., several training epochs), and the loading and saving of checkpoints must be specified in the trial code, which is different from other tuners. In addition, if the experiment is not in local mode, users should provide a path in a shared storage which can be accessed by all the trials. You can try it on a simple task, such as the [mnist-pbt-tuner-pytorch](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-pbt-tuner-pytorch) example. [See details](./PBTTuner.md)
**classArgs requirements:**
* **optimize_mode** (*'maximize' or 'minimize'*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
* **all_checkpoint_dir** (*str, optional, default = None*) - Directory for trials to load and save checkpoints. If not specified, the directory would be "~/nni/checkpoint/<exp-id>". Note that if the experiment is not in local mode, users should provide a path in a shared storage which can be accessed by all the trials.
* **population_size** (*int, optional, default = 10*) - Number of trials in a population. Each step runs this number of trials. In our implementation, one step means running each trial for a user-specified number of training epochs.
* **factors** (*tuple, optional, default = (1.2, 0.8)*) - Factors for perturbation of hyperparameters.
* **fraction** (*float, optional, default = 0.2*) - Fraction for selecting bottom and top trials.
**Usage example:**

```yaml
# config.yml
tuner:
  builtinTunerName: PBTTuner
  classArgs:
    optimize_mode: maximize
```
Note that, to use this tuner, your trial code has to be modified accordingly; please refer to [the document of PBTTuner](./PBTTuner.md) for details.
## **Reference and Feedback**
* To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub;
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub;
## PBTTuner

Population Based Training (PBT) comes from [Population Based Training of Neural Networks](https://arxiv.org/abs/1711.09846v1). It's a simple asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. Importantly, PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training.
PBTTuner initializes a population with several trials (i.e., `population_size`). Each trial runs one step at a time; how long a step is is controlled by the trial code, e.g., one epoch of training, which is different from other tuners. When a trial starts, it loads a checkpoint specified by PBTTuner, runs one step, saves a checkpoint to a directory specified by PBTTuner, and exits. The trials in a population run steps synchronously, that is, the `(i+1)`-th step can start only after all the trials have finished the `i`-th step. Between two consecutive steps, PBTTuner performs exploitation and exploration: the parameters and hyperparameters of trials with bad metrics are replaced with those of better-performing trials (exploit), and the hyperparameters are then perturbed (explore). This is implemented by constantly modifying the values of `load_checkpoint_dir` and `save_checkpoint_dir`: changing `load_checkpoint_dir` replaces parameters and hyperparameters, while `save_checkpoint_dir` saves a checkpoint that will be loaded in the next step. To this end, a shared folder accessible to all trials is needed.
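To make the role of `fraction` and `factors` concrete, here is a simplified, hypothetical Python sketch of one exploit/explore round. It is not PBTTuner's actual implementation; the `population` data structure and its field names are invented for illustration.

```python
import random

def exploit_and_explore(population, fraction=0.2, factors=(1.2, 0.8)):
    """Illustrative sketch of one PBT exploit/explore round (not PBTTuner's real code).

    `population` is a list of dicts like
    {"metric": 0.91, "hyperparameters": {"lr": 0.01}, "checkpoint_dir": "..."}.
    """
    # Rank trials by metric (assuming optimize_mode = 'maximize').
    ranked = sorted(population, key=lambda t: t["metric"], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]

    for trial in bottom:
        source = random.choice(top)
        # Exploit: inherit weights (via the checkpoint) and hyperparameters from a top trial.
        trial["checkpoint_dir"] = source["checkpoint_dir"]
        trial["hyperparameters"] = dict(source["hyperparameters"])
        # Explore: perturb each numerical hyperparameter by a randomly chosen factor.
        for name, value in trial["hyperparameters"].items():
            trial["hyperparameters"][name] = value * random.choice(factors)
    return population
```

In PBTTuner itself, the inherited checkpoint reaches the trial through `load_checkpoint_dir`, as described above.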
### Provide checkpoint directory

Since some trials need to load other trials' checkpoints, users should provide a directory (i.e., `all_checkpoint_dir`) which is accessible by every trial. It is the base folder of `load_checkpoint_dir` and `save_checkpoint_dir`; the checkpoint directory of a trial is `all_checkpoint_dir/<population-id>/<step>`. In local mode, users can simply use the default directory, `~/nni/experiments/<exp-id>/checkpoint`, or specify any directory on the local machine. If the experiment uses other training services, users should follow [the document of those training services](../TrainingService/SupportTrainingService.md) to provide a directory in a shared storage, such as NFS or Azure storage, which is mounted at `all_checkpoint_dir` on the worker machines (it does not have to be available on the machine that runs the tuner).
### Modify your trial code
Before running a step, a trial needs to load a checkpoint; the checkpoint directory is specified in the hyperparameter configuration generated by PBTTuner, i.e., `params['load_checkpoint_dir']`. Similarly, the directory for saving a checkpoint is also included in the configuration, i.e., `params['save_checkpoint_dir']`. Here, `all_checkpoint_dir` is the base folder of `load_checkpoint_dir` and `save_checkpoint_dir`, whose format is `all_checkpoint_dir/<population-id>/<step>`.
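The sketch below shows, in a hypothetical and heavily simplified form, how a PyTorch trial could honor these two directories. The model, the `lr` hyperparameter, and the `model.pth` file name are placeholders, not part of PBTTuner's API.

```python
import os

import nni
import torch
import torch.nn as nn

# Hypothetical minimal PBT trial (see the mnist-pbt-tuner-pytorch example for the real one).
params = nni.get_next_parameter()        # contains hyperparameters plus
                                         # 'load_checkpoint_dir' and 'save_checkpoint_dir'

model = nn.Linear(28 * 28, 10)           # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=params.get('lr', 0.01))

# Resume from the checkpoint prepared by PBTTuner, if one exists.
load_path = os.path.join(params['load_checkpoint_dir'], 'model.pth')
if os.path.isfile(load_path):
    model.load_state_dict(torch.load(load_path))

# ... run one PBT step here, e.g. a few training epochs ...

# Save the checkpoint that will be loaded in the next step.
os.makedirs(params['save_checkpoint_dir'], exist_ok=True)
torch.save(model.state_dict(), os.path.join(params['save_checkpoint_dir'], 'model.pth'))

# Report the metric that PBTTuner uses for exploitation and exploration.
nni.report_final_result(0.0)             # replace 0.0 with the real validation metric
```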
The complete example code can be found [here](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-pbt-tuner-pytorch).
### Experiment config
Below is an example of PBTTuner's configuration in the experiment config file. **Note that Assessor is not allowed if PBTTuner is used.**
```yaml
# config.yml
tuner:
  builtinTunerName: PBTTuner
  classArgs:
    optimize_mode: maximize
    all_checkpoint_dir: /the/path/to/store/checkpoints
    population_size: 10
```
### Limitations
The current implementation only supports float-valued search space types, such as `uniform` and `normal`. Support for other search space types is ongoing.
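For reference, a search space restricted to the currently supported types might look like the following; the parameter names are hypothetical, and the same structure would normally be written in the experiment's `search_space.json`.

```python
# Hypothetical search space using only float-valued types supported by PBTTuner.
search_space = {
    "lr": {"_type": "uniform", "_value": [0.0001, 0.1]},
    "momentum": {"_type": "uniform", "_value": [0.5, 0.99]},
}
```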