**Tutorial: Run an experiment on multiple machines**
===
NNI supports running an experiment on multiple machines through an SSH channel, called `remote` mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code.
For example, suppose you have three machines and log in with the account `bob` (note: the account is not necessarily the same on different machines):
| IP | Username | Password |
| -------- |---------|-------|
| 10.1.1.1 | bob | bob123 |
| 10.1.1.2 | bob | bob123 |
| 10.1.1.3 | bob | bob123 |
## Setup NNI environment
Install NNI on each of your machines following the install guide [here](GetStarted.md).
For remote machines that are used only to run trials but not the nnictl command line tool, you can just install the Python SDK:

* __Install python SDK through pip__

  ```bash
  python3 -m pip install --user --upgrade nni-sdk
  ```
## Run an experiment
Install NNI on another machine which has network access to the three machines above, or just use any machine above to run the nnictl command line tool.
We use `examples/trials/mnist-annotation` as an example here. Run `cat ~/nni/examples/trials/mnist-annotation/config_remote.yml` to see the detailed configuration file.
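The exact contents depend on your setup; a minimal sketch of such a remote-mode configuration might look like the following, where the IPs and credentials are the placeholder machines from the table above and the field names are assumed from NNI's config schema:

```yaml
authorName: bob
experimentName: mnist-annotation
trialConcurrency: 3
maxExecDuration: 1h
maxTrialNum: 10
# remote selects the SSH-based multi-machine mode
trainingServicePlatform: remote
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0
# the three machines from the table above
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
  - ip: 10.1.1.2
    username: bob
    passwd: bob123
  - ip: 10.1.1.3
    username: bob
    passwd: bob123
```

After editing such a file, the experiment is started with `nnictl create --config config_remote.yml`.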
NNI provides an easy-to-adopt approach to set up parameter tuning algorithms as well as early stopping policies; we call them **Tuners** and **Assessors**.
**Tuner** specifies the algorithm you use to generate hyperparameter sets for each trial. In NNI, we support two approaches to set the tuner.
1. Directly use a tuner provided by the NNI SDK

   Required fields: `builtinTunerName` and `classArgs`.

2. Customize your own tuner file

   Required fields: `codeDirectory`, `classFileName`, `className` and `classArgs`.
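As a sketch, the two approaches correspond to configuration fragments like the following; the tuner name, file name, and class name are illustrative placeholders, not values from this tutorial:

```yaml
# approach 1: a tuner built into the NNI SDK
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize

# approach 2: a customized tuner file (placeholder names)
# tuner:
#   codeDirectory: ~/my_tuner
#   classFileName: my_tuner.py
#   className: MyTuner
#   classArgs:
#     optimize_mode: maximize
```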
### **Learn More about tuners**
* For the detailed definition and usage of the required fields, please refer to [Config an experiment](ExperimentConfig.md)
* [Tuners in the latest NNI release](HowToChooseTuner.md)
* [How to implement your own tuner](howto_2_CustomizedTuner.md)
**Assessor** specifies the algorithm you use to apply early stop policy. In NNI, there are two approaches to set the assessor.
1. Directly use an assessor provided by the NNI SDK

   Required fields: `builtinAssessorName` and `classArgs`.

2. Customize your own assessor file

   Required fields: `codeDirectory`, `classFileName`, `className` and `classArgs`.
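For instance, a built-in assessor could be configured with a fragment like this; the assessor name and arguments are illustrative, not prescribed by this tutorial:

```yaml
# early-stop trials whose intermediate results fall below the median
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
```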
### **Learn More about assessors**
* For the detailed definition and usage of the required fields, please refer to [Config an experiment](ExperimentConfig.md)
* Find detailed instructions on how to [enable an assessor](EnableAssessor.md)
* [How to implement your own assessor](../examples/assessors/README.md)
## **Learn More**
* [How to run an experiment on local (with multiple GPUs)?](tutorial_1_CR_exp_local_api.md)
* [How to run an experiment on multiple machines?](tutorial_2_RemoteMachineMode.md)
* [How to run an experiment on OpenPAI?](PAIMode.md)
Due to the upload size limitation, we only upload the source code and complete the data download and training on OpenPAI. This experiment requires sufficient memory (`memoryMB >= 32G`), and the training may last several hours.
### Update configuration
Modify `nni/examples/trials/ga_squad/config_pai.yml`; here is the default configuration:

```yaml
authorName: default
...
...
trial:
  gpuNum: 0
  cpuNum: 1
  memoryMB: 32869
  #The docker image to run NNI job on OpenPAI
  image: msranni/nni:latest
  #The hdfs directory to store data on OpenPAI, format 'hdfs://host:port/directory'
  dataDir: hdfs://10.10.10.10:9000/username/nni
  #The hdfs directory to store output data generated by NNI, format 'hdfs://host:port/directory'
```
Modify `examples/trials/network_morphism/cifar10/config.yml` to fit your own task; note that `searchSpacePath` is not required in our configuration. Here is the default configuration:
```yaml
authorName: default
...
...
```

In the trial code, the final accuracy is reported to NNI:

```python
net = build_graph_from_json(RCV_CONFIG)
# training procedure
# ....
# report the final accuracy to NNI
nni.report_final_result(best_acc)
```
### 5. Submit this job
```bash
# Use the NNI command line tool "nnictl" to create a job and submit it to NNI.
# This commits a Network Morphism job to NNI.
nnictl create --config config.yml
```
## Trial Examples
...
...
There are some trial examples which can guide you, located in `examples/trial`.
`Fashion-MNIST` is a dataset of [Zalando](https://jobs.zalando.com/tech/)'s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. It is a modern image classification dataset widely used to replace MNIST as a baseline, because MNIST is too easy and overused.
There are two examples, [FashionMNIST-keras.py](./FashionMNIST/FashionMNIST_keras.py) and [FashionMNIST-pytorch.py](./FashionMNIST/FashionMNIST_pytorch.py). Note that you should change `input_width` to 28 and `input_channel` to 1 in `config.yml` for this dataset.
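Assuming those two settings appear as keys in `config.yml` (where exactly they live in the file depends on the example; they are shown here as bare keys for illustration), the change for Fashion-MNIST would look like:

```yaml
# illustrative fragment; other fields of config.yml stay unchanged
input_width: 28    # Fashion-MNIST images are 28x28
input_channel: 1   # grayscale, so a single channel
```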
### Cifar10
The `CIFAR-10` dataset [Canadian Institute For Advanced Research](https://www.cifar.ca/) is a collection of images that are commonly used to train machine learning and computer vision algorithms. It is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different classes.
There are two examples, [cifar10-keras.py](./cifar10/cifar10_keras.py) and [cifar10-pytorch.py](./cifar10/cifar10_pytorch.py). The value of `input_width` is 32 and the value of `input_channel` is 3 in `config.yml` for this dataset.
First, this is an example of how to write an AutoML algorithm based on MsgDispat
Second, this implementation fully leverages Hyperband's internal parallelism. More specifically, the next bucket does not start strictly after the current one; instead, it starts whenever there are available resources.
## 3. Usage
To use Hyperband, you should add the following spec in your experiment's YAML config file:
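A plausible fragment, assuming Hyperband is configured through NNI's `advisor` section; the argument names `optimize_mode`, `R`, and `eta` follow Hyperband's usual parameters and the values shown are illustrative:

```yaml
advisor:
  builtinAdvisorName: Hyperband
  classArgs:
    # whether to maximize or minimize the reported metric
    optimize_mode: maximize
    # R: the maximum budget given to a single trial
    R: 60
    # eta: keep 1/eta of the configurations in each round
    eta: 3
```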