# Run an Experiment on Remote Machines

NNI can run a single experiment across multiple remote machines over SSH; this is called `remote` mode. It works like a lightweight training platform: you start NNI from your own computer, and it dispatches trials to the remote machines in parallel.

Remote machines can run `Linux`, `Windows 10`, or `Windows Server 2019`.

## Requirements

* Make sure the default environment on each remote machine meets the requirements of your trial code. If it does not, you can add a setup script to the `command` field of the NNI config.

* Make sure the remote machines can be accessed over SSH from the machine that runs the `nnictl` command. Both password and key authentication are supported. For advanced usage, refer to the [machineList part of configuration](../Tutorial/ExperimentConfig.md).

* Make sure the NNI version on each machine is consistent.

* If you use remote Linux and Windows machines together, make sure the trial command works on both OSes. For example, the default Python 3.x executable is called `python3` on Linux and `python` on Windows.

### Linux

* Follow [installation](../Tutorial/InstallationLinux.md) to install NNI on the remote machine.

### Windows

* Follow [installation](../Tutorial/InstallationWin.md) to install NNI on the remote machine.

* Install and start `OpenSSH Server`.

  1. Open `Settings` app on Windows.

  2. Click `Apps`, then click `Optional features`.

  3. Click `Add a feature`, search and select `OpenSSH Server`, and then click `Install`.

  4. Once it is installed, run the commands below to configure the service to start automatically and then start it.

  ```bat
  sc config sshd start=auto
  net start sshd
  ```

* Make sure the remote account is an administrator, so that it can stop running trials.

* Make sure the shell prints no welcome message beyond the default one, because extra output causes the ssh2 library in Node.js to fail. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in `C:\dsvm\tools\setup\welcome.bat`.

  Output like the following is fine when opening a new command window:

  ```text
  Microsoft Windows [Version 10.0.17763.1192]
  (c) 2018 Microsoft Corporation. All rights reserved.

  (py37_default) C:\Users\AzureUser>
  ```

## Run an experiment

For example, suppose there are three machines that can be logged in to with a username and password.

| IP       | Username | Password |
| -------- | -------- | -------- |
| 10.1.1.1 | bob      | bob123   |
| 10.1.1.2 | bob      | bob123   |
| 10.1.1.3 | bob      | bob123   |

Install and run NNI on one of those three machines, or on another machine that has network access to them.
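
Before launching, it can help to confirm that the machine running `nnictl` can actually open a TCP connection to the SSH port of each remote machine. The sketch below is a minimal illustration using only the Python standard library; `ssh_port_reachable` and `check_machines` are hypothetical helper names, not part of NNI, and the check only covers network reachability, not authentication.

```python
import socket
from typing import Dict, Iterable


# Hypothetical helper for illustration; not part of NNI.
def ssh_port_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Hypothetical helper for illustration; not part of NNI.
def check_machines(hosts: Iterable[str], port: int = 22) -> Dict[str, bool]:
    """Map each host to whether its SSH port accepted a connection."""
    return {host: ssh_port_reachable(host, port) for host in hosts}
```

For the table above, `check_machines(["10.1.1.1", "10.1.1.2", "10.1.1.3"])` would return a dictionary telling you which of the three machines accepted a connection on port 22.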

This walkthrough uses `examples/trials/mnist-annotation` as the example. Below is the content of `examples/trials/mnist-annotation/config_remote.yml`:

```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
# search space file
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: true
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
#machineList can be empty if the platform is local
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
    #port can be skipped if using the default ssh port 22
    #port: 22
  - ip: 10.1.1.2
    username: bob
    passwd: bob123
  - ip: 10.1.1.3
    username: bob
    passwd: bob123
```
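
A forgotten credential in `machineList` is easy to miss when the list grows. As a rough sketch, the check below flags entries that lack `ip`, `username`, or any credential. It assumes the entries have already been loaded into plain Python dictionaries, and it treats either `passwd` or `sshKeyPath` (the two fields used in this document's examples) as sufficient. `check_machine_list` is a hypothetical helper, not an NNI API, and it does not replace NNI's own config validation.

```python
from typing import Dict, List


# Hypothetical pre-flight check for illustration; not part of NNI.
def check_machine_list(machines: List[Dict[str, object]]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for index, machine in enumerate(machines):
        for field in ("ip", "username"):
            if not machine.get(field):
                problems.append(f"machine {index}: missing '{field}'")
        # Assumption: either a password or an SSH key path is enough to log in.
        if not (machine.get("passwd") or machine.get("sshKeyPath")):
            problems.append(f"machine {index}: needs 'passwd' or 'sshKeyPath'")
    return problems


machines = [
    {"ip": "10.1.1.1", "username": "bob", "passwd": "bob123"},
    {"ip": "10.1.1.2", "username": "bob"},  # deliberately incomplete entry
]
for problem in check_machine_list(machines):
    print(problem)
```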

Files in `codeDir` are uploaded to the remote machines automatically. You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:

```bash
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
```

### Configure python environment

By default, commands and scripts are executed in the default environment of the remote machine. If the remote machine has multiple Python virtual environments and you want to run experiments in a specific one, use __preCommand__ to select that environment.

This section uses `examples/trials/mnist-tfv2` as the example. Below is the content of `examples/trials/mnist-tfv2/config_remote.yml`:

```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
#machineList can be empty if the platform is local
machineList:
  - ip: ${replace_to_your_remote_machine_ip}
    username: ${replace_to_your_remote_machine_username}
    sshKeyPath: ${replace_to_your_remote_machine_sshKeyPath}
    # Pre-command will be executed before the remote machine executes other commands.
    # Below is an example of specifying python environment.
    # If you want to execute multiple commands, please use "&&" to connect them.
    # preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
    # preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
    preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
```

The __preCommand__ is executed before the remote machine executes any other command, so you can use it to put a Python environment on the path like this:

```yaml
# Linux remote machine
preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
# Windows remote machine
preCommand: set path=${replace_to_python_environment_path_in_your_remote_machine};%path%
```

Or if you want to activate the `virtualenv` environment:

```yaml
# Linux remote machine
preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
# Windows remote machine
preCommand: ${replace_to_absolute_path_recommended_here}\\scripts\\activate
```

Or if you want to activate the `conda` environment:

```yaml
# Linux remote machine
preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
# Windows remote machine
preCommand: call activate ${replace_to_conda_env_name}
```

If you want multiple commands to be executed, you can use `&&` to connect these commands:

```yaml
preCommand: command1 && command2 && command3
```
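
If the setup steps are kept in a script or a list, the `preCommand` string can be assembled programmatically. The snippet below is a small sketch of that idea; `join_pre_commands` is a hypothetical helper, not part of NNI, and the two paths are placeholders. Chaining with `&&` means each command runs only if the previous one succeeded.

```python
from typing import Sequence


# Hypothetical helper for illustration; not part of NNI.
def join_pre_commands(commands: Sequence[str]) -> str:
    """Join shell commands with '&&', skipping empty entries."""
    cleaned = [command.strip() for command in commands if command.strip()]
    return " && ".join(cleaned)


pre_command = join_pre_commands([
    "export PATH=/opt/py37/bin:$PATH",      # placeholder path
    "source /opt/venvs/nni/bin/activate",   # placeholder path
])
# Prints a line that can be pasted into a machineList entry.
print(f"preCommand: {pre_command}")
```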

__Note__: Because __preCommand__ is executed before other commands every time, it is strongly recommended not to set a __preCommand__ that makes changes to the system, e.g. `mkdir` or `touch`.