Run an Experiment on Remote Machines
====================================

NNI can run a single experiment across multiple remote machines over SSH; this is called ``remote`` mode. It works like a lightweight training platform: you start NNI from your own computer, and it dispatches trials to the remote machines in parallel.

Remote machines can run ``Linux``\ , ``Windows 10``\ , or ``Windows Server 2019``.

Requirements
------------


* 
  Make sure the default environment of each remote machine meets the requirements of your trial code. If it does not, add a setup script to the ``command`` field of the NNI config.

* 
  Make sure the remote machines can be reached through SSH from the machine that runs the ``nnictl`` command. Both password and key authentication are supported. For advanced usage, refer to the `machineList part of configuration <../Tutorial/ExperimentConfig.rst>`__.

* 
  Make sure every machine runs the same NNI version.

* 
  If you want to use remote Linux and Windows machines together, make sure the trial command is compatible with both operating systems. For example, the default Python 3.x executable is called ``python3`` on Linux and ``python`` on Windows.
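
The interpreter-name difference described in the last requirement can be sketched as follows. This is a minimal illustration only: ``trial_command`` is a hypothetical helper, not part of NNI, and it assumes the only difference between the two platforms is the executable name.

```python
# Hypothetical helper (not part of NNI): choose the interpreter name that is
# usual for the remote OS, as described in the requirement above.
def trial_command(remote_os: str, script: str) -> str:
    """Return a trial command string for the given remote OS."""
    interpreters = {"linux": "python3", "windows": "python"}
    key = remote_os.strip().lower()
    if key not in interpreters:
        raise ValueError(f"unsupported remote OS: {remote_os!r}")
    return f"{interpreters[key]} {script}"
```

Calling ``trial_command("Linux", "mnist.py")`` yields ``python3 mnist.py``, while ``trial_command("Windows", "mnist.py")`` yields ``python mnist.py``.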

Linux
^^^^^


* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine.

Windows
^^^^^^^


* 
  Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine.

* 
  Install and start ``OpenSSH Server``.


  #. 
     Open the ``Settings`` app on Windows.

  #. 
     Click ``Apps``\ , then click ``Optional features``.

  #. 
     Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``.

  #. 
     Once it is installed, run the commands below to set the service to start automatically and to start it now.

  .. code-block:: bat

     sc config sshd start=auto
     net start sshd

* 
  Make sure the remote account is an administrator, so that it can stop running trials.

* 
  Make sure no welcome message beyond the default is printed, since extra output causes ssh2 to fail in Node.js. For example, if you are using a Data Science VM on Azure, remove the extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``.

  Output like the following is fine when opening a new command window.

  .. code-block:: text

     Microsoft Windows [Version 10.0.17763.1192]
     (c) 2018 Microsoft Corporation. All rights reserved.

     (py37_default) C:\Users\AzureUser>

Run an experiment
-----------------

For example, suppose there are three machines that can be logged in to with a username and password.

.. list-table::
   :header-rows: 1
   :widths: auto

   * - IP
     - Username
     - Password
   * - 10.1.1.1
     - bob
     - bob123
   * - 10.1.1.2
     - bob
     - bob123
   * - 10.1.1.3
     - bob
     - bob123


Install and run NNI on one of those three machines, or on another machine that has network access to them.
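
Before starting an experiment, it can help to confirm that the machine running NNI can actually reach the SSH port of each remote machine. The following is a minimal sketch using only the Python standard library; ``ssh_port_open`` is a hypothetical helper, not part of NNI, and it only checks that a TCP connection can be opened, not that authentication succeeds.

```python
import socket


# Hypothetical helper (not part of NNI): test whether a TCP connection to the
# SSH port can be opened. This does not verify credentials.
def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the machines in the table above, you might call ``ssh_port_open("10.1.1.1")`` and so on for each address, passing ``port`` explicitly if a machine listens on a non-default SSH port.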

Use ``examples/trials/mnist-pytorch`` as the example. The content of ``examples/trials/mnist-pytorch/config_remote.yml`` is shown below:

.. code-block:: yaml

   searchSpaceFile: search_space.json
   trialCommand: python3 mnist.py
   trialCodeDirectory: .  # default value, can be omitted
   trialGpuNumber: 0
   trialConcurrency: 4
   maxTrialNumber: 20
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
   trainingService:
     platform: remote
     machineList:
       - host: 192.0.2.1
         user: alice
         ssh_key_file: ~/.ssh/id_rsa
       - host: 192.0.2.2
         port: 10022
         user: bob
         password: bob123
         pythonPath: /usr/bin

Files in ``trialCodeDirectory`` are uploaded to the remote machines automatically. You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:

.. code-block:: bash

   nnictl create --config examples/trials/mnist-pytorch/config_remote.yml
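
The example config lists its own placeholder hosts. The three password-based machines from the table earlier in this section could be written as ``machineList`` entries in the same shape. The sketch below generates such a fragment; it assumes the field names ``host``, ``user``, and ``password`` exactly as they appear in the example config, and ``render_machine_list`` is a hypothetical helper, not part of NNI.

```python
# Hypothetical helper (not part of NNI): render a YAML ``machineList``
# fragment for machines that use password authentication.
# Values are written as plain scalars, so this assumes they contain no
# characters that would need YAML quoting.
def render_machine_list(machines):
    """Return a YAML fragment with one entry per machine dictionary."""
    lines = ["machineList:"]
    for machine in machines:
        lines.append(f"  - host: {machine['host']}")
        lines.append(f"    user: {machine['user']}")
        lines.append(f"    password: {machine['password']}")
    return "\n".join(lines)


# The three machines from the table: same user and password, different IPs.
MACHINES = [
    {"host": f"10.1.1.{i}", "user": "bob", "password": "bob123"}
    for i in (1, 2, 3)
]
```

Printing ``render_machine_list(MACHINES)`` gives a fragment that can be placed under ``trainingService`` in place of the placeholder entries.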

Configure Python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, commands and scripts are executed in the default environment of the remote machine. If the remote machine has multiple Python virtual environments and you want to run experiments in a specific one, use **pythonPath** to specify that environment.

For example, with Anaconda you can specify:

.. code-block:: yaml

   pythonPath: /home/bob/.conda/envs/ENV-NAME/bin
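
To see why pointing **pythonPath** at an environment's ``bin`` directory selects that environment, consider what happens when a directory is placed at the front of ``PATH``. The sketch below assumes that this is how the setting takes effect on a remote Linux machine; that behavior is an assumption here and is not stated in this document. ``prepend_to_path`` is a hypothetical helper, not part of NNI.

```python
# Hypothetical helper (not part of NNI). Assumption: the pythonPath directory
# is placed at the front of PATH, so its ``python3`` is found first.
def prepend_to_path(python_path: str, current_path: str, sep: str = ":") -> str:
    """Return a PATH string with python_path in front of current_path."""
    if not current_path:
        return python_path
    return f"{python_path}{sep}{current_path}"
```

With ``python_path`` set to ``/home/bob/.conda/envs/ENV-NAME/bin`` and a current ``PATH`` of ``/usr/bin:/bin``, the result starts with the environment's ``bin`` directory, so executables there take precedence over the system ones.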