@@ -6,7 +6,7 @@ An experiment can be created with command line tool ``nnictl`` or python APIs. N
...
@@ -6,7 +6,7 @@ An experiment can be created with command line tool ``nnictl`` or python APIs. N
Management with ``nnictl``
Management with ``nnictl``
--------------------------
--------------------------
``nnictl`` offers almost the same experiment management capabilities as :doc:`web_portal/web_portal`. Users can refer to :doc:`../reference/nnictl` for detailed usage. It is highly recommended when visualization is not well supported in your environment (e.g., no GUI on your machine).
``nnictl`` offers almost the same experiment management capabilities as :doc:`web_portal/web_portal`. Users can refer to :doc:`../reference/nnictl` for detailed usage. It is highly recommended when visualization is not well supported in your environment (e.g., no web browser is available).
We have a CIFAR10 example that fully leverages the AdaptDL scheduler in the :githublink:`examples/trials/cifar10_pytorch` folder. (:githublink:`main_adl.py <examples/trials/cifar10_pytorch/main_adl.py>` and :githublink:`config_adl.yaml <examples/trials/cifar10_pytorch/config_adl.yaml>`)
We have a CIFAR10 example that fully leverages the AdaptDL scheduler in the :githublink:`examples/trials/cifar10_pytorch` folder. (:githublink:`main_adl.py <examples/trials/cifar10_pytorch/main_adl.py>` and :githublink:`config_adl.yaml <examples/trials/cifar10_pytorch/config_adl.yml>`)
Here is a template configuration specification to use AdaptDL as a training service.
Here is a template configuration specification to use AdaptDL as a training service.
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module that manages trial jobs; it communicates with the NNIManager module and has a different instance for each training platform. For the time being, NNI supports `local platform <LocalMode.rst>`__\ , `remote platform <RemoteMachineMode.rst>`__\ , `PAI platform <PaiMode.rst>`__\ , `kubeflow platform <KubeflowMode.rst>`__ and `FrameworkController platform <FrameworkControllerMode.rst>`__.
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module that manages trial jobs; it communicates with the NNIManager module and has a different instance for each training platform. For the time being, NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./kubeflow` and :doc:`./frameworkcontroller`.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they only need to implement a child class of TrainingService; they do not need to understand the code details of NNIManager, Dispatcher, or other modules.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they only need to implement a child class of TrainingService; they do not need to understand the code details of NNIManager, Dispatcher, or other modules.
...
@@ -185,6 +185,4 @@ When users submit a trial job to cloud platform, they should wrap their trial co
...
@@ -185,6 +185,4 @@ When users submit a trial job to cloud platform, they should wrap their trial co
Reference
Reference
---------
---------
For more information about how to debug, please `refer <../Tutorial/HowToDebug.rst>`__.
For guidelines on how to contribute, please refer to :doc:`/notes/contributing`.
For guidelines on how to contribute, please `refer <../Tutorial/Contributing.rst>`__ to the contributing guide.
@@ -6,7 +6,7 @@ NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai
...
@@ -6,7 +6,7 @@ NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai
Prerequisite
Prerequisite
------------
------------
1. Before starting to use OpenPAI training service, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. Please note that, on OpenPAI, your trial program will run in Docker containers.
1. Before starting to use OpenPAI training service, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. Please note that, on OpenPAI, your trial program will run in Docker containers.
2. Get a token. Open the OpenPAI web portal and click the ``My profile`` button in the top-right corner.
2. Get a token. Open the OpenPAI web portal and click the ``My profile`` button in the top-right corner.
@@ -8,7 +8,7 @@ PAI-DSW server performs the role to submit a job while PAI-DLC is where the trai
...
@@ -8,7 +8,7 @@ PAI-DSW server performs the role to submit a job while PAI-DLC is where the trai
Prerequisite
Prerequisite
------------
------------
Step 1. Install NNI by following the install guide `here <../Tutorial/QuickStart.rst>`__.
Step 1. Install NNI by following the :doc:`install guide </installation>`.
Step 2. Create a PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note that because the training service runs on PAI-DLC, the PAI-DSW server does not need many resources, and a PAI-DSW server with only a CPU may be enough.
Step 2. Create a PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note that because the training service runs on PAI-DLC, the PAI-DSW server does not need many resources, and a PAI-DSW server with only a CPU may be enough.
@@ -21,18 +21,18 @@ In addition, there are several steps for Windows server.
...
@@ -21,18 +21,18 @@ In addition, there are several steps for Windows server.
1. Install and start ``OpenSSH Server``.
1. Install and start ``OpenSSH Server``.
1) Open ``Settings`` app on Windows.
1) Open ``Settings`` app on Windows.
2) Click ``Apps``\ , then click ``Optional features``.
2) Click ``Apps``\ , then click ``Optional features``.
3) Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``.
3) Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``.
4) Once it is installed, run the commands below to start the service and set it to start automatically.
4) Once it is installed, run the commands below to start the service and set it to start automatically.
.. code-block:: bat
.. code-block:: bat
sc config sshd start=auto
sc config sshd start=auto
net start sshd
net start sshd
2. Make sure the remote account is an administrator, so that it can stop running trials.
2. Make sure the remote account is an administrator, so that it can stop running trials.
...
@@ -85,7 +85,7 @@ You can run below command on Windows, Linux, or macOS to spawn trials on remote
...
@@ -85,7 +85,7 @@ You can run below command on Windows, Linux, or macOS to spawn trials on remote
.. _nniignore:
.. _nniignore:
.. Note:: If you are planning to use remote machines or clusters as your training service, NNI limits the number of files to 2000 and the total size to 300MB to avoid putting too much pressure on the network. If your codeDir contains too many files, you can choose which files and subfolders to exclude by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
.. Note:: If you are planning to use remote machines or clusters as your training service, NNI limits the number of files to 2000 and the total size to 300MB to avoid putting too much pressure on the network. If your trial code directory contains too many files, you can choose which files and subfolders to exclude by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
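To see whether a trial code directory is close to the limits mentioned in the note above, a rough local check can help before writing a ``.nniignore`` file. The following is only a minimal sketch and not part of NNI: the helper names ``summarize_code_dir`` and ``within_limits`` are hypothetical, the sketch does not apply ``.nniignore`` patterns, and it assumes that 300MB means 300 * 1024 * 1024 bytes and that the limits are inclusive.

```python
import os

MAX_FILES = 2000                # file-count limit stated in the note above
MAX_BYTES = 300 * 1024 * 1024   # 300MB, assuming 1MB = 1024 * 1024 bytes


def summarize_code_dir(code_dir):
    """Return (file_count, total_bytes) for all regular files under code_dir.

    Hypothetical helper: it counts every file and ignores .nniignore rules.
    """
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(code_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


def within_limits(file_count, total_bytes):
    """Return True when both values stay within the assumed inclusive limits."""
    return file_count <= MAX_FILES and total_bytes <= MAX_BYTES
```

If ``within_limits(*summarize_code_dir("path/to/trial/code"))`` returns ``False``, excluding large datasets, checkpoints, or log folders through ``.nniignore`` is a likely first step.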
...
@@ -111,4 +111,4 @@ Remote training service support shared storage, which can help use your own stor
...
@@ -111,4 +111,4 @@ Remote training service support shared storage, which can help use your own stor
Monitor via TensorBoard
Monitor via TensorBoard
^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^
Remote training service supports trial visualization via TensorBoard. Follow the guide `here <./tensorboard.rst>`__ to learn how to use TensorBoard.
Remote training service supports trial visualization via TensorBoard. Follow the guide :doc:`/experiment/web_portal/tensorboard` to learn how to use TensorBoard.
@@ -7,7 +7,7 @@ All the information generated by the experiment will be stored under ``/nni`` fo
...
@@ -7,7 +7,7 @@ All the information generated by the experiment will be stored under ``/nni`` fo
All the output produced by the trial will be located under ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}/nnioutput`` folder in your shared storage.
All the output produced by the trial will be located under ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}/nnioutput`` folder in your shared storage.
This saves you from searching for experiment-related information in various places.
This saves you from searching for experiment-related information in various places.
Remember that your trial working directory is ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}``, so if you upload your data to this shared storage, you can open it like a local file in your trial code without downloading it.
Remember that your trial working directory is ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}``, so if you upload your data to this shared storage, you can open it like a local file in your trial code without downloading it.
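The directory layout described above can be sketched as a pair of small path helpers. This is an illustrative sketch only: ``trial_working_dir`` and ``trial_output_dir`` are hypothetical names, not NNI APIs, and the experiment and trial IDs in the usage note are made up.

```python
from pathlib import PurePosixPath


def trial_working_dir(experiment_id, trial_id, root="/nni"):
    """Build the trial working directory path inside the shared storage."""
    return PurePosixPath(root) / experiment_id / "trials" / trial_id


def trial_output_dir(experiment_id, trial_id, root="/nni"):
    """Build the folder that holds the output produced by the trial."""
    return trial_working_dir(experiment_id, trial_id, root) / "nnioutput"
```

For example, with the made-up IDs ``exp1`` and ``trialA``, ``trial_output_dir("exp1", "trialA")`` yields ``/nni/exp1/trials/trialA/nnioutput``.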
We will develop more practical features based on shared storage in the future. The config reference can be found `here <../reference/experiment_config.html#sharedstorageconfig>`_.
We will develop more practical features based on shared storage in the future. The config reference can be found :ref:`here <reference-sharedstorage-config-label>`.
.. note::
.. note::
Shared storage is currently in the experimental stage. We suggest using AzureBlob under Ubuntu/CentOS/RHEL, and NFS under Ubuntu/CentOS/RHEL/Fedora/Debian for remote.
Shared storage is currently in the experimental stage. We suggest using AzureBlob under Ubuntu/CentOS/RHEL, and NFS under Ubuntu/CentOS/RHEL/Fedora/Debian for remote.
Currently only `LocalConfig`_, `RemoteConfig`_, `OpenpaiConfig`_ and `AmlConfig`_ are supported. Detailed usage can be found :doc:`here </experiment/training_service/hybrid>`.
Currently only `LocalConfig`_, `RemoteConfig`_, `OpenpaiConfig`_ and `AmlConfig`_ are supported. Detailed usage can be found :doc:`here </experiment/training_service/hybrid>`.