Overview
========

NNI supports the training services listed below. See each page to learn how to configure the corresponding training service. NNI is designed to be extensible, so users can also implement a new training service for their own resources, platforms, or needs.


..  list-table::
    :header-rows: 1

    * - Training Service
      - Description
    * - :doc:`Local <local>`
      - The whole experiment runs on your dev machine (i.e., a single local machine)
    * - :doc:`Remote <remote>`
      - The trials are dispatched to your configured SSH servers
    * - :doc:`OpenPAI <openpai>`
      - Running trials on OpenPAI, a DNN model training platform based on Kubernetes
    * - :doc:`Kubeflow <kubeflow>`
      - Running trials with Kubeflow, a DNN model training framework based on Kubernetes
    * - :doc:`AdaptDL <adaptdl>`
      - Running trials on AdaptDL, an elastic DNN model training platform
    * - :doc:`FrameworkController <frameworkcontroller>`
      - Running trials with FrameworkController, a DNN model training framework on Kubernetes
    * - :doc:`AML <aml>`
      - Running trials on Azure Machine Learning (AML) cloud service
    * - :doc:`PAI-DLC <paidlc>`
      - Running trials on PAI-DLC, a deep learning container service based on Alibaba ACK
    * - :doc:`Hybrid <hybrid>`
      - Running trials jointly on several of the training services above
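
As a rough illustration of how a training service is selected, the fragment below sketches the ``trainingService`` part of an experiment configuration for the hybrid case, where the field is a list with one entry per platform. The host, user, and key path are placeholder values, and the field names are assumed to follow the v2 experiment configuration schema; check the page for each training service for the authoritative fields.

.. code-block:: yaml

    # Sketch only: placeholder machine details, field names assumed from
    # the v2 experiment configuration schema.
    trainingService:
      - platform: local
      - platform: remote
        machineList:
          - host: 192.0.2.10        # placeholder address
            user: alice             # placeholder account
            sshKeyFile: ~/.ssh/id_rsa

For a single training service, ``trainingService`` would instead be one mapping with its own ``platform`` value.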

.. _training-service-reuse:

Training Service Under Reuse Mode
---------------------------------

Since NNI v2.0, there are two sets of training service implementations in NNI. The new one is called *reuse mode*. When reuse mode is enabled, a cluster, such as a remote machine or a compute instance on AML, launches a long-running environment, and NNI submits trials to this environment iteratively, which saves the time otherwise spent creating a new job for each trial. For instance, using the OpenPAI training platform under reuse mode avoids the overhead of repeatedly pulling Docker images, creating containers, and downloading data.

.. note:: In reuse mode, users need to make sure each trial can run independently in the same job (e.g., avoid loading checkpoints from previous trials).
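
A minimal sketch of turning reuse mode on for the remote training service is shown below. It assumes the ``reuseMode`` field of the v2 experiment configuration schema; the machine entry is a placeholder, and the exact fields and defaults for each platform are described on that platform's page.

.. code-block:: yaml

    # Sketch only: reuseMode is assumed to be the switch for reuse mode;
    # the machine details below are placeholders.
    trainingService:
      platform: remote
      reuseMode: true
      machineList:
        - host: 192.0.2.10          # placeholder address
          user: alice               # placeholder account
          sshKeyFile: ~/.ssh/id_rsa

With a setting like this, trials share one long-running environment on the machine, which is why each trial must not depend on files or checkpoints left behind by earlier trials.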