# **How to Implement TrainingService in NNI**

## Overview

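TrainingService is the module responsible for platform management and job scheduling. To make it easy to implement, its design abstracts the properties common to all platforms into a base class. To support a new platform, users only need to inherit from this abstract class and implement a subclass that reflects the platform's characteristics.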

## System architecture

![](../img/NNIDesign.jpg)

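The figure above shows the architecture of NNI. NNIManager is the core management module of the system; it calls TrainingService to manage trials and handles communication between the different modules. Dispatcher is the message processing center. TrainingService is the module that manages trial jobs; it communicates with NNIManager and has a different implementation for each platform. NNI currently supports the local machine, [remote machines](RemoteMachineMode.md), [OpenPAI](PaiMode.md), [Kubeflow](KubeflowMode.md), and [FrameworkController](FrameworkController.md).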

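This document gives a brief introduction to the design of TrainingService. To add a new TrainingService, you only need to inherit from the TrainingService class and implement the required methods; you do not need to understand the details of NNIManager, Dispatcher, or the other modules.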

## Folder structure of the code

The folder structure of NNI is shown below:

    nni
      |- deployment
      |- docs
      |- examples
      |- src
      | |- nni_manager
      | | |- common
      | | |- config
      | | |- core
      | | |- coverage
      | | |- dist
      | | |- rest_server
      | | |- training_service
      | | | |- common
      | | | |- kubernetes
      | | | |- local
      | | | |- pai
      | | | |- remote_machine
      | | | |- test
      | |- sdk
      | |- webui
      |- test
      |- tools
      | |-nni_annotation
      | |-nni_cmd
      | |-nni_gpu_tool
      | |-nni_trial_tool
    

The `nni/src` folder stores most of NNI's source code. The code in this folder relates to NNIManager, TrainingService, the SDK, the WebUI, and other modules. The TrainingService abstract class is defined in `nni/src/nni_manager/common/trainingService.ts`, and users should put their own subclass under `nni/src/nni_manager/training_service`. Users who implement their own TrainingService should also write unit tests for it and place them in `nni/src/nni_manager/training_service/test`.

## TrainingService functions explained

    abstract class TrainingService {
        public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
        public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
        public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
        public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
        public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
        public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
        public abstract get isMultiPhaseJobSupported(): boolean;
        public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
        public abstract setClusterMetadata(key: string, value: string): Promise<void>;
        public abstract getClusterMetadata(key: string): Promise<string>;
        public abstract cleanUp(): Promise<void>;
        public abstract run(): Promise<void>;
    }
    

The TrainingService base class declares the abstract methods listed above. Users need to inherit from it and implement each of these methods.

**setClusterMetadata(key: string, value: string)**

ClusterMetadata is the data that describes platform details. For example, the ClusterMetadata defined for remote machines is:

    export class RemoteMachineMeta {
        public readonly ip : string;
        public readonly port : number;
        public readonly username : string;
        public readonly passwd?: string;
        public readonly sshKeyPath?: string;
        public readonly passphrase?: string;
        public gpuSummary : GPUSummary | undefined;
        /* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU*/
        public gpuReservation : Map<number, string>;
    
        constructor(ip : string, port : number, username : string, passwd : string,
            sshKeyPath : string, passphrase : string) {
            this.ip = ip;
            this.port = port;
            this.username = username;
            this.passwd = passwd;
            this.sshKeyPath = sshKeyPath;
            this.passphrase = passphrase;
            this.gpuReservation = new Map<number, string>();
        }
    }
    

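The metadata includes the host address, the username, and other platform-specific configuration. Users need to define their own metadata format and handle it in this method. This method is called before the experiment starts.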

**getClusterMetadata(key: string)**

This function returns the metadata value for the given key. If it is not needed, the implementation can be left empty.
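
As an illustration of the two metadata methods, here is a minimal sketch, not NNI's actual implementation. The `MyPlatformMeta` format, the `MyMetadataStore` class, and the `machine_list` key are all hypothetical names chosen for this example; a real TrainingService would define whatever format its platform needs.

```typescript
// Hypothetical metadata format for an imaginary platform; not part of NNI.
interface MyPlatformMeta {
    readonly host: string;
    readonly port: number;
    readonly username: string;
}

// Stand-in for the metadata part of a TrainingService subclass.
class MyMetadataStore {
    // Raw values are kept so that getClusterMetadata can return them unchanged.
    private readonly rawValues: Map<string, string> = new Map<string, string>();
    public machines: MyPlatformMeta[] = [];

    public async setClusterMetadata(key: string, value: string): Promise<void> {
        this.rawValues.set(key, value);
        // 'machine_list' is an assumed key name carrying a JSON array.
        if (key === 'machine_list') {
            this.machines = JSON.parse(value) as MyPlatformMeta[];
        }
    }

    public async getClusterMetadata(key: string): Promise<string> {
        const value: string | undefined = this.rawValues.get(key);
        if (value === undefined) {
            throw new Error(`Unknown metadata key: ${key}`);
        }
        return value;
    }
}
```

Keeping the raw string next to the parsed form lets `getClusterMetadata` return exactly what was set, while the rest of the service works with typed objects.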

**submitTrialJob(form: JobApplicationForm)**

submitTrialJob submits a new trial job. It must create a job instance of type TrialJobDetail, which is defined as follows:

    interface TrialJobDetail {
        readonly id: string;
        readonly status: TrialJobStatus;
        readonly submitTime: number;
        readonly startTime?: number;
        readonly endTime?: number;
        readonly tags?: string[];
        readonly url?: string;
        readonly workingDirectory: string;
        readonly form: JobApplicationForm;
        readonly sequenceId: number;
        isEarlyStopped?: boolean;
    }
    

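Depending on the implementation, users may put the trial job into a queue and keep taking jobs from the queue to submit them. Alternatively, the whole submission can be completed directly in this method.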
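
The queue-based variant could look roughly like the sketch below. It is an illustration under stated assumptions: `SketchJobForm` and `SketchTrialJob` are simplified, hypothetical stand-ins for NNI's `JobApplicationForm` and `TrialJobDetail`, and `launchNext` is an invented helper that the main loop would call.

```typescript
// Simplified, hypothetical stand-ins for NNI's JobApplicationForm and TrialJobDetail.
interface SketchJobForm {
    readonly command: string;
}

interface SketchTrialJob {
    readonly id: string;
    status: 'WAITING' | 'RUNNING';
    readonly submitTime: number;
    readonly form: SketchJobForm;
}

class QueueingSubmitter {
    private readonly jobs: Map<string, SketchTrialJob> = new Map<string, SketchTrialJob>();
    private readonly pendingIds: string[] = [];
    private nextId: number = 0;

    // Creates the job record and enqueues it; the platform call happens later.
    public async submitTrialJob(form: SketchJobForm): Promise<SketchTrialJob> {
        const job: SketchTrialJob = {
            id: `trial-${this.nextId++}`,
            status: 'WAITING',
            submitTime: Date.now(),
            form
        };
        this.jobs.set(job.id, job);
        this.pendingIds.push(job.id);
        return job;
    }

    // Invented helper: launches the oldest waiting job, if any, and returns its id.
    public launchNext(launch: (job: SketchTrialJob) => void): string | undefined {
        const id: string | undefined = this.pendingIds.shift();
        if (id === undefined) {
            return undefined;
        }
        const job: SketchTrialJob = this.jobs.get(id) as SketchTrialJob;
        launch(job);
        job.status = 'RUNNING';
        return id;
    }
}
```

Separating "record the job" from "launch the job" keeps `submitTrialJob` fast and lets the service throttle launches, for example when the platform has a limited number of GPUs.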

**cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)**

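When this function is called, the trial started by the platform should be canceled. Each platform has its own way of canceling a job, so this method should implement the platform-specific details.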
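
One possible shape for this method is sketched below. The `PlatformCancel` callback and the `CancelSketch` class are hypothetical, and the status strings written after cancellation are assumptions made for the example; the platform-specific work is whatever the callback does.

```typescript
// Hypothetical platform call that stops a job; a real service would call its platform API here.
type PlatformCancel = (trialJobId: string) => Promise<void>;

class CancelSketch {
    // Job id -> status string; a stand-in for the stored TrialJobDetail objects.
    public readonly statuses: Map<string, string> = new Map<string, string>();

    constructor(private readonly platformCancel: PlatformCancel) {}

    public async cancelTrialJob(trialJobId: string, isEarlyStopped: boolean = false): Promise<void> {
        if (!this.statuses.has(trialJobId)) {
            throw new Error(`Trial job ${trialJobId} not found`);
        }
        // Platform-specific step: stop the remote process, delete the cluster job, and so on.
        await this.platformCancel(trialJobId);
        // Assumed status names; record why the job stopped.
        this.statuses.set(trialJobId, isEarlyStopped ? 'EARLY_STOPPED' : 'USER_CANCELED');
    }
}
```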

**updateTrialJob(trialJobId: string, form: JobApplicationForm)**

This function is called to update the status of a trial job. The status is detected in a platform-specific way and should be updated to `RUNNING`, `SUCCEEDED`, `FAILED`, and so on.

**getTrialJob(trialJobId: string)**

This function returns the trial job instance that matches the given trialJobId.

**listTrialJobs()**

Users should put the details of all trial jobs into a list and return it.
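
If the service keeps its jobs in a map keyed by job id, both lookup methods become short. The sketch below assumes such a map; `SketchJobDetail`, `JobRegistrySketch`, and `track` are hypothetical names, and the detail type is reduced to two fields for brevity.

```typescript
// Hypothetical, reduced stand-in for TrialJobDetail.
interface SketchJobDetail {
    readonly id: string;
    readonly status: string;
}

class JobRegistrySketch {
    private readonly jobs: Map<string, SketchJobDetail> = new Map<string, SketchJobDetail>();

    // Invented helper used by submitTrialJob to remember a new job.
    public track(job: SketchJobDetail): void {
        this.jobs.set(job.id, job);
    }

    public async getTrialJob(trialJobId: string): Promise<SketchJobDetail> {
        const job: SketchJobDetail | undefined = this.jobs.get(trialJobId);
        if (job === undefined) {
            throw new Error(`Trial job ${trialJobId} not found`);
        }
        return job;
    }

    public async listTrialJobs(): Promise<SketchJobDetail[]> {
        // A Map iterates in insertion order, so jobs come back in submission order.
        return Array.from(this.jobs.values());
    }
}
```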

**addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**

NNI uses an EventEmitter to handle job metrics. When new metric data is detected, the EventEmitter is triggered and the corresponding handlers run. Users should start the EventEmitter in this method.

**removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**

This function closes the EventEmitter.
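
A minimal sketch of the two listener methods, built on Node.js's `events` module, is shown below. The `'metric'` event name, the `SketchMetric` type, and the `reportMetric` hook are assumptions made for this example rather than a description of NNI's own code.

```typescript
import { EventEmitter } from 'events';

// Hypothetical, reduced stand-in for TrialJobMetric.
interface SketchMetric {
    readonly id: string;
    readonly data: string;
}

class MetricListenerSketch {
    private readonly metricsEmitter: EventEmitter = new EventEmitter();

    public addTrialJobMetricListener(listener: (metric: SketchMetric) => void): void {
        // 'metric' is an assumed event name.
        this.metricsEmitter.on('metric', listener);
    }

    public removeTrialJobMetricListener(listener: (metric: SketchMetric) => void): void {
        // The same function reference that was added must be passed here.
        this.metricsEmitter.off('metric', listener);
    }

    // Invented hook: platform code calls this when it detects new metric data.
    public reportMetric(metric: SketchMetric): void {
        this.metricsEmitter.emit('metric', metric);
    }
}
```

Because removal works by function identity, callers should keep a reference to the listener they registered instead of passing a fresh arrow function each time.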

**run()**

run() is the main loop of TrainingService. Users can execute their code logic in a loop inside this function, and it keeps running until the experiment ends.

**cleanUp()**

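After the experiment ends, this method is called to clean up the experiment environment. Users should implement the platform-specific cleanup in this method.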
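
The relationship between run() and cleanUp() can be sketched as a polling loop guarded by a stop flag. This is one possible arrangement, not necessarily how NNI's built-in services do it; the `RunLoopSketch` class, the `stopping` flag, and the `pollIntervalMs` parameter are hypothetical.

```typescript
class RunLoopSketch {
    private stopping: boolean = false;
    // Counts loop passes; a placeholder for real work such as launching queued jobs.
    public iterations: number = 0;

    constructor(private readonly pollIntervalMs: number = 1000) {}

    public async run(): Promise<void> {
        while (!this.stopping) {
            this.iterations++;
            // Sleep between passes so the loop does not block the event loop.
            await new Promise<void>((resolve) => setTimeout(resolve, this.pollIntervalMs));
        }
    }

    public async cleanUp(): Promise<void> {
        // Ask the main loop to exit; platform-specific teardown would follow here.
        this.stopping = true;
    }
}
```

With this arrangement the promise returned by `run()` resolves shortly after `cleanUp()` is called, at most one polling interval later.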

## TrialKeeper tool

NNI provides the TrialKeeper tool to help maintain trial jobs. Its source code is in the `nni/tools/nni_trial_tool` folder. It is a useful tool for maintaining jobs that run on a cloud platform.

The architecture of TrialKeeper is shown below:

![](../img/trialkeeper.jpg)

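To run a job on a remote cloud platform, users pass the job's start command to TrialKeeper and start the TrialKeeper process on that platform. Note that TrialKeeper on the remote platform uses a RESTful service to communicate with TrainingService, so users need to start a RESTful server on the local machine to receive TrialKeeper's requests. The source code of the RESTful server is in `nni/src/nni_manager/training_service/common/clusterJobRestServer.ts`.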

## Reference

For more information about debugging, see [here](HowToDebug.md).

For guidelines on how to contribute, see [here](Contributing.md).