# **Implement TrainingService in NNI**

## Overview

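TrainingService is the module responsible for platform management and job scheduling. To make it easy to implement, its design abstracts the properties common to all platforms into a class. To implement a TrainingService, users only need to inherit from this abstract class and write a subclass tailored to their platform.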

## System architecture

![](../img/NNIDesign.jpg)

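The figure above shows the architecture of NNI. NNIManager is the core management module of the system; it calls TrainingService to manage Trials and handles communication between the other modules. Dispatcher is the message processing center. TrainingService is the module that manages jobs; it communicates with NNIManager and has a different implementation for each platform. Currently, NNI supports the local platform, the [remote platform](RemoteMachineMode.md), the [OpenPAI platform](PaiMode.md), the [Kubeflow platform](KubeflowMode.md), and the [FrameworkController platform](FrameworkController.md).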
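This document briefly introduces the design of TrainingService. To add a new TrainingService, you only need to inherit from the TrainingService class and implement the required methods; you do not need to understand the details of NNIManager, Dispatcher, or the other modules.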

## Folder structure of the code

The folder structure of NNI is as follows:

    nni
      |- deployment
      |- docs
      |- examples
      |- src
      | |- nni_manager
      | | |- common
      | | |- config
      | | |- core
      | | |- coverage
      | | |- dist
      | | |- rest_server
      | | |- training_service
      | | | |- common
      | | | |- kubernetes
      | | | |- local
      | | | |- pai
      | | | |- remote_machine
      | | | |- test
      | |- sdk
      | |- webui
      |- test
      |- tools
      | |- nni_annotation
      | |- nni_cmd
      | |- nni_gpu_tool
      | |- nni_trial_tool
    

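The `nni/src` folder stores most of NNI's source code. The code in this folder relates to NNIManager, TrainingService, the SDK, the WebUI, and other modules. The TrainingService abstract class is defined in `nni/src/nni_manager/common/trainingService.ts`, and users should put their own subclass under the `nni/src/nni_manager/training_service` folder. Users who implement their own TrainingService also need to write unit tests for it and place them in the `nni/src/nni_manager/training_service/test` folder.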

## Explanation of TrainingService functions

    abstract class TrainingService {
        public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
        public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
        public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
        public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
        public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
        public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
        public abstract get isMultiPhaseJobSupported(): boolean;
        public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
        public abstract setClusterMetadata(key: string, value: string): Promise<void>;
        public abstract getClusterMetadata(key: string): Promise<string>;
        public abstract cleanUp(): Promise<void>;
        public abstract run(): Promise<void>;
    }
    

The TrainingService parent class declares several abstract methods. Users need to inherit from it and implement these methods.

**setClusterMetadata(key: string, value: string)**  
ClusterMetadata holds platform-specific data. For example, the ClusterMetadata for the remote platform is defined as:

    export class RemoteMachineMeta {
        public readonly ip : string;
        public readonly port : number;
        public readonly username : string;
        public readonly passwd?: string;
        public readonly sshKeyPath?: string;
        public readonly passphrase?: string;
        public gpuSummary : GPUSummary | undefined;
        /* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU*/
        public gpuReservation : Map<number, string>;
    
        constructor(ip : string, port : number, username : string, passwd : string, 
            sshKeyPath : string, passphrase : string) {
            this.ip = ip;
            this.port = port;
            this.username = username;
            this.passwd = passwd;
            this.sshKeyPath = sshKeyPath;
            this.passphrase = passphrase;
            this.gpuReservation = new Map<number, string>();
        }
    }
    

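The metadata includes the host address, the username, and other platform-specific configuration. Users need to define their own metadata format and handle it in this method. This method is called before the Experiment starts.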

**getClusterMetadata(key: string)**  
This method returns the content of the metadata. If you do not need this method, you can leave its body empty.
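
The two metadata methods could be sketched roughly as below. This is only an illustration under stated assumptions: `MyClusterMeta` and `MetadataStore` are hypothetical names, and the sketch assumes the value is passed in as a JSON string, which may not match every platform.

```typescript
// Hypothetical metadata shape for an imaginary platform; not an NNI type.
interface MyClusterMeta {
    host: string;
    port: number;
    username: string;
}

// Hypothetical helper that mirrors the two metadata method signatures.
class MetadataStore {
    // Parsed metadata, keyed by the metadata key supplied by the caller.
    private readonly metadata = new Map<string, MyClusterMeta>();

    public async setClusterMetadata(key: string, value: string): Promise<void> {
        // Assumption: the value arrives as a JSON string.
        const parsed = JSON.parse(value) as MyClusterMeta;
        if (typeof parsed.host !== "string" || typeof parsed.port !== "number") {
            throw new Error(`Invalid cluster metadata for key "${key}"`);
        }
        this.metadata.set(key, parsed);
    }

    public async getClusterMetadata(key: string): Promise<string> {
        const meta = this.metadata.get(key);
        if (meta === undefined) {
            throw new Error(`No cluster metadata for key "${key}"`);
        }
        return JSON.stringify(meta);
    }
}
```

Validating the metadata when it is set, rather than when a job is submitted, lets a misconfigured experiment fail before any Trial starts.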

**submitTrialJob(form: JobApplicationForm)**  
submitTrialJob submits a new Trial job. In this method, users should create an instance of the TrialJobDetail type. TrialJobDetail is defined as follows:

    interface TrialJobDetail {
        readonly id: string;
        readonly status: TrialJobStatus;
        readonly submitTime: number;
        readonly startTime?: number;
        readonly endTime?: number;
        readonly tags?: string[];
        readonly url?: string;
        readonly workingDirectory: string;
        readonly form: JobApplicationForm;
        readonly sequenceId: number;
        isEarlyStopped?: boolean;
    }
    

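Depending on the implementation, users may put the Trial job into a queue and keep taking jobs out of the queue to submit them. Alternatively, the whole submission process can be completed directly in this method.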
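
The queue-based variant might look like the sketch below. The types are simplified stand-ins for the NNI definitions (for instance, `status` is mutable here), and `QueueingSubmitter`, `launchNext`, the ID scheme, and the working directory are all hypothetical.

```typescript
// Simplified stand-ins for illustration; not the real NNI definitions.
type TrialJobStatus = "WAITING" | "RUNNING";

interface JobApplicationForm {
    readonly jobType: string;
}

interface TrialJobDetail {
    readonly id: string;
    status: TrialJobStatus;
    readonly submitTime: number;
    readonly workingDirectory: string;
    readonly form: JobApplicationForm;
    readonly sequenceId: number;
}

class QueueingSubmitter {
    private readonly jobs = new Map<string, TrialJobDetail>();
    private readonly queue: string[] = [];
    private nextSequenceId = 0;

    // Mirrors submitTrialJob: record the job and enqueue it for later launch.
    public async submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail> {
        const sequenceId = this.nextSequenceId++;
        const id = `trial-${sequenceId}`; // hypothetical ID scheme
        const detail: TrialJobDetail = {
            id,
            status: "WAITING",
            submitTime: Date.now(),
            workingDirectory: `/tmp/my-platform/${id}`, // hypothetical path
            form,
            sequenceId,
        };
        this.jobs.set(id, detail);
        this.queue.push(id);
        return detail;
    }

    // Takes the oldest queued job and marks it as running.
    // A real implementation would start the job on the platform here.
    public launchNext(): TrialJobDetail | undefined {
        const id = this.queue.shift();
        if (id === undefined) {
            return undefined;
        }
        const detail = this.jobs.get(id);
        if (detail !== undefined) {
            detail.status = "RUNNING";
        }
        return detail;
    }
}
```

In this arrangement `submitTrialJob` returns immediately with a `WAITING` job, and something like the `run()` loop would call `launchNext()` periodically to drain the queue.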

**cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)**  
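When this method is called, the Trial should be canceled. Different platforms cancel jobs in different ways, so this method should implement the details appropriate to the platform.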
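
As a rough illustration, cancellation could be delegated to a platform client and then reflected in the stored job record. `PlatformClient`, `killJob`, `TrialRecord`, and `CancellingService` are hypothetical, and the status names are assumptions made for this sketch.

```typescript
// Simplified stand-ins for illustration; not the real NNI definitions.
type TrialJobStatus = "WAITING" | "RUNNING" | "USER_CANCELED" | "EARLY_STOPPED";

interface TrialRecord {
    readonly id: string;
    status: TrialJobStatus;
    isEarlyStopped?: boolean;
}

// Hypothetical client for the target platform.
interface PlatformClient {
    killJob(jobId: string): Promise<void>;
}

class CancellingService {
    private readonly client: PlatformClient;
    private readonly jobs: Map<string, TrialRecord>;

    constructor(client: PlatformClient, jobs: Map<string, TrialRecord>) {
        this.client = client;
        this.jobs = jobs;
    }

    // Mirrors cancelTrialJob: stop the job on the platform, then record why it stopped.
    public async cancelTrialJob(trialJobId: string, isEarlyStopped: boolean = false): Promise<void> {
        const job = this.jobs.get(trialJobId);
        if (job === undefined) {
            throw new Error(`Trial job ${trialJobId} not found`);
        }
        await this.client.killJob(trialJobId);
        job.isEarlyStopped = isEarlyStopped;
        job.status = isEarlyStopped ? "EARLY_STOPPED" : "USER_CANCELED";
    }
}
```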

**updateTrialJob(trialJobId: string, form: JobApplicationForm)**  
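This method updates the status of a Trial. Each platform has its own way of detecting job status, and the status should be updated accordingly to `RUNNING`, `SUCCEED`, `FAILED`, and so on.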

**getTrialJob(trialJobId: string)**  
This method returns the Trial instance that corresponds to the given Trial ID.

**listTrialJobs()**  
In this method, users should put all Trial instances into a list and return it.
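
If the Trial instances are kept in a map keyed by ID, `getTrialJob` and `listTrialJobs` become short lookups. The sketch below assumes such a map; `TrialJobRegistry` and `register` are hypothetical names, and `TrialJobDetail` is reduced to two fields for brevity.

```typescript
// Reduced stand-in for TrialJobDetail; the real interface has more fields.
interface TrialJobDetail {
    readonly id: string;
    readonly status: string;
}

class TrialJobRegistry {
    private readonly jobs = new Map<string, TrialJobDetail>();

    // Hypothetical helper; a real service would add jobs in submitTrialJob.
    public register(detail: TrialJobDetail): void {
        this.jobs.set(detail.id, detail);
    }

    // Mirrors getTrialJob: reject when the ID is unknown.
    public async getTrialJob(trialJobId: string): Promise<TrialJobDetail> {
        const detail = this.jobs.get(trialJobId);
        if (detail === undefined) {
            throw new Error(`Trial job ${trialJobId} not found`);
        }
        return detail;
    }

    // Mirrors listTrialJobs: return every known job in insertion order.
    public async listTrialJobs(): Promise<TrialJobDetail[]> {
        return Array.from(this.jobs.values());
    }
}
```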

**addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**  
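NNI starts an EventEmitter to handle job metrics. When new metric data is detected, the EventEmitter is triggered and runs the corresponding event handlers. Users should start the EventEmitter in this method.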

**removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**  
Removes the EventEmitter.
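
One way to back the two listener methods is Node.js's built-in `EventEmitter`, as in this sketch. The `"metric"` event name, the `MetricDispatcher` class, and `reportMetric` are hypothetical, and `TrialJobMetric` is a stand-in rather than the real NNI type.

```typescript
import { EventEmitter } from "events";

// Stand-in for the NNI metric type, for illustration only.
interface TrialJobMetric {
    readonly id: string;
    readonly data: string;
}

// Hypothetical event name used internally by this sketch.
const METRIC_EVENT = "metric";

class MetricDispatcher {
    private readonly emitter = new EventEmitter();

    public addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void {
        this.emitter.on(METRIC_EVENT, listener);
    }

    public removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void {
        this.emitter.removeListener(METRIC_EVENT, listener);
    }

    // Hypothetical hook: called by the platform-specific code when a new metric arrives.
    public reportMetric(metric: TrialJobMetric): void {
        this.emitter.emit(METRIC_EVENT, metric);
    }
}
```

The same function reference has to be passed to both methods, because `removeListener` matches listeners by identity.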

**run()**  
The run() function is the main loop of TrainingService. Users can run their own logic in a loop inside this function, which keeps looping until the experiment ends.

**cleanUp()**  
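After the experiment ends, this method cleans up the experiment environment. Users should implement the platform-specific cleanup in this method.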
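
The interplay between `run()` and `cleanUp()` can be sketched with a stop flag, as below. This assumes polling is an acceptable strategy for the platform; `LoopingService`, `pollOnce`, `delay`, and the polling interval are hypothetical.

```typescript
// Resolves after the given number of milliseconds.
function delay(ms: number): Promise<void> {
    return new Promise<void>((resolve) => {
        setTimeout(resolve, ms);
    });
}

class LoopingService {
    public iterations = 0;
    private stopping = false;
    private readonly pollIntervalMs: number;

    constructor(pollIntervalMs: number) {
        this.pollIntervalMs = pollIntervalMs;
    }

    // Main loop: keeps polling until cleanUp() requests a stop.
    public async run(): Promise<void> {
        while (!this.stopping) {
            await this.pollOnce();
            await delay(this.pollIntervalMs);
        }
    }

    // Signals the loop to exit; platform-specific cleanup would follow here.
    public async cleanUp(): Promise<void> {
        this.stopping = true;
    }

    // Placeholder for per-iteration work such as launching queued jobs
    // or refreshing job status from the platform.
    private async pollOnce(): Promise<void> {
        this.iterations += 1;
    }
}
```

Because the flag is only checked between iterations, the promise returned by `run()` resolves shortly after `cleanUp()` is called rather than immediately.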

## TrialKeeper tool

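NNI provides the TrialKeeper tool to help maintain Trial jobs. Its source code is in the `nni/tools/nni_trial_tool` folder. It is a good tool for maintaining jobs that run on a cloud platform. The architecture of TrialKeeper is as follows:  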
![](../img/trialkeeper.jpg)  
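To run a job on a remote cloud platform, users pass the job's start command to TrialKeeper and start the TrialKeeper process on that platform. Note that on the remote platform TrialKeeper uses a RESTful service to communicate with TrainingService, so users need to start a RESTful server on the local machine to receive TrialKeeper's requests. The source code of the RESTful server is in the file `nni/src/nni_manager/training_service/common/clusterJobRestServer.ts`.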

## Reference

For more information about how to debug, please [refer to this page](HowToDebug.md).
For information about how to contribute, please [refer to this page](Contributing.md).