OpenDAS / nni
Commit c5acd8c2 (Unverified), authored May 27, 2019 by SparkSnail, committed by GitHub on May 27, 2019

Merge pull request #173 from microsoft/master

merge master

Parents: 40bae6e2, d135d184
Changes: 93
Showing 20 changed files with 200 additions and 89 deletions
docs/zh_CN/assessors.rst +2 -2
docs/zh_CN/automl_practice_sharing.rst +8 -0
docs/zh_CN/builtinTuner.rst +0 -18
docs/zh_CN/builtin_assessor.rst +9 -0
docs/zh_CN/builtin_tuner.rst +18 -0
docs/zh_CN/community_sharings.rst +12 -0
docs/zh_CN/contribution.rst +2 -2
docs/zh_CN/examples.rst +12 -0
docs/zh_CN/index.rst +6 -6
docs/zh_CN/nni_practice_sharing.rst +10 -0
docs/zh_CN/reference.rst +1 -1
docs/zh_CN/training_services.rst +1 -1
docs/zh_CN/tuners.rst +3 -3
docs/zh_CN/tutorials.rst +16 -0
install.ps1 +1 -1
src/nni_manager/common/utils.ts +8 -8
src/nni_manager/core/nnimanager.ts +9 -6
src/nni_manager/training_service/common/util.ts +48 -7
src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts +29 -30
src/nni_manager/training_service/remote_machine/sshClientUtility.ts +5 -4
docs/zh_CN/assessors.rst
View file @
c5acd8c2
...
...
@@ -15,5 +15,5 @@ Assessor 从 Trial 中接收中间结果,并通过指定的算法决定此 Tri
.. toctree::
:maxdepth: 2
-内置 Assessor<builtinAssessor>
-自定义 Assessor<Customize_Assessor>
+内置 Assessor<BuiltinAssessor>
+自定义 Assessor<CustomizeAssessor>
docs/zh_CN/automl_practice_sharing.rst
0 → 100644
View file @
c5acd8c2
#################
自动机器学习的经验分享
#################
.. toctree::
:maxdepth: 2
神经网络架构搜索的对比<CommunitySharings/AutomlPracticeSharing/NasComparison>
docs/zh_CN/builtinTuner.rst
deleted
100644 → 0
View file @
40bae6e2
内置 Tuner
==================
.. toctree::
:maxdepth: 1
介绍<Builtin_Tuner>
TPE<hyperoptTuner>
Random Search<hyperoptTuner>
Anneal<hyperoptTuner>
Naive Evolution<evolutionTuner>
SMAC<smacTuner>
Batch Tuner<batchTuner>
Grid Search<gridsearchTuner>
Hyperband<hyperbandAdvisor>
Network Morphism<networkmorphismTuner>
Metis Tuner<metisTuner>
BOHB<bohbAdvisor>
\ No newline at end of file
docs/zh_CN/builtinAssessor.rst → docs/zh_CN/builtin_assessor.rst
View file @
c5acd8c2
...
...
@@ -4,6 +4,6 @@
.. toctree::
:maxdepth: 1
-介绍<Builtin_Assessors>
-Medianstop<medianstopAssessor>
-Curvefitting<curvefittingAssessor>
\ No newline at end of file
+介绍<BuiltinAssessors>
+Medianstop<MedianstopAssessor>
+Curvefitting<CurvefittingAssessor>
\ No newline at end of file
docs/zh_CN/builtin_tuner.rst
0 → 100644
View file @
c5acd8c2
内置 Tuner
==================
.. toctree::
:maxdepth: 1
介绍<BuiltinTuner>
TPE<HyperoptTuner>
Random Search<HyperoptTuner>
Anneal<HyperoptTuner>
Naive Evolution<EvolutionTuner>
SMAC<SmacTuner>
Batch Tuner<BatchTuner>
Grid Search<GridsearchTuner>
Hyperband<HyperbandAdvisor>
Network Morphism<NetworkmorphismTuner>
Metis Tuner<MetisTuner>
BOHB<BohbAdvisor>
\ No newline at end of file
docs/zh_CN/community_sharings.rst
0 → 100644
View file @
c5acd8c2
######################
社区分享
######################
除了官方的教程和示例之外,也支持社区贡献者分享自己的自动机器学习实践经验,特别是使用 NNI 的实践经验。
.. toctree::
:maxdepth: 2
NNI 经验分享<nni_practice_sharing>
神经网络结构搜索的对比<CommunitySharings/NasComparison>
超参调优算法的对比<CommunitySharings/HpoComparison>
docs/zh_CN/Contribution.rst → docs/zh_CN/contribution.rst
View file @
c5acd8c2
...
...
@@ -3,5 +3,5 @@
###############################
.. toctree::
-设置开发环境<SetupNNIDeveloperEnvironment>
-贡献指南<CONTRIBUTING>
\ No newline at end of file
+设置开发环境<SetupNniDeveloperEnvironment>
+贡献指南<Contributing>
\ No newline at end of file
docs/zh_CN/Examples.rst → docs/zh_CN/examples.rst
View file @
c5acd8c2
...
...
@@ -5,8 +5,8 @@
.. toctree::
:maxdepth: 2
-MNIST<mnist_examples>
-Cifar10<cifar10_examples>
-Scikit-learn<sklearn_examples>
-EvolutionSQuAD<SQuAD_evolution_examples>
-GBDT<gbdt_example>
+MNIST<MnistExamples>
+Cifar10<Cifar10Examples>
+Scikit-learn<SklearnExamples>
+EvolutionSQuAD<SquadEvolutionExamples>
+GBDT<GbdtExample>
docs/zh_CN/index.rst
View file @
c5acd8c2
...
...
@@ -13,10 +13,10 @@ Neural Network Intelligence(NNI)文档
概述<Overview>
入门<QuickStart>
-教程<Tutorials>
-样例<Examples>
-参考<Reference>
+教程<tutorials>
+示例<examples>
+参考<reference>
常见问答<FAQ>
-贡献<Contribution>
-版本日志<RELEASE>
-博客<Blog/index>
+贡献<contribution>
+更改日志<Release>
+社区经验分享<community_sharings>
docs/zh_CN/nni_practice_sharing.rst
0 → 100644
View file @
c5acd8c2
#################
教程
#################
分享使用 NNI 来调优模型和系统的经验
.. toctree::
:maxdepth: 2
在 NNI 上调优 Recommenders 的 SVD<CommunitySharings/NniPracticeSharing/RecommendersSvd>
\ No newline at end of file
docs/zh_CN/Reference.rst → docs/zh_CN/reference.rst
View file @
c5acd8c2
...
...
@@ -4,7 +4,7 @@
.. toctree::
:maxdepth: 3
-命令行<NNICTLDOC>
+命令行<Nnictl>
Python API<sdk_reference>
Annotation<AnnotationSpec>
配置<ExperimentConfig>
...
...
docs/zh_CN/training_services.rst
View file @
c5acd8c2
...
...
@@ -4,6 +4,6 @@ NNI 支持的训练平台介绍
.. toctree::
本机<LocalMode>
远程<RemoteMachineMode>
-OpenPAI<PAIMode>
+OpenPAI<PaiMode>
Kubeflow<KubeflowMode>
FrameworkController<FrameworkControllerMode>
\ No newline at end of file
docs/zh_CN/tuners.rst
View file @
c5acd8c2
...
...
@@ -13,6 +13,6 @@ Tuner 从 Trial 接收指标结果,来评估一组超参或网络结构的性
.. toctree::
:maxdepth: 2
-内置 Tuner<builtinTuner>
-自定义 Tuner<Customize_Tuner>
-自定义 Advisor<Customize_Advisor>
\ No newline at end of file
+内置 Tuner<BuiltinTuner>
+自定义 Tuner<CustomizeTuner>
+自定义 Advisor<CustomizeAdvisor>
\ No newline at end of file
docs/zh_CN/tutorials.rst
0 → 100644
View file @
c5acd8c2
######################
教程
######################
.. toctree::
:maxdepth: 2
安装<Installation>
实现 Trial<Trials>
Tuner<tuners>
Assessor<assessors>
Web 界面<WebUI>
训练平台<training_services>
如何使用 Docker <HowToUseDocker>
高级功能<advanced>
如何调试<HowToDebug>
\ No newline at end of file
install.ps1
View file @
c5acd8c2
...
...
@@ -15,7 +15,7 @@ $yarnUrl = "https://yarnpkg.com/latest.tar.gz"
 $unzipNodeDir = "node-v*"
 $unzipYarnDir = "yarn-v*"
-$NNI_DEPENDENCY_FOLDER = "C:\tmp\$env:USERNAME"
+$NNI_DEPENDENCY_FOLDER = [System.IO.Path]::GetTempPath()+$env:USERNAME

 $WHICH_PYTHON = where.exe python
 if($WHICH_PYTHON -eq $null){
...
...
src/nni_manager/common/utils.ts
View file @
c5acd8c2
...
...
@@ -43,11 +43,11 @@ function getExperimentRootDir(): string {
             .getLogDir();
 }

-function getLogDir(): string{
+function getLogDir(): string {
     return path.join(getExperimentRootDir(), 'log');
 }

-function getLogLevel(): string{
+function getLogLevel(): string {
     return getExperimentStartupInfo()
             .getLogLevel();
 }
...
...
@@ -149,7 +149,7 @@ function parseArg(names: string[]): string {
     return '';
 }

-function encodeCmdLineArgs(args:any):any{
+function encodeCmdLineArgs(args: any): any {
     if(process.platform === 'win32'){
         return JSON.stringify(args);
     }
...
...
@@ -158,7 +158,7 @@ function encodeCmdLineArgs(args:any):any{
     }
 }

-function getCmdPy(): string{
+function getCmdPy(): string {
     let cmd = 'python3';
     if(process.platform === 'win32'){
         cmd = 'python';
...
...
@@ -390,7 +390,7 @@ async function getVersion(): Promise<string> {
 /**
  * run command as ChildProcess
  */
-function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess{
+function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess {
     let cmd: string = command;
     let arg: string[] = [];
     let newShell: boolean = true;
...
...
@@ -411,7 +411,7 @@ function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newE
 /**
  * judge whether the process is alive
  */
-async function isAlive(pid:any): Promise<boolean>{
+async function isAlive(pid: any): Promise<boolean> {
     let deferred: Deferred<boolean> = new Deferred<boolean>();
     let alive: boolean = false;
     if(process.platform === 'win32'){
...
...
@@ -439,7 +439,7 @@ async function isAlive(pid:any): Promise<boolean>{
 /**
  * kill process
  */
-async function killPid(pid:any): Promise<void>{
+async function killPid(pid: any): Promise<void> {
     let deferred: Deferred<void> = new Deferred<void>();
     try {
         if (process.platform === "win32") {
...
...
@@ -455,7 +455,7 @@ async function killPid(pid:any): Promise<void>{
     return deferred.promise;
 }

-function getNewLine(): string{
+function getNewLine(): string {
     if (process.platform === "win32") {
         return "\r\n";
     }
...
...
src/nni_manager/core/nnimanager.ts
View file @
c5acd8c2
...
...
@@ -58,7 +58,8 @@ class NNIManager implements Manager {
     private status: NNIManagerStatus;
     private waitingTrials: string[];
     private trialJobs: Map<string, TrialJobDetail>;
+    private trialJobMetricListener: (metric: TrialJobMetric) => void;

     constructor() {
         this.currSubmittedTrialNum = 0;
         this.trialConcurrencyChange = 0;
...
...
@@ -76,6 +77,11 @@ class NNIManager implements Manager {
             status: 'INITIALIZED',
             errors: []
         };
+        this.trialJobMetricListener = (metric: TrialJobMetric) => {
+            this.onTrialJobMetrics(metric).catch((err: Error) => {
+                this.criticalError(NNIError.FromError(err, 'Job metrics error: '));
+            });
+        };
     }

     public updateExperimentProfile(experimentProfile: ExperimentProfile, updateType: ProfileUpdateType): Promise<void> {
...
...
@@ -342,6 +348,7 @@ class NNIManager implements Manager {
         if (this.dispatcher === undefined) {
             throw new Error('Error: tuner has not been setup');
         }
+        this.trainingService.removeTrialJobMetricListener(this.trialJobMetricListener);
         this.dispatcher.sendCommand(TERMINATE);
         let tunerAlive: boolean = true;
         // gracefully terminate tuner and assessor here, wait at most 30 seconds.
...
...
@@ -589,11 +596,7 @@ class NNIManager implements Manager {
         if (this.dispatcher === undefined) {
             throw new Error('Error: tuner or job maintainer have not been setup');
         }
-        this.trainingService.addTrialJobMetricListener((metric: TrialJobMetric) => {
-            this.onTrialJobMetrics(metric).catch((err: Error) => {
-                this.criticalError(NNIError.FromError(err, 'Job metrics error: '));
-            });
-        });
+        this.trainingService.addTrialJobMetricListener(this.trialJobMetricListener);

         this.dispatcher.onCommand((commandType: string, content: string) => {
             this.onTunerCommand(commandType, content).catch((err: Error) => {
...
...
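The nnimanager.ts hunks above move the metric callback out of an inline arrow function and into the `trialJobMetricListener` field, so that the same function object can later be handed to `removeTrialJobMetricListener`. A minimal sketch of why the shared reference matters follows. `FakeTrainingService` and `FakeManager` are hypothetical stand-ins written for this illustration and are not NNI classes; the real training services may store listeners differently.

```typescript
// Hypothetical stand-ins; these are not NNI's real classes.
interface TrialJobMetric {
    id: string;
    data: string;
}
type MetricListener = (metric: TrialJobMetric) => void;

class FakeTrainingService {
    private listeners: MetricListener[] = [];

    public addTrialJobMetricListener(listener: MetricListener): void {
        this.listeners.push(listener);
    }

    public removeTrialJobMetricListener(listener: MetricListener): void {
        // Identity comparison: only the very same function object is removed.
        this.listeners = this.listeners.filter((l: MetricListener) => l !== listener);
    }

    public emit(metric: TrialJobMetric): void {
        for (const listener of this.listeners) {
            listener(metric);
        }
    }

    public count(): number {
        return this.listeners.length;
    }
}

class FakeManager {
    public received: string[] = [];
    private readonly listener: MetricListener;
    private readonly trainingService: FakeTrainingService;

    constructor(trainingService: FakeTrainingService) {
        this.trainingService = trainingService;
        // Keep one reference so start() and stop() pass the same object.
        this.listener = (metric: TrialJobMetric): void => {
            this.received.push(metric.id);
        };
    }

    public start(): void {
        this.trainingService.addTrialJobMetricListener(this.listener);
    }

    public stop(): void {
        this.trainingService.removeTrialJobMetricListener(this.listener);
    }
}

const demoService: FakeTrainingService = new FakeTrainingService();
const demoManager: FakeManager = new FakeManager(demoService);
demoManager.start();
demoService.emit({ id: 'a', data: '0.9' });
demoManager.stop();
demoService.emit({ id: 'b', data: '0.8' }); // no listener is registered any more
```

With an inline arrow function passed directly to the add call, there would be no reference left to pass to the remove call, which is presumably what the field introduced in this commit addresses.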
src/nni_manager/training_service/common/util.ts
View file @
c5acd8c2
...
...
@@ -24,7 +24,10 @@ import { getLogger } from "common/log";
 import { countFilesRecursively } from '../../common/utils'
 import * as cpp from 'child-process-promise';
 import * as cp from 'child_process';
-import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData'
+import * as os from 'os';
+import * as fs from 'fs';
+import { getNewLine } from '../../common/utils';
+import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData';
 import * as path from 'path';
 import { String } from 'typescript-string-operations';
 import { file } from "../../node_modules/@types/tmp";
...
...
@@ -66,6 +69,20 @@ export async function execMkdir(directory: string): Promise<void> {
     return Promise.resolve();
 }

+/**
+ * copy files to the directory
+ * @param source
+ * @param destination
+ */
+export async function execCopydir(source: string, destination: string): Promise<void> {
+    if (process.platform === 'win32') {
+        await cpp.exec(`powershell.exe Copy-Item ${source} -Destination ${destination} -Recurse`);
+    } else {
+        await cpp.exec(`cp -r ${source} ${destination}`);
+    }
+    return Promise.resolve();
+}
+
 /**
  * crete a new file
  * @param filename
...
...
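The helpers in this file all follow the same pattern: choose a shell command from `process.platform`, then run it with `cpp.exec`. The sketch below isolates just the command-selection step so it can be checked without spawning a process. `buildCopyCommand` and `buildRemoveCommand` are hypothetical names introduced for this illustration; the command strings mirror the ones visible in the `execCopydir` and `execRemove` hunks of this commit.

```typescript
// Hypothetical helpers: they only build the command strings and never execute them.
type Platform = 'win32' | 'linux' | 'darwin';

function buildCopyCommand(platform: Platform, source: string, destination: string): string {
    if (platform === 'win32') {
        // Same PowerShell invocation as the execCopydir hunk.
        return `powershell.exe Copy-Item ${source} -Destination ${destination} -Recurse`;
    }
    return `cp -r ${source} ${destination}`;
}

function buildRemoveCommand(platform: Platform, directory: string): string {
    if (platform === 'win32') {
        // Same PowerShell invocation as the updated execRemove hunk.
        return `powershell.exe Remove-Item ${directory} -Recurse -Force`;
    }
    return `rm -rf ${directory}`;
}

const linuxCopy: string = buildCopyCommand('linux', '/tmp/src', '/tmp/dst');
const windowsRemove: string = buildRemoveCommand('win32', 'C:\\Temp\\nni');
```

Neither branch quotes its arguments, so in this sketch a path containing spaces would be split by the shell; callers would need to quote or escape such paths themselves.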
@@ -91,8 +108,6 @@ export function execScript(filePath: string): cp.ChildProcess {
}
}
/**
* output the last line of a file
* @param filePath
...
...
@@ -111,9 +126,9 @@ export async function execTail(filePath: string): Promise<cpp.childProcessPromis
  * delete a directory
  * @param directory
  */
-export async function execRemove(directory: string): Promise<void>{
+export async function execRemove(directory: string): Promise<void> {
     if (process.platform === 'win32') {
-        await cpp.exec(`powershell.exe Remove-Item ${directory}`);
+        await cpp.exec(`powershell.exe Remove-Item ${directory} -Recurse -Force`);
     } else {
         await cpp.exec(`rm -rf ${directory}`);
     }
...
...
@@ -124,7 +139,7 @@ export async function execRemove(directory: string): Promise<void>{
  * kill a process
  * @param directory
  */
-export async function execKill(pid: string): Promise<void>{
+export async function execKill(pid: string): Promise<void> {
     if (process.platform === 'win32') {
         await cpp.exec(`cmd /c taskkill /PID ${pid} /T /F`);
     } else {
...
...
@@ -138,7 +153,7 @@ export async function execKill(pid: string): Promise<void>{
  * @param variable
  * @returns command string
  */
-export function setEnvironmentVariable(variable: { key: string; value: string }): string{
+export function setEnvironmentVariable(variable: { key: string; value: string }): string {
     if (process.platform === 'win32') {
         return `$env:${variable.key}="${variable.value}"`;
     }
...
...
@@ -147,6 +162,32 @@ export function setEnvironmentVariable(variable: { key: string; value: string })
     }
 }

+/**
+ * Compress files in directory to tar file
+ * @param source_path
+ * @param tar_path
+ */
+export async function tarAdd(tar_path: string, source_path: string): Promise<void> {
+    if (process.platform === 'win32') {
+        tar_path = tar_path.split('\\').join('\\\\');
+        source_path = source_path.split('\\').join('\\\\');
+        let script: string[] = [];
+        script.push(
+            `import os`,
+            `import tarfile`,
+            String.Format(`tar = tarfile.open("{0}","w:gz")\r\nfor root,dir,files in os.walk("{1}"):`, tar_path, source_path),
+            `    for file in files:`,
+            `        fullpath = os.path.join(root,file)`,
+            `        tar.add(fullpath, arcname=file)`,
+            `tar.close()`);
+        await fs.promises.writeFile(path.join(os.tmpdir(), 'tar.py'), script.join(getNewLine()), { encoding: 'utf8', mode: 0o777 });
+        const tarScript: string = path.join(os.tmpdir(), 'tar.py');
+        await cpp.exec(`python ${tarScript}`);
+    } else {
+        await cpp.exec(`tar -czf ${tar_path} -C ${source_path} .`);
+    }
+    return Promise.resolve();
+}
+
 /**
  * generate script file name
...
...
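On Windows, `tarAdd` writes a small Python script to the temp directory instead of calling `tar`. Because the paths are embedded in Python string literals, each backslash is doubled first; otherwise a sequence such as `\t` in `C:\tmp` would be read by Python as a tab. The sketch below reproduces only the script-building step. `escapeBackslashes` and `buildTarScript` are hypothetical names for this illustration, and the 4 and 8 space indentation of the Python body is an assumption, since the page does not preserve whitespace.

```typescript
// Hypothetical helpers that build the Python script text without writing or running it.
function escapeBackslashes(p: string): string {
    // Double every backslash so the path survives inside a Python string literal.
    return p.split('\\').join('\\\\');
}

function buildTarScript(tarPath: string, sourcePath: string, newline: string): string {
    const lines: string[] = [
        'import os',
        'import tarfile',
        `tar = tarfile.open("${escapeBackslashes(tarPath)}","w:gz")`,
        `for root,dir,files in os.walk("${escapeBackslashes(sourcePath)}"):`,
        '    for file in files:',
        '        fullpath = os.path.join(root,file)',
        // arcname=file stores only the bare file name inside the archive.
        '        tar.add(fullpath, arcname=file)',
        'tar.close()'
    ];
    return lines.join(newline);
}

const tarScriptText: string = buildTarScript('C:\\tmp\\out.tar.gz', 'C:\\tmp\\code', '\r\n');
const tarScriptLines: string[] = tarScriptText.split('\r\n');
```

Since `arcname` is the bare file name, files found in nested directories end up at the archive root in this script, unlike the `tar -czf ... -C ... .` branch, which keeps the relative directory layout.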
src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
View file @
c5acd8c2
...
...
@@ -36,7 +36,7 @@ import { ObservableTimer } from '../../common/observableTimer';
 import {
     HostJobApplicationForm, HyperParameters, JobApplicationForm, TrainingService, TrialJobApplicationForm, TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
 } from '../../common/trainingService';
-import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir, getIPV4Address } from '../../common/utils';
+import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir, getIPV4Address, getVersion, unixPathJoin } from '../../common/utils';
 import { GPUSummary } from '../common/gpuData';
 import { TrialConfig } from '../common/trialConfig';
 import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey';
...
...
@@ -48,10 +48,9 @@ import {
 } from './remoteMachineData';
 import { GPU_INFO_COLLECTOR_FORMAT_LINUX } from '../common/gpuData';
 import { SSHClientUtility } from './sshClientUtility';
-import { validateCodeDir } from '../common/util';
+import { validateCodeDir, execRemove, execMkdir, execCopydir } from '../common/util';
 import { RemoteMachineJobRestServer } from './remoteMachineJobRestServer';
 import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
-import { mkDirP, getVersion } from '../../common/utils';

 /**
  * Training Service implementation for Remote Machine (Linux)
...
...
@@ -234,7 +233,7 @@ class RemoteMachineTrainingService implements TrainingService {
         } else if (form.jobType === 'TRIAL') {
             // Generate trial job id(random)
             const trialJobId: string = uniqueString(5);
-            const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+            const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);

             const trialJobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
                 trialJobId,
...
...
@@ -354,7 +353,7 @@ class RemoteMachineTrainingService implements TrainingService {
             case TrialConfigMetadataKey.MACHINE_LIST:
                 await this.setupConnections(value);
                 //remove local temp files
-                await cpp.exec(`rm -rf ${this.getLocalGpuMetricCollectorDir()}`);
+                await execRemove(this.getLocalGpuMetricCollectorDir());
                 break;
             case TrialConfigMetadataKey.TRIAL_CONFIG:
                 const remoteMachineTrailConfig: TrialConfig = <TrialConfig>JSON.parse(value);
...
...
@@ -417,7 +416,7 @@ class RemoteMachineTrainingService implements TrainingService {
     private async cleanupConnections(): Promise<void> {
         try {
             for (const [rmMeta, sshClientManager] of this.machineSSHClientMap.entries()) {
-                let jobpidPath: string = path.join(this.getRemoteScriptsPath(rmMeta.username), 'pid');
+                let jobpidPath: string = unixPathJoin(this.getRemoteScriptsPath(rmMeta.username), 'pid');
                 let client: Client | undefined = sshClientManager.getFirstSSHClient();
                 if (client) {
                     await SSHClientUtility.remoteExeCommand(`pkill -P \`cat ${jobpidPath}\``, client);
...
...
@@ -438,7 +437,7 @@ class RemoteMachineTrainingService implements TrainingService {
      */
     private getLocalGpuMetricCollectorDir(): string {
         let userName: string = path.basename(os.homedir()); //get current user name of os
-        return `${os.tmpdir()}/${userName}/nni/scripts/`;
+        return path.join(os.tmpdir(), userName, 'nni', 'scripts');
     }

     /**
...
...
@@ -447,14 +446,14 @@ class RemoteMachineTrainingService implements TrainingService {
      */
     private async generateGpuMetricsCollectorScript(userName: string): Promise<void> {
         let gpuMetricCollectorScriptFolder: string = this.getLocalGpuMetricCollectorDir();
-        await cpp.exec(`mkdir -p ${path.join(gpuMetricCollectorScriptFolder, userName)}`);
+        await execMkdir(path.join(gpuMetricCollectorScriptFolder, userName));
         //generate gpu_metrics_collector.sh
         let gpuMetricsCollectorScriptPath: string = path.join(gpuMetricCollectorScriptFolder, userName, 'gpu_metrics_collector.sh');
         const remoteGPUScriptsDir: string = this.getRemoteScriptsPath(userName); // This directory is used to store gpu_metrics and pid created by script
         const gpuMetricsCollectorScriptContent: string = String.Format(
             GPU_INFO_COLLECTOR_FORMAT_LINUX,
             remoteGPUScriptsDir,
-            path.join(remoteGPUScriptsDir, 'pid'),
+            unixPathJoin(remoteGPUScriptsDir, 'pid'),
         );
         await fs.promises.writeFile(gpuMetricsCollectorScriptPath, gpuMetricsCollectorScriptContent, { encoding: 'utf8' });
     }
...
...
@@ -481,7 +480,7 @@ class RemoteMachineTrainingService implements TrainingService {
     private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> {
         // Create root working directory after ssh connection is ready
         await this.generateGpuMetricsCollectorScript(rmMeta.username); //generate gpu script in local machine first, will copy to remote machine later
-        const nniRootDir: string = `${os.tmpdir()}/nni`;
+        const nniRootDir: string = unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni');
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${this.remoteExpRootDir}`, conn);

         // Copy NNI scripts to remote expeirment working directory
...
...
@@ -490,15 +489,15 @@ class RemoteMachineTrainingService implements TrainingService {
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn);
         await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn);
         //copy gpu_metrics_collector.sh to remote
-        await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);
+        await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);

         //Begin to execute gpu_metrics_collection scripts
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);

         this.timer.subscribe(
             async (tick: number) => {
                 const cmdresult: RemoteCommandResult = await SSHClientUtility.remoteExeCommand(
-                    `tail -n 1 ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn);
+                    `tail -n 1 ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn);
                 if (cmdresult && cmdresult.stdout) {
                     rmMeta.gpuSummary = <GPUSummary>JSON.parse(cmdresult.stdout);
                 }
...
...
@@ -531,7 +530,7 @@ class RemoteMachineTrainingService implements TrainingService {
         } else if (rmScheduleResult.resultType === ScheduleResultType.SUCCEED
             && rmScheduleResult.scheduleInfo !== undefined) {
             const rmScheduleInfo: RemoteMachineScheduleInfo = rmScheduleResult.scheduleInfo;
-            const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+            const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);

             trialJobDetail.rmMeta = rmScheduleInfo.rmMeta;
...
...
@@ -575,7 +574,7 @@ class RemoteMachineTrainingService implements TrainingService {
         const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);

         await SSHClientUtility.remoteExeCommand(`mkdir -p ${trialWorkingFolder}`, sshClient);
-        await SSHClientUtility.remoteExeCommand(`mkdir -p ${path.join(trialWorkingFolder, '.nni')}`, sshClient);
+        await SSHClientUtility.remoteExeCommand(`mkdir -p ${unixPathJoin(trialWorkingFolder, '.nni')}`, sshClient);

         // RemoteMachineRunShellFormat is the run shell format string,
         // See definition in remoteMachineData.ts
...
...
@@ -603,20 +602,20 @@ class RemoteMachineTrainingService implements TrainingService {
             getExperimentId(),
             trialJobDetail.sequenceId.toString(),
             this.isMultiPhase,
-            path.join(trialWorkingFolder, '.nni', 'jobpid'),
+            unixPathJoin(trialWorkingFolder, '.nni', 'jobpid'),
             command,
             nniManagerIp,
             this.remoteRestServerPort,
             version,
             this.logCollection,
-            path.join(trialWorkingFolder, '.nni', 'code')
+            unixPathJoin(trialWorkingFolder, '.nni', 'code')
         )

         //create tmp trial working folder locally.
-        await cpp.exec(`mkdir -p ${path.join(trialLocalTempFolder, '.nni')}`);
+        await execMkdir(path.join(trialLocalTempFolder, '.nni'));

         //create tmp trial working folder locally.
-        await cpp.exec(`cp -r ${this.trialConfig.codeDir}/* ${trialLocalTempFolder}`);
+        await execCopydir(path.join(this.trialConfig.codeDir, '*'), trialLocalTempFolder);
         const installScriptContent: string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
         // Write NNI installation file to local tmp files
         await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, { encoding: 'utf8' });
...
...
@@ -626,7 +625,7 @@ class RemoteMachineTrainingService implements TrainingService {
         // Copy files in codeDir to remote working directory
         await SSHClientUtility.copyDirectoryToRemote(trialLocalTempFolder, trialWorkingFolder, sshClient, this.remoteOS);
         // Execute command in remote machine
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(trialWorkingFolder, 'run.sh')}`, sshClient);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(trialWorkingFolder, 'run.sh')}`, sshClient);
     }

     private async runHostJob(form: HostJobApplicationForm): Promise<TrialJobDetail> {
...
...
@@ -646,8 +645,8 @@ class RemoteMachineTrainingService implements TrainingService {
         );
         await fs.promises.writeFile(path.join(localDir, 'run.sh'), runScriptContent, { encoding: 'utf8' });
         await SSHClientUtility.copyFileToRemote(
-            path.join(localDir, 'run.sh'), path.join(remoteDir, 'run.sh'), sshClient);
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteDir, 'run.sh')}`, sshClient);
+            path.join(localDir, 'run.sh'), unixPathJoin(remoteDir, 'run.sh'), sshClient);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteDir, 'run.sh')}`, sshClient);

         const jobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
             jobId, 'RUNNING', Date.now(), remoteDir, form, this.generateSequenceId()
...
...
@@ -672,7 +671,7 @@ class RemoteMachineTrainingService implements TrainingService {
     private async updateTrialJobStatus(trialJob: RemoteMachineTrialJobDetail, sshClient: Client): Promise<TrialJobDetail> {
         const deferred: Deferred<TrialJobDetail> = new Deferred<TrialJobDetail>();
         const jobpidPath: string = this.getJobPidPath(trialJob.id);
-        const trialReturnCodeFilePath: string = path.join(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code');
+        const trialReturnCodeFilePath: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code');
         try {
             const killResult: number = (await SSHClientUtility.remoteExeCommand(`kill -0 \`cat ${jobpidPath}\``, sshClient)).exitCode;
             // if the process of jobpid is not alive any more
...
...
@@ -712,15 +711,15 @@ class RemoteMachineTrainingService implements TrainingService {
     }

     private getRemoteScriptsPath(userName: string): string {
-        return path.join(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
+        return unixPathJoin(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
     }

     private getHostJobRemoteDir(jobId: string): string {
-        return path.join(this.remoteExpRootDir, 'hostjobs', jobId);
+        return unixPathJoin(this.remoteExpRootDir, 'hostjobs', jobId);
     }

     private getRemoteExperimentRootDir(): string {
-        return path.join(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId());
+        return unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId());
     }

     public get MetricsEmitter(): EventEmitter {
...
...
@@ -735,9 +734,9 @@ class RemoteMachineTrainingService implements TrainingService {
         let jobpidPath: string;
         if (trialJobDetail.form.jobType === 'TRIAL') {
-            jobpidPath = path.join(trialJobDetail.workingDirectory, '.nni', 'jobpid');
+            jobpidPath = unixPathJoin(trialJobDetail.workingDirectory, '.nni', 'jobpid');
         } else if (trialJobDetail.form.jobType === 'HOST') {
-            jobpidPath = path.join(this.getHostJobRemoteDir(jobId), 'jobpid');
+            jobpidPath = unixPathJoin(this.getHostJobRemoteDir(jobId), 'jobpid');
         } else {
             throw new Error(`Job type not supported: ${trialJobDetail.form.jobType}`);
         }
...
...
@@ -751,14 +750,14 @@ class RemoteMachineTrainingService implements TrainingService {
             throw new Error('sshClient is undefined.');
         }

-        const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+        const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
         const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);

         const fileName: string = generateParamFileName(hyperParameters);
         const localFilepath: string = path.join(trialLocalTempFolder, fileName);
         await fs.promises.writeFile(localFilepath, hyperParameters.value, { encoding: 'utf8' });

-        await SSHClientUtility.copyFileToRemote(localFilepath, path.join(trialWorkingFolder, fileName), sshClient);
+        await SSHClientUtility.copyFileToRemote(localFilepath, unixPathJoin(trialWorkingFolder, fileName), sshClient);
     }

     private generateSequenceId(): number {
...
...
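Throughout remoteMachineTrainingService.ts, every path that is evaluated on the remote Linux machine switches from `path.join` to `unixPathJoin`, while paths on the local machine keep `path.join`. The reason is that `path.join` uses the separator of the platform the manager runs on, so on a Windows host it would produce backslash paths that the remote shell cannot use. The definition of `unixPathJoin` is not part of the hunks shown on this page; the sketch below is only a guess at what such a helper could look like, under the hypothetical name `unixPathJoinSketch`.

```typescript
// Hypothetical sketch; the real unixPathJoin in common/utils may differ.
function unixPathJoinSketch(...parts: string[]): string {
    // Drop empty segments, then join with forward slashes regardless of host OS.
    const joined: string = parts.filter((p: string) => p !== '').join('/');
    if (joined === '') {
        return '.';
    }
    // Collapse runs of slashes produced by segments that already end in '/'.
    return joined.replace(/\/{2,}/g, '/');
}

const remoteScripts: string = unixPathJoinSketch('/tmp', 'alice', 'nni', 'scripts');
const remotePid: string = unixPathJoinSketch('/tmp/', 'nni', 'pid');
```

This sketch does not normalize `.` or `..` segments and does not convert backslashes inside individual segments; it only guarantees that the separator it adds is `/`.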
src/nni_manager/training_service/remote_machine/sshClientUtility.ts
View file @
c5acd8c2
...
...
@@ -28,8 +28,9 @@ import * as stream from 'stream';
 import { Deferred } from 'ts-deferred';
 import { NNIError, NNIErrorNames } from '../../common/errors';
 import { getLogger, Logger } from '../../common/log';
-import { uniqueString, getRemoteTmpDir } from '../../common/utils';
+import { uniqueString, getRemoteTmpDir, unixPathJoin } from '../../common/utils';
 import { RemoteCommandResult } from './remoteMachineData';
+import { execRemove, tarAdd } from '../common/util';

 /**
  *
...
...
@@ -47,13 +48,13 @@ export namespace SSHClientUtility {
         const deferred: Deferred<void> = new Deferred<void>();
         const tmpTarName: string = `${uniqueString(10)}.tar.gz`;
         const localTarPath: string = path.join(os.tmpdir(), tmpTarName);
-        const remoteTarPath: string = path.join(getRemoteTmpDir(remoteOS), tmpTarName);
+        const remoteTarPath: string = unixPathJoin(getRemoteTmpDir(remoteOS), tmpTarName);

         // Compress files in local directory to experiment root directory
-        await cpp.exec(`tar -czf ${localTarPath} -C ${localDirectory} .`);
+        await tarAdd(localTarPath, localDirectory);
         // Copy the compressed file to remoteDirectory and delete it
         await copyFileToRemote(localTarPath, remoteTarPath, sshClient);
-        await cpp.exec(`rm ${localTarPath}`);
+        await execRemove(localTarPath);
         // Decompress the remote compressed file in and delete it
         await remoteExeCommand(`tar -oxzf ${remoteTarPath} -C ${remoteDirectory}`, sshClient);
         await remoteExeCommand(`rm ${remoteTarPath}`, sshClient);
...
...